Handle sentence boundaries from multiple components #4775

adrianeboyd · 2019-12-05T20:26:07Z

Feature description

Decide how to handle is_sentenced and sentence boundaries that may come from multiple components (Sentencizer, SentenceRecognizer, Parser).

Some ideas:

- have an is_sentenced property more like is_parsed that can be set by components
- have a way to set finalized sentence boundaries (all 0 to -1):
  - have an extra option for each component
  - have an extra pipeline component (e.g., finalize_sentences?) that can be inserted at the right point in the pipeline
- also have a component that resets all sentence boundaries?
- modify Sentencizer to only set sentence starts, not all tokens?
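As a rough illustration of the "finalize" and "reset" ideas, here is a minimal sketch that models the per-token boundary values as a plain Python list rather than a real Doc. The function names `finalize_sentences` and `reset_sentences` are hypothetical and are not existing spaCy components; the mapping of None/True/False to 0/1/-1 is assumed from the integer values mentioned in the list.

```python
from typing import List, Optional

# Ternary boundary value, assumed to correspond to the integer sent_start:
#   True  -> 1  (sentence start)
#   False -> -1 (explicitly not a sentence start)
#   None  -> 0  (undecided)
Ternary = Optional[bool]


def finalize_sentences(starts: List[Ternary]) -> List[Ternary]:
    """Turn every undecided value (None) into an explicit False."""
    return [False if s is None else s for s in starts]


def reset_sentences(starts: List[Ternary]) -> List[Ternary]:
    """Forget all earlier decisions so later components start from scratch."""
    return [None] * len(starts)


starts = [True, None, None, True, None]
finalized = finalize_sentences(starts)
cleared = reset_sentences(starts)
```

In a real pipeline the same logic would run over the tokens of a Doc at whatever position the pipeline author chooses.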

Check that no spaCy components clobber sentence boundaries and that is_sentenced works consistently when sentence boundaries come from multiple sources. If a component after the parser changes sentence boundaries, make sure the required tree recalculations are done (a related issue: #4497).

Potentially add warnings when non-zero sent_start is changed by any component?
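A sketch of what such a warning could look like, again on a plain list of ternary values instead of a Doc. The helper `set_sent_start` and its `component` argument are hypothetical and only illustrate the check; nothing here is spaCy's actual API.

```python
import warnings
from typing import List, Optional


def set_sent_start(starts: List[Optional[bool]], i: int,
                   value: Optional[bool], component: str) -> None:
    """Set one boundary value, warning if an already decided value changes."""
    old = starts[i]
    # None models the unset (zero) state, so only True/False count as decided.
    if old is not None and old != value:
        warnings.warn(
            f"{component} changed sent_start at token {i} from {old} to {value}"
        )
    starts[i] = value


starts = [True, None, False]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    set_sent_start(starts, 1, True, "parser")  # fills an unset value: silent
    set_sent_start(starts, 2, True, "parser")  # flips False to True: warns
messages = [str(w.message) for w in caught]
```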

I think the default behavior could be that any pipeline component can add sentence boundaries but that components won't remove any sentence boundaries. The idea would be that the Sentencizer or SentenceRecognizer add punctuation-based boundaries (typically high precision, although the Sentencizer less so) and the Parser can add phrase-based boundaries (improving recall). I don't know if this works as cleanly as envisioned in practice, especially with the Sentencizer. Most likely people using the Sentencizer aren't using other components so it's less of an issue, but I could imagine SentenceRecognizer + Parser as a common combination.
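The "add but never remove" default could be sketched as a merge of two lists of ternary values. The function name `add_boundaries` and the example values are made up for illustration, and treating an explicit False as stronger than None is an assumption of this sketch, not something decided in this issue.

```python
from typing import List, Optional


def add_boundaries(existing: List[Optional[bool]],
                   proposed: List[Optional[bool]]) -> List[Optional[bool]]:
    """Merge proposed boundaries into existing ones without removing any."""
    merged = []
    for old, new in zip(existing, proposed):
        if old is True or new is True:
            # A boundary from either source is kept.
            merged.append(True)
        elif old is False or new is False:
            # Assumption: an explicit False beats "undecided".
            merged.append(False)
        else:
            merged.append(None)
    return merged


# Hypothetical output of a punctuation-based component (high precision).
recognizer = [True, None, None, None, True, None]
# Hypothetical parser output that adds one boundary and omits another.
parser = [True, False, True, False, False, False]
merged = add_boundaries(recognizer, parser)
```

Here the boundary at index 4 survives even though the second component proposed False, and the new boundary at index 2 is added.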


adrianeboyd · 2020-02-24T17:26:03Z

Suggestions from @DomHudson in #5050 (comment):

In my opinion the combination of {None, True, False} is not transparent or flexible enough to provide the information that it is currently trying to capture. It is likely to cause problems as it would be entirely reasonable to expect True to indicate a sentence boundary and False otherwise - a clean API should be self-explanatory.

I think the best approach is to have this attribute as a boolean (no None-types allowed) once the sentence boundaries have been set and None-type otherwise. If there is a desire to allow more complex stacking of pipelines and pipeline-units then a more complete history should be kept, for example a SentenceBoundaryBoolean object could be created which mimics True or False but also allows the state of certain tokens to be altered after their initial creation and retains the history of the model that caused the latest change. This would provide much more flexibility and explainability than the limited {True, False, None}.
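For concreteness, the proposed SentenceBoundaryBoolean might look roughly like the following. This is only a sketch of the suggestion: the class does not exist in spaCy, and the `set` method, `history` list, and `last_set_by` property are assumed names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SentenceBoundaryBoolean:
    """Boolean-like boundary value that records which component set it."""
    value: Optional[bool] = None
    history: List[Tuple[str, bool]] = field(default_factory=list)

    def set(self, value: bool, component: str) -> None:
        # Record every change so the latest decision can be explained.
        self.history.append((component, value))
        self.value = value

    def __bool__(self) -> bool:
        # Mimic True/False in conditions; an unset value is falsy.
        return bool(self.value)

    @property
    def last_set_by(self) -> Optional[str]:
        return self.history[-1][0] if self.history else None


boundary = SentenceBoundaryBoolean()
boundary.set(True, "senter")
boundary.set(False, "parser")
```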

svlandeg · 2020-07-03T09:18:14Z

Related conversations:

- Issue #5578 (doc.sents doesn't work for some docs restored from persisted DocBin): about DocBin serialization and which attributes to use
- Issue #5050 (is_sent_start None instead of False): about the ternary system (true, false, none) for is_sent_start and how different components should set the values
- Issue #5287 (Matcher with "SENT_START": False works differently with Sentencizer vs. dependency parser): about how the Matcher accesses the "ternary boolean" sentence boundary values

honnibal · 2020-07-06T10:48:51Z

My position on this is mostly "keep it as is". I'm open to debate on this, but I'll explain my position.

I agree that an is_sentenced flag would be good. I'm happy to accommodate that.

I still think the ternary values are the most practical mechanism for allowing components to coordinate on the sentence boundaries. I don't think it would really help to have something like a decision history or something, and that would be impractical for efficiency reasons anyway.

There's ultimately no way for components to know what other components are expected to run before or after them. It's up to the pipeline author to construct a pipeline that behaves well as a whole. It's nice if components are configurable about how they set the sentence boundary values, but that's a question for the design of the individual components. And the pipeline author can always insert other processes that run over the Doc and set the boundaries differently.

I don't think any more complicated mechanism than ternary values would really help components coordinate. Let's say components got to set a single probability instead of a ternary. If you're writing a component and you receive some set probability, how should you interpret it? It will depend on how accurate you expect that model to be on your data, and how accurate you expect the component's own model to be. Only the person who puts together the pipeline is in a position to know how those values should be integrated, so it still can't happen automatically. Similarly, let's say you had a full history of which components had set the is_sent_start values, and what decisions they had made. If you had that, what would you do with it? You still don't know what the correct value should be.

So my position is that components are able to set three values for the is_sent_start attribute on each token: True, False, and None. Components should try to do a good job setting this value, and any component can choose to respect or ignore the previous decisions. Components will be more useful if they tell users what they do and allow that to be configured, but that's ultimately up to the component. And ultimately it's up to the pipeline author to construct a sequence of components that give useful results.
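A small sketch of a component that can either respect or ignore earlier decisions, with the choice exposed as a setting. The function `apply_predictions` and its `overwrite` flag are hypothetical names, and the tokens are modelled as a plain list of ternary values rather than a Doc.

```python
from typing import List, Optional


def apply_predictions(starts: List[Optional[bool]],
                      predictions: List[bool],
                      overwrite: bool = False) -> List[Optional[bool]]:
    """Apply a component's predictions to the existing ternary values.

    With overwrite=False only undecided (None) values are filled in;
    with overwrite=True every earlier decision is replaced.
    """
    return [
        new if (overwrite or old is None) else old
        for old, new in zip(starts, predictions)
    ]


previous = [True, None, False, None]
predicted = [True, True, True, False]
respected = apply_predictions(previous, predicted)
ignored = apply_predictions(previous, predicted, overwrite=True)
```

Which of the two results is preferable depends on how much the pipeline author trusts each component, which is the point made above.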

For ourselves as pipeline and component authors, I think the parser could be a bit more configurable. We could expose an option to never insert sentence boundaries, regardless of whether False or None were set. We can currently get that behaviour by setting all the is_sent_start values to False, but that overrides the previous values which might be undesirable. Personally I think this isn't that useful a configuration though, and I don't know what I'd call it.

adrianeboyd added labels on Dec 6, 2019: enhancement (Feature requests and improvements), feat / doc (Feature: Doc, Span and Token objects), feat / parser (Feature: Dependency Parser), feat / sentencizer (Feature: Sentencizer (rule-based sentence segmenter))

adrianeboyd mentioned this issue Jan 17, 2020

How do I train sentence splitter without training DEP parser? #4912

Closed

adrianeboyd mentioned this issue Feb 23, 2020

is_sent_start None instead of False #5050

Closed

svlandeg mentioned this issue Apr 10, 2020

Matcher with "SENT_START": False works differently with Sentencizer vs. dependency parser #5287

Closed

adrianeboyd mentioned this issue Jun 12, 2020

doc.sents doesn't work for some docs restored from persisted DocBin #5578

Closed
