
Matcher with "SENT_START": False works differently with Sentencizer vs. dependency parser #5287

Closed · peter-exos opened this issue Apr 9, 2020 · 4 comments · Fixed by #5346

Labels: bug (Bugs and behaviour differing from documentation) · feat / matcher (Feature: Token, phrase and dependency matcher) · feat / parser (Feature: Dependency Parser) · feat / sentencizer (Feature: Sentencizer, rule-based sentence segmenter)

peter-exos (Contributor) commented Apr 9, 2020

How to reproduce the behaviour

Matcher with a token pattern containing "SENT_START": False gives the intended result with sentence boundaries set by the parser, but not if sentence boundaries are set by the Sentencizer.

Say we want to identify (and merge) sequences of tokens in Title Case, but obviously we do not want to include the first token in a sentence. Then the following token pattern works with spaCy version >= 2.2.4

import spacy
from spacy.pipeline.pipes import Sentencizer
from spacy.pipeline import EntityRuler

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"IS_TITLE": True, "IS_SENT_START": False},
            {"IS_TITLE": True, "OP": "+"}]
    }]

nlp = spacy.load('en_core_web_sm', disable=['ner'])  # disabling NER here to see effect of Matcher
nlp.add_pipe(EntityRuler(nlp, patterns=prop_patterns, validate=True))
nlp.add_pipe(nlp.create_pipe('merge_entities'))

text = "In Europe, the European Central Bank is trying to calm the markets. But the crisis could last for months."
doc = nlp(text)
for sent in doc.sents:
    print([tok.text for tok in sent])

printing

['In', 'Europe', ',', 'the', 'European Central Bank', 'is', 'trying', 'to', 'calm', 'the', 'markets', '.']
['But', 'the', 'crisis', 'could', 'last', 'for', 'months', '.']

as expected.

But when we disable the parser and instead use the Sentencizer to assign sentence boundaries, we get

['In', 'Europe', ',', 'the', 'European', 'Central', 'Bank', 'is', 'trying', 'to', 'calm', 'the', 'markets', '.']
['But', 'the', 'crisis', 'could', 'last', 'for', 'months', '.']

where European Central Bank is not merged as an entity (full code is listed at the bottom).

Possible causes, solutions, and (currently impossible) workarounds

When a token is not at the beginning of a sentence, the dependency parser does not actually set token.is_sent_start (it leaves it as None), while the Sentencizer sets it to False.
(Exactly how this leads to the difference in matching behavior is not 100% clear to me.)
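
For illustration, the difference is visible directly on token.is_sent_start. A minimal sketch (the nlp_parser / nlp_rules names are only for this example):

import spacy
from spacy.pipeline.pipes import Sentencizer

text = "In Europe, the European Central Bank is trying to calm the markets."

# parser-based boundaries
nlp_parser = spacy.load('en_core_web_sm', disable=['ner'])

# rule-based boundaries
nlp_rules = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp_rules.add_pipe(Sentencizer())

for nlp in (nlp_parser, nlp_rules):
    doc = nlp(text)
    # non-initial tokens: None under the parser, False under the Sentencizer
    print([(tok.text, repr(tok.is_sent_start)) for tok in doc])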

Trivalent sentence boundary marking and assigning is_sentenced

The behavior of the dependency parser is probably the better option, because its trivalent logic (True / False / None) would let multiple sentence-segmentation components complement each other in a given pipeline.

But just removing the following two lines

                    else:
                        doc.c[j].sent_start = -1

from Sentencizer.set_annotations would cause problems with Docs consisting of a single sentence: given how it is currently implemented, Doc.is_sentenced would be False.
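
A rough illustration of that concern (a sketch reusing the imports above, not the actual Sentencizer code; it assumes Doc.is_sentenced is derived from the token-level sent_start flags in spaCy 2.2.x):

nlp_rules = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp_rules.add_pipe(Sentencizer())
doc = nlp_rules("A single sentence with no internal boundary.")
# with the current Sentencizer this prints True; if the two lines above were
# simply removed, only the first token would carry a sent_start value, so the
# check behind Doc.is_sentenced would come out False
print(doc.is_sentenced)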

(Currently impossible?) workarounds

Instead of specifying "SENT_START": False, we could try to write the pattern in terms of != True, which would include both False and None.
Unfortunately, this does not seem possible right now.
The following two approaches yield errors complaining that the complex right-hand side is not of type 'boolean' [0 -> SENT_START]:

{"IS_TITLE": True, "SENT_START": {"NOT_IN": [True]}}
{"IS_TITLE": True, "IS_SENT_START": {"!=": True}}

(The latter issue may be related to #5281).

Finally, I tried writing the pattern like

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"SENT_START": True, "OP": "!"},
            {"IS_TITLE": True, "OP": "+"}]
    }]

and the results were wild: in the example above it (expectedly) yields the merged token the European Central Bank, but when a sentence starts with multiple tokens in Title Case, the sentence boundaries get messed up (again expectedly: merging a match that crosses a sentence boundary invalidates the Spans marking the sentences).

The full code to reproduce the second output above (using IS_SENT_START instead of SENT_START yields the same pattern):

import spacy
from spacy.pipeline.pipes import Sentencizer
from spacy.pipeline import EntityRuler

# prop_patterns as defined above
nlq = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlq.add_pipe(Sentencizer())
nlq.add_pipe(EntityRuler(nlq, patterns=prop_patterns, validate=True))
nlq.add_pipe(nlq.create_pipe('merge_entities'))

text = "In Europe, the European Central Bank is trying to calm the markets. But the crisis could last for months."
doc = nlq(text)
for sent in doc.sents:
    print([tok.text for tok in sent])
for tok in doc:
    print(tok.text, repr(tok.is_sent_start))

Your Environment

  • spaCy version: 2.2.4
  • Platform: Darwin-18.7.0-x86_64-i386-64bit
  • Python version: 3.7.7

svlandeg (Member) commented:

This relates to #4775. I agree that it's confusing that one component sets non-sentence boundaries as None, and the other as False. I think @adrianeboyd is considering the options as this is a tricky issue when you want to combine different pipeline components that would set sentence boundaries in sequence.

As a temporary workaround, have you tried pulling in the branch from #5282 to check whether that would solve your issue, using != True? It would be a good test for that PR as well.

peter-exos (Contributor, Author) commented Apr 10, 2020

It doesn't look like #5282 solves the issue.

In fact, it looks like a completely different issue.
After pulling in #5282, I still get the same error message:

spacy.errors.MatchPatternError: Invalid token patterns for matcher rule 'ProperName'

Pattern 0:
- {'!=': True} is not of type 'boolean' [0 -> SENT_START]

This is the same error as I got when trying the workarounds mentioned above.

The issue seems to be that the extended pattern syntax (operators like 'IN' or '!=') is not compatible with Boolean attributes; e.g., the following patterns both throw the same error:

{"IS_TITLE" : {"!=": False}}
{"IS_TITLE" : {"IN": [True]}}

This issue seems to be independent of #5282.

Just to confirm this, here is a self-contained code snippet, which breaks in the last line:

import spacy
from spacy.pipeline import EntityRuler
from spacy.matcher import Matcher
from spacy.tokens import Doc

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"IS_TITLE" : {"==": True}},
            {"IS_TITLE": True, "OP": "+"}]
    }]

nlp = spacy.load('en_core_web_sm', disable=['ner'])

matcher = Matcher(nlp.vocab)
pattern = [{"LENGTH": {"!=": 2}}]
matcher.add("LENGTH_COMPARE", [pattern])
doc = Doc(nlp.vocab, words=["a", "aa", "aaa"])
matches = matcher(doc)
assert len(matches) == 2

nlp.add_pipe(EntityRuler(nlp, patterns=prop_patterns, validate=True))

adrianeboyd (Contributor) commented:

For SENT_START, the Matcher is trying to compare boolean values in the pattern to -1/0/1 values underneath, which unsurprisingly does not really work that well for False == -1. This is only a problem for IS_SENT_START, since the other boolean attributes are actually boolean underneath.

My proposed API is kind of clunky, but I think the easiest short-term solution is to normalize this value to True/False when the Matcher accesses it. I think we probably need to rethink how to store/show the sentence boundary values in the future.
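
A rough sketch of that normalization idea (not the actual Matcher internals; normalize_sent_start is a hypothetical helper): translate the stored -1/0/1 value to False/None/True before comparing it against a boolean in the pattern.

def normalize_sent_start(value):
    # internal storage: 1 = sentence start, -1 = explicitly not a start, 0 = unset
    if value == 1:
        return True
    if value == -1:
        return False
    return None  # unset / unknown

# a pattern value like {"SENT_START": False} would then match tokens stored as -1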

github-actions bot commented Nov 6, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
