
Matcher with "SENT_START": False works differently with Sentencizer vs. dependency parser #5287

Closed · peter-exos opened this issue Apr 9, 2020 · 4 comments · Fixed by #5346

Labels: bug (Bugs and behaviour differing from documentation) · feat / matcher (Feature: Token, phrase and dependency matcher) · feat / parser (Feature: Dependency Parser) · feat / sentencizer (Feature: Sentencizer, rule-based sentence segmenter)

peter-exos (Contributor) commented Apr 9, 2020

How to reproduce the behaviour

Matcher with a token pattern containing "SENT_START": False gives the intended result with sentence boundaries set by the parser, but not if sentence boundaries are set by the Sentencizer.

Say we want to identify (and merge) sequences of tokens in Title Case, but obviously we do not want to include the first token in a sentence. Then the following token pattern works with spaCy version >= 2.2.4

import spacy
from spacy.pipeline.pipes import Sentencizer
from spacy.pipeline import EntityRuler

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"IS_TITLE": True, "IS_SENT_START": False},
            {"IS_TITLE": True, "OP": "+"}]
    }]

nlp = spacy.load('en_core_web_sm', disable=['ner'])  # disabling NER here to see effect of Matcher
nlp.add_pipe(EntityRuler(nlp, patterns=prop_patterns, validate=True))
nlp.add_pipe(nlp.create_pipe('merge_entities'))

text = "In Europe, the European Central Bank is trying to calm the markets. But the crisis could last for months."
doc = nlp(text)
for sent in doc.sents:
    print([tok.text for tok in sent])

printing

['In', 'Europe', ',', 'the', 'European Central Bank', 'is', 'trying', 'to', 'calm', 'the', 'markets', '.']
['But', 'the', 'crisis', 'could', 'last', 'for', 'months', '.']

as expected.

But when we disable the parser and instead use the Sentencizer to assign sentence boundaries, we get

['In', 'Europe', ',', 'the', 'European', 'Central', 'Bank', 'is', 'trying', 'to', 'calm', 'the', 'markets', '.']
['But', 'the', 'crisis', 'could', 'last', 'for', 'months', '.']

where European Central Bank is not merged as an entity (full code is listed at the bottom).

Possible causes, solutions, and (currently impossible) workarounds

When a token is not at the beginning of a sentence, the dependency parser does not actually set token.is_sent_start (it leaves it as None), while the Sentencizer sets it to False.
(Exactly how this leads to the difference in matching behavior is not 100% clear to me.)
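
For illustration, the difference is visible directly on token.is_sent_start. A minimal sketch (the nlp_parser / nlp_rules names are only for this example):

import spacy
from spacy.pipeline.pipes import Sentencizer

text = "In Europe, the European Central Bank is trying to calm the markets."

# parser-based boundaries
nlp_parser = spacy.load('en_core_web_sm', disable=['ner'])

# rule-based boundaries
nlp_rules = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp_rules.add_pipe(Sentencizer())

for nlp in (nlp_parser, nlp_rules):
    doc = nlp(text)
    # non-initial tokens: None under the parser, False under the Sentencizer
    print([(tok.text, repr(tok.is_sent_start)) for tok in doc])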

Trivalent sentence boundary marking and assigning is_sentenced

The behavior of the dependency parser is probably the better option, because its trivalent logic (True / False / None) would let multiple sentence-segmentation components complement each other in a given pipeline.

But just removing the following two lines

                    else:
                        doc.c[j].sent_start = -1

from Sentencizer.set_annotations would cause problems with Docs consisting of a single sentence: given how it is currently implemented, Doc.is_sentenced would be False.
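
A rough illustration of that concern (a sketch reusing the imports above, not the actual Sentencizer code; it assumes Doc.is_sentenced is derived from the token-level sent_start flags in spaCy 2.2.x):

nlp_rules = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp_rules.add_pipe(Sentencizer())
doc = nlp_rules("A single sentence with no internal boundary.")
# with the current Sentencizer this prints True; if the two lines above were
# simply removed, only the first token would carry a sent_start value, so the
# check behind Doc.is_sentenced would come out False
print(doc.is_sentenced)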

(Currently impossible?) workarounds

Instead of specifying "SENT_START": False, we could try to write the pattern in terms of != True, which would include both False and None.
Unfortunately, this does not seem possible right now.
The following two approaches yield errors complaining that the complex right-hand side is not of type 'boolean' [0 -> SENT_START]:

{"IS_TITLE": True, "SENT_START": {"NOT_IN": [True]}}
{"IS_TITLE": True, "IS_SENT_START": {"!=": True}}

(The latter issue may be related to #5281).

Finally, I tried writing the pattern like

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"SENT_START": True, "OP": "!"},
            {"IS_TITLE": True, "OP": "+"}]
    }]

and the results were wild: in the example above it (expectedly) yields the merged token the European Central Bank, but when a sentence starts with multiple tokens in Title Case, the sentence boundaries get messed up (again expectedly: merging a match that crosses a sentence boundary invalidates the Spans marking the sentences).

The full code to reproduce the second output above (using IS_SENT_START instead of SENT_START yields the same pattern):

import spacy
from spacy.pipeline.pipes import Sentencizer
from spacy.pipeline import EntityRuler

# prop_patterns as defined above
nlq = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlq.add_pipe(Sentencizer())
nlq.add_pipe(EntityRuler(nlq, patterns=prop_patterns, validate=True))
nlq.add_pipe(nlq.create_pipe('merge_entities'))

text = "In Europe, the European Central Bank is trying to calm the markets. But the crisis could last for months."
doc = nlq(text)
for sent in doc.sents:
    print([tok.text for tok in sent])
for tok in doc:
    print(tok.text, repr(tok.is_sent_start))

Your Environment

  • spaCy version: 2.2.4
  • Platform: Darwin-18.7.0-x86_64-i386-64bit
  • Python version: 3.7.7

svlandeg (Member) commented:

This relates to #4775. I agree that it's confusing that one component sets non-sentence boundaries as None, and the other as False. I think @adrianeboyd is considering the options as this is a tricky issue when you want to combine different pipeline components that would set sentence boundaries in sequence.

As a temporary workaround, have you tried pulling in the branch from #5282 to check whether that would solve your issue, using != True? It would be a good test for that PR as well.

peter-exos (Contributor, Author) commented Apr 10, 2020

It doesn't look like #5282 solves the issue.

In fact, it looks like a completely different issue.
After pulling in #5282, I still get the same error message:

spacy.errors.MatchPatternError: Invalid token patterns for matcher rule 'ProperName'

Pattern 0:
- {'!=': True} is not of type 'boolean' [0 -> SENT_START]

This is the same error as I got when trying the workarounds mentioned above.

The issue seems to be that the extended pattern syntax (operators like 'IN' or '!=') is not compatible with Boolean attributes; e.g., the following patterns both throw the same error:

{"IS_TITLE" : {"!=": False}}
{"IS_TITLE" : {"IN": [True]}}

This issue seems to be independent of #5282.

Just to confirm this, here is a self-contained code snippet, which breaks in the last line:

import spacy
from spacy.pipeline import EntityRuler
from spacy.matcher import Matcher
from spacy.tokens import Doc

prop_patterns = [
    {   # Proper names come in title case
        "label": "ProperName",
        "pattern": [ 
            {"IS_TITLE" : {"==": True}},
            {"IS_TITLE": True, "OP": "+"}]
    }]

nlp = spacy.load('en_core_web_sm', disable=['ner'])

matcher = Matcher(nlp.vocab)
pattern = [{"LENGTH": {"!=": 2}}]
matcher.add("LENGTH_COMPARE", [pattern])
doc = Doc(nlp.vocab, words=["a", "aa", "aaa"])
matches = matcher(doc)
assert len(matches) == 2

nlp.add_pipe(EntityRuler(nlp, patterns=prop_patterns, validate=True))

adrianeboyd (Contributor) commented:

For SENT_START, the Matcher is trying to compare boolean values in the pattern to -1/0/1 values underneath, which unsurprisingly does not really work that well for False == -1. This is only a problem for IS_SENT_START, since the other boolean attributes are actually boolean underneath.

My proposed API is kind of clunky, but I think the easiest short-term solution is to normalize this value to True/False when the Matcher accesses it. I think we probably need to rethink how to store/show the sentence boundary values in the future.
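
A rough sketch of that normalization idea (not the actual Matcher internals; normalize_sent_start is a hypothetical helper): translate the stored -1/0/1 value to False/None/True before comparing it against a boolean in the pattern.

def normalize_sent_start(value):
    # internal storage: 1 = sentence start, -1 = explicitly not a start, 0 = unset
    if value == 1:
        return True
    if value == -1:
        return False
    return None  # unset / unknown

# a pattern value like {"SENT_START": False} would then match tokens stored as -1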

github-actions bot commented Nov 6, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
