Add Precision, Recall, F-measure, Confusion Matrix to Taggers #2862

Merged
merged 7 commits into nltk:develop on Dec 15, 2021

Conversation

tomaarsen
Member

Hello!

Pull request overview

  • Implement Precision, Recall, F-measure, Confusion matrices and a per-tag evaluation for all Taggers.
  • Implement Precision, Recall, F-measure and per-tag evaluation for ConfusionMatrix
  • Add large sections to tag.doctest and metrics.doctest that show these additions.
  • Small fixes for some in-method doctests throughout the tag package.

Method overview

Every Tagger in NLTK subclasses the TaggerI interface, which used to provide the following methods:

  • tag(tokens)
  • tag_sents(sentences)
  • evaluate(gold)

After this PR, it also provides

  • confusion(gold)
  • recall(gold)
  • precision(gold)
  • f_measure(gold, alpha=0.5)
  • evaluate_per_tag(gold, alpha=0.5, truncate=None, sort_by_count=False)

Beyond that, nltk/metrics/confusionmatrix.py provides a ConfusionMatrix class, to which this PR adds the following methods:

  • recall(value)
  • precision(value)
  • f_measure(value, alpha=0.5)
  • evaluate(alpha=0.5, truncate=None, sort_by_count=False)

Reasoning

In my experience working with the NLTK Taggers, the evaluation that can easily be done is very minimal. You can call tagger.evaluate(gold) to compute an accuracy, but that gives no information on which tokens are actually being tagged correctly, or whether we're over- or under-predicting certain tags. Accuracy alone simply isn't enough.

So, I went looking for recall, precision and f-measures in the codebase. We've implemented these in nltk/metrics/scores.py, but they're very much written for IR tasks: they take sets, and use set intersections to compute the values. This doesn't work for Taggers, as tags need to be able to occur multiple times, and converting them to sets collapses those duplicates.
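
As a quick illustration of that limitation (my own example, reusing the toy tag sequences from the ConfusionMatrix section below): once the sequences are converted to sets, duplicate tags collapse and most of the tagging mistakes vanish from the score.

>>> from nltk.metrics.scores import precision, recall
>>> reference = "DET NN VB DET JJ NN NN IN DET NN".split()
>>> test = "DET VB VB DET NN NN NN IN DET NN".split()
>>> precision(set(reference), set(test))  # every distinct tag in test also occurs in reference
1.0
>>> recall(set(reference), set(test))  # only the distinct tag JJ is "missing" from test
0.8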

Changes for ConfusionMatrix

For ConfusionMatrix, recall(value), precision(value) and f_measure(value, alpha=0.5) are all very similar, and they simply return the float for the corresponding metric, for that value.

E.g., in the following ConfusionMatrix:

>>> from nltk.metrics import ConfusionMatrix
>>> reference = "DET NN VB DET JJ NN NN IN DET NN".split()
>>> test = "DET VB VB DET NN NN NN IN DET NN".split()
>>> cm = ConfusionMatrix(reference, test)
>>> print(cm.pretty_format(sort_by_count=True))
    |   D       |
    | N E I J V |
    | N T N J B |
----+-----------+
 NN |<3>. . . 1 |
DET | .<3>. . . |
 IN | . .<1>. . |
 JJ | 1 . .<.>. |
 VB | . . . .<1>|
----+-----------+
(row = reference; col = test)

The recall for VB will be 1.0 (True positive is 1, False negative is 0), and the precision for VB will be 0.5 (True positive is 1, False positive is 1).
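
For example, these values can also be fetched directly with the new methods (the exact float formatting below is illustrative; the numbers follow from the explanation above, with the default alpha=0.5 giving the harmonic mean of precision and recall):

>>> cm.precision('VB')
0.5
>>> cm.recall('VB')
1.0
>>> cm.f_measure('VB')
0.6666666666666666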

Furthermore, the new method evaluate will output all of this information concisely in a tabular format:

>>> print(cm.evaluate())
Tag | Prec.  | Recall | F-measure
----+--------+--------+-----------
DET | 1.0000 | 1.0000 | 1.0000
 IN | 1.0000 | 1.0000 | 1.0000
 JJ | 0.0000 | 0.0000 | 0.0000
 NN | 0.7500 | 0.7500 | 0.7500
 VB | 0.5000 | 1.0000 | 0.6667

The evaluate method uses recall, precision and f_measure internally. These three methods can also be called directly to get the float results.
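
As a rough sketch of that composition (not the merged implementation; the helper name is made up), such a table can be built from the three per-value methods on a ConfusionMatrix cm, given an iterable of tags:

def format_metrics_table(cm, tags, alpha=0.5):
    # Header matching the layout shown above.
    lines = ["Tag | Prec.  | Recall | F-measure", "----+--------+--------+-----------"]
    for tag in sorted(tags):
        lines.append(
            f"{tag:>3} | {cm.precision(tag):.4f} | {cm.recall(tag):.4f} | {cm.f_measure(tag, alpha):.4f}"
        )
    return "\n".join(lines)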

Changes for TaggerI

These ConfusionMatrix changes have interesting consequences, for example for taggers. I've introduced new methods on TaggerI, which mostly speak for themselves, especially with some examples:

Set up a (pretrained) tagger

>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import treebank
>>> tagger = PerceptronTagger()
>>> gold_data = treebank.tagged_sents()[:10]

Evaluate with accuracy

This method already existed!

>>> tagger.evaluate(gold_data)
0.8940677966101694

Evaluate with recall

This method, and the next two, will return the per-tag metrics, so developers have a machine-readable way to use these metrics for whatever use they see fit.

>>> tagger.recall(gold_data)
{"''": 1.0, ',': 1.0, '-NONE-': 0.0, '.': 1.0, 'CC': 1.0, 'CD': 1.0, 'DT': 1.0, 'EX': 1.0, 'IN': 0.88, 'JJ': 0.8888888888888888, 'JJR': 0.0,
'JJS': 1.0, 'MD': 1.0, 'NN': 0.9333333333333333, 'NNP': 1.0, 'NNS': 1.0, 'POS': 1.0, 'PRP': 1.0, 'PRP$': 1.0, 'RB': 1.0, 'RBR': 0.5,
'RP': 1.0, 'TO': 1.0, 'VB': 1.0, 'VBD': 0.8571428571428571, 'VBG': 0.8, 'VBN': 0.8, 'VBP': 1.0, 'VBZ': 1.0, 'WDT': 0.0, '``': 1.0}

Evaluate with precision

>>> tagger.precision(gold_data)
{"''": 1.0, ',': 1.0, '-NONE-': 0.0, '.': 1.0, 'CC': 1.0, 'CD': 0.7142857142857143, 'DT': 1.0, 'EX': 1.0, 'IN': 0.9166666666666666,
'JJ': 0.8888888888888888, 'JJR': 0.0, 'JJS': 1.0, 'MD': 1.0, 'NN': 0.8, 'NNP': 0.8928571428571429, 'NNS': 0.95, 'POS': 1.0,
'PRP': 1.0, 'PRP$': 1.0, 'RB': 0.4, 'RBR': 1.0, 'RP': 1.0, 'TO': 1.0, 'VB': 1.0, 'VBD': 0.8571428571428571, 'VBG': 1.0, 'VBN': 1.0,
'VBP': 1.0, 'VBZ': 1.0, 'WDT': 0.0, '``': 1.0}

Evaluate with f_measure

>>> tagger.f_measure(gold_data)
{"''": 1.0, ',': 1.0, '-NONE-': 0.0, '.': 1.0, 'CC': 1.0, 'CD': 0.8333333333333334, 'DT': 1.0, 'EX': 1.0, 'IN': 0.8979591836734693,
'JJ': 0.8888888888888888, 'JJR': 0.0, 'JJS': 1.0, 'MD': 1.0, 'NN': 0.8615384615384616, 'NNP': 0.9433962264150942,
'NNS': 0.9743589743589745, 'POS': 1.0, 'PRP': 1.0, 'PRP$': 1.0, 'RB': 0.5714285714285714, 'RBR': 0.6666666666666666,
'RP': 1.0, 'TO': 1.0, 'VB': 1.0, 'VBD': 0.8571428571428571, 'VBG': 0.8888888888888888, 'VBN': 0.8888888888888888, 
'VBP': 1.0, 'VBZ': 1.0, 'WDT': 0.0, '``': 1.0}
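
As a quick illustration of that machine-readable angle (a hypothetical snippet, not part of the PR; the expected output follows from the dictionaries shown above), these dictionaries make it easy to rank tags by how far recall runs ahead of precision:

>>> recalls = tagger.recall(gold_data)
>>> precisions = tagger.precision(gold_data)
>>> sorted(recalls, key=lambda tag: recalls[tag] - precisions[tag], reverse=True)[:3]
['RB', 'CD', 'NN']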

Evaluate with evaluate_per_tag

This method provides the human-readable form of the recall, precision and f_measure methods, allowing developers of taggers to inspect where their taggers are still performing suboptimally. Immediately upon looking at this output, you can see that the default pre-trained NLTK PerceptronTagger has a really high recall for CD, while it has a low precision there. This indicates that too many tokens are tagged as CD, which is something the developer could look into.

This is only for 10 sentences, but there's a lot of interesting information to be gleaned when you use the entire treebank section that NLTK has access to. (For example, JJ has a precision of 0.6163 with a recall of 0.9131!)

>>> print(tagger.evaluate_per_tag(gold_data, sort_by_count=True))
   Tag | Prec.  | Recall | F-measure
-------+--------+--------+-----------
    NN | 0.8000 | 0.9333 | 0.8615
    IN | 0.9167 | 0.8800 | 0.8980
   NNP | 0.8929 | 1.0000 | 0.9434
    DT | 1.0000 | 1.0000 | 1.0000
   NNS | 0.9500 | 1.0000 | 0.9744
    JJ | 0.8889 | 0.8889 | 0.8889
     , | 1.0000 | 1.0000 | 1.0000
-NONE- | 0.0000 | 0.0000 | 0.0000
     . | 1.0000 | 1.0000 | 1.0000
   VBD | 0.8571 | 0.8571 | 0.8571
   VBZ | 1.0000 | 1.0000 | 1.0000
    CD | 0.7143 | 1.0000 | 0.8333
    TO | 1.0000 | 1.0000 | 1.0000
   VBG | 1.0000 | 0.8000 | 0.8889
   VBN | 1.0000 | 0.8000 | 0.8889
   PRP | 1.0000 | 1.0000 | 1.0000
    RB | 0.4000 | 1.0000 | 0.5714
    VB | 1.0000 | 1.0000 | 1.0000
   VBP | 1.0000 | 1.0000 | 1.0000
  PRP$ | 1.0000 | 1.0000 | 1.0000
   RBR | 1.0000 | 0.5000 | 0.6667
   WDT | 0.0000 | 0.0000 | 0.0000
    '' | 1.0000 | 1.0000 | 1.0000
    CC | 1.0000 | 1.0000 | 1.0000
    EX | 1.0000 | 1.0000 | 1.0000
   JJS | 1.0000 | 1.0000 | 1.0000
    MD | 1.0000 | 1.0000 | 1.0000
   POS | 1.0000 | 1.0000 | 1.0000
    RP | 1.0000 | 1.0000 | 1.0000
    `` | 1.0000 | 1.0000 | 1.0000
   JJR | 0.0000 | 0.0000 | 0.0000

Evaluate with confusion

This method goes perfectly with the previous one: a mismatch between precision and recall doesn't always give all the information a developer needs to find out what the issue truly is. Being able to quickly show a confusion matrix like this can ease understanding significantly.

>>> print(tagger.confusion(gold_data))
       |        -                                                                                     |
       |        N                                                                                     |
       |        O                                               P                                     |
       |        N                       J  J        N  N  P  P  R     R           V  V  V  V  V  W    |
       |  '     E     C  C  D  E  I  J  J  J  M  N  N  N  O  R  P  R  B  R  T  V  B  B  B  B  B  D  ` |
       |  '  ,  -  .  C  D  T  X  N  J  R  S  D  N  P  S  S  P  $  B  R  P  O  B  D  G  N  P  Z  T  ` |
-------+----------------------------------------------------------------------------------------------+
    '' | <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
     , |  .<15> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
-NONE- |  .  . <.> .  .  2  .  .  .  2  .  .  .  5  1  .  .  .  .  2  .  .  .  .  .  .  .  .  .  .  . |
     . |  .  .  .<10> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    CC |  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    CD |  .  .  .  .  . <5> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    DT |  .  .  .  .  .  .<20> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    EX |  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    IN |  .  .  .  .  .  .  .  .<22> .  .  .  .  .  .  .  .  .  .  3  .  .  .  .  .  .  .  .  .  .  . |
    JJ |  .  .  .  .  .  .  .  .  .<16> .  .  .  .  1  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  . |
   JJR |  .  .  .  .  .  .  .  .  .  . <.> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   JJS |  .  .  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    MD |  .  .  .  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
    NN |  .  .  .  .  .  .  .  .  .  .  .  .  .<28> 1  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   NNP |  .  .  .  .  .  .  .  .  .  .  .  .  .  .<25> .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   NNS |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .<19> .  .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   POS |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  .  .  .  .  . |
   PRP |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <4> .  .  .  .  .  .  .  .  .  .  .  .  . |
  PRP$ |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <2> .  .  .  .  .  .  .  .  .  .  .  . |
    RB |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <4> .  .  .  .  .  .  .  .  .  .  . |
   RBR |  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  .  . |
    RP |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <1> .  .  .  .  .  .  .  .  . |
    TO |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <5> .  .  .  .  .  .  .  . |
    VB |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <3> .  .  .  .  .  .  . |
   VBD |  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  . <6> .  .  .  .  .  . |
   VBG |  .  .  .  .  .  .  .  .  .  .  .  .  .  1  .  .  .  .  .  .  .  .  .  .  . <4> .  .  .  .  . |
   VBN |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  . <4> .  .  .  . |
   VBP |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <3> .  .  . |
   VBZ |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <7> .  . |
   WDT |  .  .  .  .  .  .  .  .  2  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <.> . |
    `` |  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . <1>|
-------+----------------------------------------------------------------------------------------------+
(row = reference; col = test)

I would recommend having a look at the updated nltk/test/tag.doctest, which shows some more examples of how these methods can be very useful in the development process of taggers.

Implementation details

The implementation on the ConfusionMatrix side is very simple. It's simply a case of recognising TP, FP, FN and TN, and using them to compute the precision, recall, f_measure and the evaluation table.
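
For reference, a minimal sketch of that per-tag computation (my own illustration, not the merged code), assuming counts[ref][pred] holds the number of tokens with reference tag ref that were predicted as pred, and tags is the full set of tags:

def tag_metrics(counts, tags, tag, alpha=0.5):
    tp = counts[tag][tag]
    fp = sum(counts[other][tag] for other in tags if other != tag)  # predicted as `tag`, but shouldn't be
    fn = sum(counts[tag][other] for other in tags if other != tag)  # should be `tag`, predicted as something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    # Weighted harmonic mean; with the default alpha=0.5 this is the usual F1 score.
    return precision, recall, 1.0 / (alpha / precision + (1.0 - alpha) / recall)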

And for the TaggerI side it's also fairly simple: The gold parameter (i.e. the known correct list of tagged sentences) is used as the "reference", while the sentences from this gold are tagged by the tagger to produce "predicted" tags. Together, these are the two dimensions for a ConfusionMatrix. Then, recall, precision, f_measure, confusion and evaluate_per_tag all simply use the ConfusionMatrix methods.

The only bit of implementation magic is that the confusion(gold) method calls another method, self._confusion_cached, after first converting gold to a tuple of tuples rather than a list of lists. This is because tuples are hashable, while lists aren't. So, with the input to self._confusion_cached being a tuple, we can (as the name suggests) cache this method call. I've set the maxsize of the cache to 1, so only one confusion matrix is ever cached, which should most likely be fine.
In short, even though every method calls self.confusion(), the tagging and the construction of the ConfusionMatrix are only done once.
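
A simplified sketch of this caching approach (the names mirror the description above, but the bodies are my own simplification, meant to be mixed into a tagger that already provides tag_sents):

from functools import lru_cache

from nltk.metrics import ConfusionMatrix

class _ConfusionCachingSketch:
    def confusion(self, gold):
        # Lists aren't hashable, so turn gold into a tuple of tuples before
        # delegating to the cached helper.
        return self._confusion_cached(tuple(tuple(sent) for sent in gold))

    @lru_cache(maxsize=1)  # keep only the most recent confusion matrix
    def _confusion_cached(self, gold):
        # Tag the untagged sentences, then flatten the gold tags ("reference")
        # and the tagger's tags ("predicted") into the two ConfusionMatrix axes.
        tagged = self.tag_sents([[token for (token, _tag) in sent] for sent in gold])
        reference = [tag for sent in gold for (_token, tag) in sent]
        predicted = [tag for sent in tagged for (_token, tag) in sent]
        return ConfusionMatrix(reference, predicted)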

Doctest changes

As you might be able to see in the PR, I've added # doctest: +NORMALIZE_WHITESPACE in a few places. Previously, doctest would fail there, as the expected output lists are spread out over multiple lines.

Beyond that, nltk/test/metrics.doctest has 3 more tests, and nltk/test/tag.doctest has been improved significantly.

  • Tom Aarsen

…tion to Taggers

And add precision, recall and f-measure to ConfusionMatrix.

Includes large doctests, and some small doctest fixes throughout the tag module

@stevenbird stevenbird self-assigned this Oct 22, 2021
@stevenbird
Member

stevenbird commented Oct 24, 2021

tagger.evaluate(gold_data)

How about deprecating this in favour of tagger.accuracy(gold_data)?

@tomaarsen
Member Author

Sounds great. I wasn't a big fan of an evaluate method simply returning an accuracy float to begin with. Accuracy is just a somewhat naive evaluation metric after all. I'll get on it.

@tomaarsen
Member Author

tomaarsen commented Oct 25, 2021

This PR now has some additional changes not described in the original text:

  • TaggerI's evaluate(gold) is now deprecated in favor of accuracy(gold). The former can still be used, but it throws a warning.
  • Similarly, ChunkParserI's evaluate(gold) is now deprecated in favor of accuracy(gold).

So, this PR is no longer exclusively about taggers, but also affects a parser.

@stevenbird stevenbird merged commit a28d256 into nltk:develop Dec 15, 2021
@stevenbird
Member

Thanks @tomaarsen – great contribution!

@tomaarsen tomaarsen deleted the feature/tagger-metrics branch December 16, 2021 08:17