
KneserNeyInterpolated has a problem with OOV words during testing and perplexity is always inf #3211

Open
nilinykh opened this issue Dec 7, 2023 · 7 comments


@nilinykh

nilinykh commented Dec 7, 2023

nltk 3.8.1

I am training and testing a language model on my corpus of sentences using KneserNeyInterpolated.
It looks like the nltk implementation of this smoothing algorithm does not know what to do with out-of-vocabulary words: during testing, the model's perplexity on them is infinity. The problem only appears in specific situations; details are below.

Training:

from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

train_input, train_vocab = padded_everygram_pipeline(order=2, text=train_tokens)
lm = KneserNeyInterpolated(order=2)
lm.fit(train_input, train_vocab)

Then I test it on bigrams (following some suggestions from #3065)

from nltk.util import bigrams
from nltk.lm.preprocessing import pad_both_ends
grams = list(bigrams(pad_both_ends('a motorcycle sits parked across from a herd of livestock'.split(), n=2)))

print(grams)
print(lm.entropy(grams))
print(lm.perplexity(grams))

And the result is

[('<s>', 'a'), ('a', 'motorcycle'), ('motorcycle', 'sits'), ('sits', 'parked'), ('parked', 'across'), ('across', 'from'), ('from', 'a'), ('a', 'herd'), ('herd', 'of'), ('of', 'livestock'), ('livestock', '</s>')]
inf
inf

I have looked at the model's perplexity for each of these bigrams and found that for ('of', 'livestock') the perplexity is inf, but for ('livestock', '</s>') the perplexity is a finite number. The problematic word (I suspect it's 'livestock') has not been seen during training: lm.vocab['livestock'] is 0.

I tested another OOV word ('radiator') and observed exactly the same behaviour: perplexity is inf when 'radiator' is the second word in the bigram (the first word can be anything), but when 'radiator' is the first word, perplexity is a finite number regardless of what the second word is. Why would such behaviour occur?
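
A check along these lines (a sketch, not necessarily the exact code used here) makes the per-bigram scores visible; a logscore of -inf means the model assigns probability 0 to that bigram:

# Score each bigram separately to find the one that drives perplexity to inf.
for context_word, word in grams:
    print((context_word, word), lm.logscore(word, [context_word]))

# Confirm whether the suspect word was seen during training.
print(lm.vocab['livestock'])         # 0, i.e. unseen
print(lm.vocab.lookup('livestock'))  # OOV words are mapped to '<UNK>'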

MLE and Laplace work fine; the problem does not occur with them.

What could be wrong?

@Higgs32584

I will look into this. I get the same error as well.

@Higgs32584

I tried some things and this is what I got. The second set of numbers is MLE.

# none out of context
1.3831080559223603
0.4679138720730883
inf
inf

# first bigram position out of context
1.7481417052901929
0.8058221352008101
inf
inf

# second bigram position out of context
inf
inf
inf
inf

# all out of context
inf
inf
inf
inf

I am not sure whether out-of-context words have a specific way they are supposed to be handled. Can you do some more testing to figure out whether this is a math issue or something else, and maybe provide expected and actual outcomes? From what I can tell, it is hard to say what the expected behaviour of KneserNeyInterpolated for OOV words is supposed to be.
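
In case someone wants to reproduce this kind of comparison, here is a minimal sketch (the toy corpus and test bigrams are stand-ins of my own, not the data behind the numbers above): it trains KneserNeyInterpolated and MLE on the same text and then scores bigrams with an OOV word in different positions.

from nltk.lm import MLE, KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

train_tokens = [
    "a motorcycle sits parked across from a herd of cows".split(),
    "a man rides a motorcycle down the road".split(),
]

def make_lm(model_cls):
    # padded_everygram_pipeline returns one-shot generators, so rebuild them per model.
    train_input, train_vocab = padded_everygram_pipeline(2, train_tokens)
    lm = model_cls(order=2)
    lm.fit(train_input, train_vocab)
    return lm

kn, mle = make_lm(KneserNeyInterpolated), make_lm(MLE)

cases = {
    "none OOV": [("a", "motorcycle"), ("motorcycle", "sits")],
    "OOV in first position": [("livestock", "sits")],
    "OOV in second position": [("of", "livestock")],
}
for name, test_bigrams in cases.items():
    print(name, kn.perplexity(test_bigrams), mle.perplexity(test_bigrams))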

@Higgs32584

This also might be related to issue #2727

@jakbtg

jakbtg commented Jan 29, 2024

Hello, I had the same issue. I solved it by downgrading NLTK to version 3.6.1. It seems there is a bug in version 3.8.1, because the exact same code works on version 3.6.1.

@nilinykh
Author

nilinykh commented Jan 30, 2024

I can confirm that this error does not appear when using nltk 3.6.1.

What differs between 3.6.1 and 3.8.1 is the implementation of Kneser-Ney: in 3.8.1, discounting is implemented together with continuation counts, which matter because lower- and higher-order n-grams are treated differently, e.g. P_continuation in Eq. 3.38 of Jurafsky and Martin.

I believe this is an improvement to the Kneser-Ney algorithm, so it makes sense to have it: uni- and bigrams should have a different discount factor than higher-order n-grams.
The code in 3.6.1 does not implement this intuition: it uses a fixed discount factor for all n-grams.

Does this mean the problem is in the implementation of the continuation counts? If so, then the code in 3.6.1 performs absolute discounting rather than Kneser-Ney discounting, so it would be incorrect to use the code from nltk==3.6.1 and treat it as the Kneser-Ney method, because it is not complete. The code in 3.8.1 looks more complete, but something is clearly wrong in how the continuation counts are calculated.
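
For reference, this is the bigram form of interpolated Kneser-Ney as given in Jurafsky and Martin (equation numbering follows the 3rd-edition draft and may vary between versions):

P_{KN}(w_i \mid w_{i-1}) = \frac{\max\left(C(w_{i-1} w_i) - d,\; 0\right)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\mathrm{CONTINUATION}}(w_i)

P_{\mathrm{CONTINUATION}}(w) = \frac{\left|\{\, w' : C(w' w) > 0 \,\}\right|}{\left|\{\, (u', w') : C(u' w') > 0 \,\}\right|}

\lambda(w_{i-1}) = \frac{d}{C(w_{i-1})} \left|\{\, w : C(w_{i-1} w) > 0 \,\}\right|

Here d is the discount, and the continuation counts in P_CONTINUATION are exactly what 3.8.1 adds over the fixed-discount scheme in 3.6.1.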

@nilinykh
Author

nilinykh commented Jan 30, 2024

I am taking some of my words back: nltk 3.8.1 seems to correctly implement Kneser-Ney with discounting. These are the alpha and gamma parts of the implementation, which correspond to the original formula in Jurafsky and Martin.

However, it does not seem to handle OOV words during testing - this is where the problem occurs. Jurafsky and Martin discuss this issue and a possible solution in Eq. 3.42.
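
That equation (numbering may differ between drafts of the book) terminates the Kneser-Ney recursion by interpolating the unigram continuation probability with the uniform distribution:

P_{KN}(w) = \frac{\max\left(c_{KN}(w) - d,\; 0\right)}{\sum_{w'} c_{KN}(w')} + \lambda(\varepsilon)\, \frac{1}{V}

where c_{KN} is the continuation count, \varepsilon is the empty context, and V is the vocabulary size, so even an unseen word keeps the probability mass \lambda(\varepsilon)/V.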

Can someone point to the part of the code which implements that?

@Kadorath

Can someone point to the part of the code which implements that?

I'm not super familiar with this, but it looks to me like the unknown-word handling described by Jurafsky was captured in 3.6.1 by the fixed 1.0/vocab_length term in unigram_score. In the current version, unigram_score no longer includes that term, so unknown words, which of course have a continuation count of 0, get a score of 0.

I think it might be that, in addition to dividing the continuation count by the total count, it should also add the lambda(ε) * 1/V term as seen in Jurafsky 3.41? That way unknown unigrams wouldn't be zero, just close to it, and all other unigrams would have their score very slightly increased by the uniform distribution.
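
A rough sketch of what that change could look like, assuming the 3.8.1 layout where KneserNeyInterpolated wires a KneserNey smoothing class into InterpolatedLanguageModel and the smoothing object exposes self.discount and self.vocab. The class names ending in "WithUniform" are my own, and the lambda(ε) factor is approximated by the raw discount rather than computed from counts, so treat this as an illustration rather than a proper patch:

from nltk.lm.models import InterpolatedLanguageModel
from nltk.lm.smoothing import KneserNey

class KneserNeyWithUniform(KneserNey):
    """Terminate the KN recursion with a uniform 1/V term (J&M Eq. 3.41)."""

    def unigram_score(self, word):
        # Continuation-count ratio as computed by the stock 3.8.1 smoothing;
        # this is 0 for words never seen in training.
        continuation = super().unigram_score(word)
        # Add a small uniform share so OOV words keep a non-zero probability.
        # NOTE: self.discount stands in for lambda(epsilon); a real fix would
        # compute lambda(epsilon) from the continuation counts and renormalise
        # so the unigram distribution still sums to 1.
        return continuation + self.discount / len(self.vocab)

class KneserNeyInterpolatedWithUniform(InterpolatedLanguageModel):
    """Same wiring as KneserNeyInterpolated, but with the modified smoothing."""

    def __init__(self, order, discount=0.1, **kwargs):
        super().__init__(
            KneserNeyWithUniform,
            order,
            params={"order": order, "discount": discount},
            **kwargs,
        )

With a model built this way, lm.perplexity(grams) on the example from the first comment should come out finite, since ('of', 'livestock') no longer gets probability 0.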
