
KneserNeyInterpolated has a problem with OOV words during testing and perplexity is always inf #3211

Open
nilinykh opened this issue Dec 7, 2023 · 7 comments


@nilinykh

nilinykh commented Dec 7, 2023

nltk 3.8.1

I am training and testing a language model on my corpus of sentences using KneserNeyInterpolated.
It looks like the nltk implementation of this smoothing algorithm does not know what to do with out-of-vocabulary words: during testing, the model's perplexity on them is infinity. The problem only appears in specific situations; details are below.

Training:

from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

train_input, train_vocab = padded_everygram_pipeline(order=2, text=train_tokens)
lm = KneserNeyInterpolated(order=2)
lm.fit(train_input, train_vocab)

Then I test it on bigrams (following some suggestions from #3065)

from nltk.util import bigrams
from nltk.lm.preprocessing import pad_both_ends
grams = list(bigrams(pad_both_ends('a motorcycle sits parked across from a herd of livestock'.split(), n=2)))

print(grams)
print(lm.entropy(grams))
print(lm.perplexity(grams))

And the result is

[('<s>', 'a'), ('a', 'motorcycle'), ('motorcycle', 'sits'), ('sits', 'parked'), ('parked', 'across'), ('across', 'from'), ('from', 'a'), ('a', 'herd'), ('herd', 'of'), ('of', 'livestock'), ('livestock', '</s>')]
inf
inf

I have looked at the model's perplexity for each of these bigrams and found that for ('of', 'livestock') the perplexity is inf, but for ('livestock', '</s>') the perplexity is a finite number. The problematic word (I suspect it's 'livestock') has not been seen during training: lm.vocab['livestock'] is 0.

I tested another OOV word ('radiator') and observed exactly the same behaviour: perplexity is inf when 'radiator' is the second word in the bigram (the first word can be anything), but when 'radiator' is the first word, perplexity is a finite number regardless of what the second word is. Why would such behaviour occur?
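
A check along these lines (a sketch, not necessarily the exact code used here) makes the per-bigram scores visible; a logscore of -inf means the model assigns probability 0 to that bigram:

# Score each bigram separately to find the one that drives perplexity to inf.
for context_word, word in grams:
    print((context_word, word), lm.logscore(word, [context_word]))

# Confirm whether the suspect word was seen during training.
print(lm.vocab['livestock'])         # 0, i.e. unseen
print(lm.vocab.lookup('livestock'))  # OOV words are mapped to '<UNK>'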

MLE and Laplace work fine; the problem does not occur with them.

What could be wrong?

@Higgs32584

I will look into this. I get the same error as well.

@Higgs32584

I tried some things and this is what I got. The second set of numbers is MLE.

# none out of context
1.3831080559223603
0.4679138720730883
inf
inf

# first bigram position out of context
1.7481417052901929
0.8058221352008101
inf
inf

# second bigram position out of context
inf
inf
inf
inf

# all out of context
inf
inf
inf
inf

I am not sure whether out-of-context words have a specific way they are supposed to be handled. Can you do some more testing to figure out whether this is a math issue or something else, and maybe provide expected and actual outcomes? From what I can tell, it is hard to say what the expected behaviour of KneserNeyInterpolated for OOV words is supposed to be.
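
In case someone wants to reproduce this kind of comparison, here is a minimal sketch (the toy corpus and test bigrams are stand-ins of my own, not the data behind the numbers above): it trains KneserNeyInterpolated and MLE on the same text and then scores bigrams with an OOV word in different positions.

from nltk.lm import MLE, KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

train_tokens = [
    "a motorcycle sits parked across from a herd of cows".split(),
    "a man rides a motorcycle down the road".split(),
]

def make_lm(model_cls):
    # padded_everygram_pipeline returns one-shot generators, so rebuild them per model.
    train_input, train_vocab = padded_everygram_pipeline(2, train_tokens)
    lm = model_cls(order=2)
    lm.fit(train_input, train_vocab)
    return lm

kn, mle = make_lm(KneserNeyInterpolated), make_lm(MLE)

cases = {
    "none OOV": [("a", "motorcycle"), ("motorcycle", "sits")],
    "OOV in first position": [("livestock", "sits")],
    "OOV in second position": [("of", "livestock")],
}
for name, test_bigrams in cases.items():
    print(name, kn.perplexity(test_bigrams), mle.perplexity(test_bigrams))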

@Higgs32584

This also might be related to issue #2727

@jakbtg

jakbtg commented Jan 29, 2024

Hello, I had the same issue. I solved it by downgrading NLTK to version 3.6.1. It seems there is a bug in version 3.8.1, because the exact same code works on version 3.6.1.

@nilinykh
Author

nilinykh commented Jan 30, 2024

I can confirm that this error does not appear when using nltk 3.6.1.

What differs between 3.6.1 and 3.8.1 is the implementation of Kneser-Ney: in 3.8.1, discounting is implemented together with continuation counts, which matter because lower- and higher-order n-grams are treated differently, e.g. P_continuation in Eq. 3.38 of Jurafsky and Martin.

I believe this is an improvement to the Kneser-Ney algorithm, so it makes sense to have it: uni- and bigrams should have a different discount factor than higher-order n-grams.
The code in 3.6.1 does not implement this intuition: it uses a fixed discount factor for all n-grams.

Does this mean the problem is in the implementation of the continuation counts? If so, then the code in 3.6.1 performs absolute discounting rather than Kneser-Ney discounting, so it would be incorrect to use the code from nltk==3.6.1 and treat it as the Kneser-Ney method, because it is not complete. The code in 3.8.1 looks more complete, but something is clearly wrong in how the continuation counts are calculated.
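
For reference, this is the bigram form of interpolated Kneser-Ney as given in Jurafsky and Martin (equation numbering follows the 3rd-edition draft and may vary between versions):

P_{KN}(w_i \mid w_{i-1}) = \frac{\max\left(C(w_{i-1} w_i) - d,\; 0\right)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\mathrm{CONTINUATION}}(w_i)

P_{\mathrm{CONTINUATION}}(w) = \frac{\left|\{\, w' : C(w' w) > 0 \,\}\right|}{\left|\{\, (u', w') : C(u' w') > 0 \,\}\right|}

\lambda(w_{i-1}) = \frac{d}{C(w_{i-1})} \left|\{\, w : C(w_{i-1} w) > 0 \,\}\right|

Here d is the discount, and the continuation counts in P_CONTINUATION are exactly what 3.8.1 adds over the fixed-discount scheme in 3.6.1.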

@nilinykh
Author

nilinykh commented Jan 30, 2024

I am taking some of my words back: nltk 3.8.1 seems to correctly implement Kneser-Ney with discounting. These are the alpha and gamma parts of the implementation, which correspond to the original formula in Jurafsky and Martin.

However, it does not seem to handle OOV words during testing - this is where the problem occurs. Jurafsky and Martin discuss this issue and a possible solution in Eq. 3.42.
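
That equation (numbering may differ between drafts of the book) terminates the Kneser-Ney recursion by interpolating the unigram continuation probability with the uniform distribution:

P_{KN}(w) = \frac{\max\left(c_{KN}(w) - d,\; 0\right)}{\sum_{w'} c_{KN}(w')} + \lambda(\varepsilon)\, \frac{1}{V}

where c_{KN} is the continuation count, \varepsilon is the empty context, and V is the vocabulary size, so even an unseen word keeps the probability mass \lambda(\varepsilon)/V.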

Can someone point to the part of the code which implements that?

@Kadorath

Can someone point to the part of the code which implements that?

I'm not super familiar with this, but it looks to me like the unknown-word handling described by Jurafsky was captured in 3.6.1 by the fixed 1.0/vocab_length term in unigram_score. In the current version, unigram_score no longer includes that term, so unknown words, which of course have a continuation count of 0, get a score of 0.

I think it might be that, in addition to dividing the continuation count by the total count, it should also add the lambda(ε) * 1/V term as seen in Jurafsky 3.41? That way unknown unigrams wouldn't be zero, just close to it, and all other unigrams would have their score very slightly increased by the uniform distribution.
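
A rough sketch of what that change could look like, assuming the 3.8.1 layout where KneserNeyInterpolated wires a KneserNey smoothing class into InterpolatedLanguageModel and the smoothing object exposes self.discount and self.vocab. The class names ending in "WithUniform" are my own, and the lambda(ε) factor is approximated by the raw discount rather than computed from counts, so treat this as an illustration rather than a proper patch:

from nltk.lm.models import InterpolatedLanguageModel
from nltk.lm.smoothing import KneserNey

class KneserNeyWithUniform(KneserNey):
    """Terminate the KN recursion with a uniform 1/V term (J&M Eq. 3.41)."""

    def unigram_score(self, word):
        # Continuation-count ratio as computed by the stock 3.8.1 smoothing;
        # this is 0 for words never seen in training.
        continuation = super().unigram_score(word)
        # Add a small uniform share so OOV words keep a non-zero probability.
        # NOTE: self.discount stands in for lambda(epsilon); a real fix would
        # compute lambda(epsilon) from the continuation counts and renormalise
        # so the unigram distribution still sums to 1.
        return continuation + self.discount / len(self.vocab)

class KneserNeyInterpolatedWithUniform(InterpolatedLanguageModel):
    """Same wiring as KneserNeyInterpolated, but with the modified smoothing."""

    def __init__(self, order, discount=0.1, **kwargs):
        super().__init__(
            KneserNeyWithUniform,
            order,
            params={"order": order, "discount": discount},
            **kwargs,
        )

With a model built this way, lm.perplexity(grams) on the example from the first comment should come out finite, since ('of', 'livestock') no longer gets probability 0.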
