nltk.lm.api entropy formula source? #2961

Closed
mbauwens opened this issue Mar 11, 2022 · 18 comments · Fixed by #3229

mbauwens (Contributor) commented Mar 11, 2022

Hi! I've been experimenting with training and testing a standard trigram language model on my own dataset. Upon investigating the entropy method of the LM class, I was a bit confused. The docs give only a very brief description: "Calculate cross-entropy of model for given evaluation text."

It seems to me that this entropy measure just averages the log probabilities of the ngrams (trigrams in my case) and makes the result positive by multiplying by -1:

def _mean(items):
    """Return average (aka mean) for sequence of items."""
    return sum(items) / len(items)

def entropy(self, text_ngrams):
    """Calculate cross-entropy of model for given evaluation text.

    :param Iterable(tuple(str)) text_ngrams: A sequence of ngram tuples.
    :rtype: float

    """
    return -1 * _mean(
        [self.logscore(ngram[-1], ngram[:-1]) for ngram in text_ngrams]
    )
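
For context, this is roughly how I'm calling it. This is a minimal, self-contained sketch on a toy corpus rather than my actual dataset (I use Laplace smoothing here just so unseen trigrams don't produce -inf logscores):

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

# toy training corpus: a list of tokenised sentences
train_sents = [["a", "b", "c"], ["a", "c", "b", "a"]]
train_data, vocab = padded_everygram_pipeline(3, train_sents)

lm = Laplace(3)  # add-one smoothing so unseen trigrams don't give -inf logscores
lm.fit(train_data, vocab)

# evaluation text as padded trigram tuples
test_sent = ["a", "b", "a"]
test_trigrams = list(ngrams(pad_both_ends(test_sent, n=3), 3))

print(lm.entropy(test_trigrams))     # average negative log2 probability
print(lm.perplexity(test_trigrams))  # 2 ** entropy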

So my general question is: Where does this formula for entropy stem from? Is there any paper referencing the method? I'm just a bit stuck with the different versions of entropy that exist and I don't know which one is used here (and therefore I don't know how to interpret it correctly).

Other formulas of entropy/perplexity

The formula I know is Shannon entropy, which in Python would be:

def shannon_entropy(self, text_ngrams):
    return -1 * sum(
        [self.score(ngram[-1], ngram[:-1]) * self.logscore(ngram[-1], ngram[:-1]) for ngram in text_ngrams]
    )

And there's also the perplexity formula of Jurafsky, which returns a different score than lm.perplexity (which is 2**entropy):

from numpy import prod

def jurafsky_perplexity(self, text_ngrams):
    problist = [self.score(ngram[-1], ngram[:-1]) for ngram in text_ngrams]
    return pow(prod(problist), -1/len(problist))

Thanks!

PS: My apologies if this issue lacks some information - I don't often open issues on GitHub 😅 Let me know if I need to update something!

UPDATE: my Python implementation of the Jurafsky perplexity was totally wrong because I had written it too quickly. I've updated it to reflect the actual scores from the web lecture.

mbauwens changed the title from "nltk.lm.api entropy formula?" to "nltk.lm.api entropy formula source?" on Mar 11, 2022
tomaarsen (Member) commented:

@mbauwens Hello!

I dug into the code a little bit, and the git blame reported that the entropy calculation was last updated in this commit: 9e7989e

The commit info is:

 Settling on entropy implementation, small simplifications along the way
- Entropy calculation in line with JBG's implementation
- Simpler model tests without mixin
- Got rid of unnecessary local variables

Some quick searching indicates that JBG might refer to Jordan Boyd-Graber, who has taught about Language Models at the University of Maryland. That said, I wasn't able to find a specific paper, nor am I certain that it indeed refers to Boyd-Graber.

We would be best off just asking @iliakur, the author of the NLTK language model package.

mbauwens (Contributor, Author) commented:

> Some quick searching indicates that JBG might refer to Jordan Boyd-Graber, who has taught about Language Models at the University of Maryland. That said, I wasn't able to find a specific paper, nor am I certain that it indeed refers to Boyd-Graber.

Thanks for the quick reply! Ok, interesting. After a quick Google search, it does seem that Jordan Boyd-Graber (if that's who "JBG" refers to) also references the Shannon entropy formula (see slide 16, second formula), which makes it all the more 'perplexing'... :)

Given that it's hard to find a source for this formula, and that general theory (as far as I know) usually references Shannon entropy in the domain of information theory/communication/language, it might not be a bad idea to switch to the better-documented formula? Unless there are clear theoretical arguments in favour of the formula that's used right now, of course.

Looking forward to hearing @iliakur's thoughts!

iliakur (Contributor) commented Mar 15, 2022

Hi all! I'm currently swamped, will respond to this in the course of this week. @mbauwens thanks for the detailed issue 🙏

mbauwens (Contributor, Author) commented Mar 16, 2022

Small update: my formula for perplexity based on Jurafsky (in the original post) was completely wrong, so I corrected it. I dug a little deeper by comparing Shannon (entropy and perplexity) with the NLTK formulas (entropy and perplexity) as well as Jurafsky's implementation of perplexity.

I've written a little program so you can easily compare the outcome of the formulas by supplying the main uncertainty() function with a list of probabilities. You can find it in this public gist or below. I've tested it in Python 3.8 and it only uses numpy for some simple helper functions (log2, product of a list, mean). The post is pretty long with the code included, so let me know if it's better to remove it from this post and just refer to the gist.

Conclusions

  • Jurafsky perplexity is the same as NLTK perplexity when computed on small probability lists. A mathematician would probably have spotted this sooner 😅 With a very large random list (see the last example), Jurafsky perplexity just becomes inf; I don't know what the threshold is. (See the sanity check right after this list.)
  • All measures for entropy and perplexity return the same result (ignoring floating-point accuracy issues) for a list of equal probabilities (which is not the general use case for n-gram probabilities), whether the list is small or very large. For a list of unequal probabilities, Shannon entropy and perplexity give different results because they weight each score by multiplying it with its probability (cf. this intuitive interpretation). Shannon entropy and perplexity always indicate less uncertainty than the NLTK/Jurafsky version. Differences in perplexity are amplified considerably for larger probability lists due to the exponential nature of perplexity.
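
A quick sanity check of the first point (my own scratch code, separate from the full comparison script right below): the Jurafsky product form and the NLTK 2**entropy form are algebraically the same, since prod(p) ** (-1/N) equals 2 ** (-mean(log2(p))); the product just underflows to 0.0 for very long probability lists, which is presumably where the inf comes from.

from numpy import log2, prod, mean

probs = [0.1, 0.25, 0.35, 0.15, 0.15]
# Jurafsky-style perplexity: (product of probabilities) ** (-1/N)
print(pow(prod(probs), -1 / len(probs)))              # ~5.5101
# NLTK-style perplexity: 2 ** (negative mean of log2 probabilities)
print(pow(2.0, -1 * mean([log2(p) for p in probs])))  # ~5.5101
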
from numpy import log2, prod, mean, ones, random  # ones and random are needed for the last (Dirichlet) example

def shannon_entropy(problist):
    """Shannon Entropy: negative sum over all probabilities*log2_probabilities
    https://en.wikipedia.org/wiki/Entropy_(information_theory)
    """
    return -1 * sum([prob * log2(prob) for prob in problist])

def nltk_entropy(problist):
    """From nltk.lm: negative average log2_probabilities
    https://www.nltk.org/api/nltk.lm.api.html#nltk.lm.api.LanguageModel.entropy
    """
    return -1 * mean([log2(prob) for prob in problist])

def perplexity(entropy):
    """From nltk.lm: 2**entropy
    This is also a general formula for perplexity, not specific to NLTK
    https://www.nltk.org/api/nltk.lm.api.html#nltk.lm.api.LanguageModel.perplexity
    """
    return pow(2.0, entropy)

def jurafsky_perplexity(problist):
    """From Jurafsky, Stanford NLP: product of all probabilities** -1/count_of_probabilities
    https://www.youtube.com/watch?v=NCyCkgMLRiY
    """
    probs = [prob for prob in problist]
    return pow(prod(probs), -1/len(probs))

def uncertainty(problist):
    prob_sum = sum(problist)  # just to be sure our total probability mass == 1
    shan_ent = shannon_entropy(problist)
    shan_prplx = perplexity(shan_ent)
    nltk_ent = nltk_entropy(problist)
    nltk_prplx = perplexity(nltk_ent)
    juraf_prplx = jurafsky_perplexity(problist)
    
    print(
        f"Probability sum: {prob_sum}\n\n"
        f"Shannon entropy: {shan_ent}\n"
        f"Shannon perplexity: {shan_prplx}\n\n"
        f"NLTK entropy: {nltk_ent}\n"
        f"NLTK perplexity: {nltk_prplx}\n\n"
        f"Jurafsky perplexity: {juraf_prplx}"
    )

# some examples with equal probability
# coin toss
uncertainty([0.5]*2)
# Probability sum: 1.0
#
# Shannon entropy: 1.0
# Shannon perplexity: 2.0
#
# NLTK entropy: 1.0
# NLTK perplexity: 2.0
#
# Jurafsky perplexity: 2.0

uncertainty([0.1]*10) # float accuracy issues do occur
# Probability sum: 0.9999999999999999
#
# Shannon entropy: 3.321928094887362
# Shannon perplexity: 9.999999999999998
#
# NLTK entropy: 3.3219280948873626
# NLTK perplexity: 10.000000000000002
#
# Jurafsky perplexity: 10.0

# some examples with UNequal probability
# 'weighted' coin toss
uncertainty([0.7, 0.3])
# Probability sum: 1.0
#
# Shannon entropy: 0.8812908992306927
# Shannon perplexity: 1.8420227750373133
#
# NLTK entropy: 1.1257693834979823
# NLTK perplexity: 2.182178902359924
#
# Jurafsky perplexity: 2.182178902359924

uncertainty([0.1, 0.25, 0.35, 0.15, 0.15])
# Probability sum: 1.0
# 
# Shannon entropy: 2.1833830982290134
# Shannon perplexity: 4.542174393148615
# 
# NLTK entropy: 2.4620864912099063
# NLTK perplexity: 5.5101305142271135
# 
# Jurafsky perplexity: 5.510130514227115

# tested with a very large, random probability list
uncertainty(random.dirichlet(ones(5000),size=1).tolist()[0])
# Probability sum: 0.9999999999999974
# 
# Shannon entropy: 11.690225456458114
# Shannon perplexity: 3304.5210178220477
# 
# NLTK entropy: 13.104676557913615
# NLTK perplexity: 8808.475027062655
# 
# Jurafsky perplexity: inf

iliakur self-assigned this Mar 17, 2022
iliakur (Contributor) commented Mar 22, 2022

I'm still a bit tight on time, but I can at least provide some leads. I can confirm Tom's hunch: the entropy/perplexity formula is taken verbatim from Jordan Boyd-Graber. He was kind enough to share a script in this comment (the lm.txt file). I'm sure it's the same as the Jurafsky formula.

From scanning your comment, @mbauwens, it seems you've arrived at something similar independently. Does this mean all is well, or should I do some more digging?

mbauwens (Contributor, Author) commented Mar 24, 2022

Well, this cleared up where the formula is from, yes! In that sense, the issue can be closed.

The only 'issue' I still have is that this entropy/perplexity formula is apparently less well documented, whereas the Shannon entropy formula is more commonly found. I don't know which one is 'better' (I assume they're just different), but future users might benefit from an update. So, not sure if it's worth it, but it might be nice to include a reference to Jordan Boyd-Graber/Jurafsky in the entropy method description and/or to add shannon_entropy as a separate method?

  • add a reference to JBG/Jurafsky to the entropy and perplexity method descriptions
  • add shannon_entropy as a new method (a rough sketch follows below this list)
  • update documentation to reflect these changes (but this isn't done via GitHub, I assume)
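
If it helps the discussion, here's a very rough sketch of what such a method could look like on LanguageModel (entirely my own draft, mirroring the snippet from my first post; the name and docstring wording are placeholders):

def shannon_entropy(self, text_ngrams):
    """Calculate Shannon entropy of model for given evaluation text.

    Computes -sum(p(w|context) * log2 p(w|context)) over the ngrams.
    See https://en.wikipedia.org/wiki/Entropy_(information_theory)

    :param Iterable(tuple(str)) text_ngrams: A sequence of ngram tuples.
    :rtype: float
    """
    return -1 * sum(
        self.score(ngram[-1], ngram[:-1]) * self.logscore(ngram[-1], ngram[:-1])
        for ngram in text_ngrams
    )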

I'm pretty new to contributing to open-source projects, but the how-to is pretty clear and the edit is very straightforward, so I'd be okay with suggesting it via a PR - if you agree with the statement above. If you don't believe this is necessary for NLTK core and it's something users should add themselves (as I now did in my project), that's also fine. 🙂

Either way, thanks for having taken the time to look into my question 👍

iliakur (Contributor) commented Mar 25, 2022

👍 from me for your plan. Adding methods and docs is generally not bad imo; we should just make sure we take this opportunity to think about the relevant interfaces. Ping me when you make the PR and I'll gladly do the review, or ping me if you need help in the process of making the PR too :)

Thanks for your work on this!

tomaarsen (Member) commented:

> update documentation to reflect these changes (but this isn't done via GitHub, I assume)

This is done via GitHub, actually. The API Reference pages, such as this page, are automatically generated from the comments in https://github.com/nltk/nltk/blob/develop/nltk/lm/api.py. You can change the comments in the code, and after a website push, the site will be updated.

Similarly, the Example Usage pages, such as this page, are generated from doctest files like this one: https://github.com/nltk/nltk/blob/develop/nltk/test/lm.doctest. That said, some of these doctest files are really just tests, while others actually explain how to use the sections.

mbauwens (Contributor, Author) commented:

Will do, @iliakur!
Thanks for the tips, @tomaarsen!

I'm actually investigating some variants of entropy because I've come across some practical issues/limitations when using the measure. Would it be okay if I suggested some optional parameters for the entropy function so that the method allows for more flexibility? I'm planning to test whether the parameters actually improve the quality of the measure and, if so, I'm going to document it thoroughly. Still a work in progress, but I do plan to finish it by 17 June (right on time to possibly present it at the CLIN 32 conference in Tilburg).

mbauwens (Contributor, Author) commented Apr 1, 2022

Update: I have found the reasoning behind the negative average log probability as a measure of entropy (the Shannon-McMillan-Breiman theorem). It's actually mentioned in chapter 3 of Speech and Language Processing (Jurafsky and Martin, 2021). I probably overlooked it because it was mainly explained using mathematical formulas (and I'm a mere computational linguist 😃). I'll include this information in my suggested PR as it does clarify some things.
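
Roughly, as I understand the chapter (my own paraphrase, so take the exact notation with a grain of salt): for a long enough sample, the entropy rate of a language can be approximated by the per-word negative log probability, and with the chain rule under a trigram model that is exactly the average that nltk computes:

H(L) \approx -\frac{1}{N} \log_2 p(w_1 w_2 \dots w_N)
           = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i \mid w_{i-2}, w_{i-1})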

iliakur (Contributor) commented Apr 3, 2022

@mbauwens optional parameters should be ok as long as their default values result in existing behavior. The idea being that folks who aren't passing these parameters explicitly should not have their code break all of a sudden :)

mbauwens (Contributor, Author) commented Apr 3, 2022

> @mbauwens optional parameters should be ok as long as their default values result in existing behavior. The idea being that folks who aren't passing these parameters explicitly should not have their code break all of a sudden :)

Ok, that was also what I was planning 😃 Do you prefer a new shannon_entropy() method, or a parameter in the existing entropy method that defines the type of entropy used?

iliakur (Contributor) commented Apr 3, 2022

Could you outline the two alternatives a bit more? Just the method signatures should be enough to help us decide between the two.

mbauwens (Contributor, Author) commented:

Sorry @iliakur for the late reply! I was a bit busy and still had to try out some things. I've just presented the alternatives at Computational Linguistics in The Netherlands (CLIN32) and wrote a demo in a GitHub repo: https://github.com/mbauwens/clin32-entropy

Basically, the different variations of entropy would be:

  • Entropy (as in Shannon-McMillan-Breiman): the current entropy in nltk
  • Shannon entropy (negative sum of prob x log prob). The name shannon_entropy() would be confusing though, since the current entropy is also a derivation of Shannon entropy.
    • Length normalised (divided by the information length)
    • Relative frequency (the probability is weighted by the relative frequency of the trigram, and if tokens are present, they are weighted by the minimal relative frequency). This version of entropy would require a frequency distribution as a parameter.

With the final length-normalised, relative-frequency-weighted variant, I did seem to get promising results.

I think it would be most elegant if the functions were separated, but it would be more efficient and compact to have a single function with parameters.
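
To make that concrete at the level of signatures (hypothetical names and defaults on my side, nothing settled yet):

# Alternative 1: keep entropy as-is and add a separate method on LanguageModel
def entropy(self, text_ngrams): ...
def shannon_entropy(self, text_ngrams, length_normalised=False, freq_dist=None): ...

# Alternative 2: a single method with a selector, defaulting to current behaviour
def entropy(self, text_ngrams, variant="average_logprob",
            length_normalised=False, freq_dist=None): ...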

iliakur (Contributor) commented Jan 19, 2024

@mbauwens should I close this issue?

nilinykh commented:

This might not be a question directly related to the discussion here, but perhaps it's a good place (and time) to ask it: what is the theoretical and methodological support for using the currently implemented formula for computing entropy, i.e. the one that does not have p(x) in it, but only log probs? I am trying to motivate the NLTK implementation, as it has been criticised by some of my colleagues for not being correct; the correct one, they argue, would be Shannon's definition. Any references or hints on this topic?

iliakur (Contributor) commented Jan 23, 2024

> This might not be a question directly related to the discussion here, but perhaps it's a good place (and time) to ask it: what is the theoretical and methodological support for using the currently implemented formula for computing entropy, i.e. the one that does not have p(x) in it, but only log probs? I am trying to motivate the NLTK implementation, as it has been criticised by some of my colleagues for not being correct; the correct one, they argue, would be Shannon's definition. Any references or hints on this topic?

Based on this comment I disagree with the claim that the current implementation is incorrect. The comment clearly states that the two measures we're discussing are both variations of entropy. In fact, Shannon also had a hand in the one that's currently implemented.

To address your request for more support for the current version, see this comment. Basically, the current implementation is based on the Jurafsky book and a well-known NLP researcher's code.

mbauwens (Contributor, Author) commented:

> This might not be a question directly related to the discussion here, but perhaps it's a good place (and time) to ask it: what is the theoretical and methodological support for using the currently implemented formula for computing entropy, i.e. the one that does not have p(x) in it, but only log probs? I am trying to motivate the NLTK implementation, as it has been criticised by some of my colleagues for not being correct; the correct one, they argue, would be Shannon's definition. Any references or hints on this topic?
>
> Based on this comment I disagree with the claim that the current implementation is incorrect. The comment clearly states that the two measures we're discussing are both variations of entropy. In fact, Shannon also had a hand in the one that's currently implemented.
>
> To address your request for more support for the current version, see this comment. Basically, the current implementation is based on the Jurafsky book and a well-known NLP researcher's code.

I agree that this is not an incorrect implementation, but just a specific implementation.

> @mbauwens should I close this issue?

My bad for the long time between creating this issue and closing it! I've added PR #3229, which reflects the outcome of this discussion in the documentation. Because of different priorities, I haven't added the new functionality I proposed in this thread. This PR should close this issue!

tomaarsen linked a pull request Jan 29, 2024 that will close this issue