issues related to CHILDES Corpus Reader #3079

Closed
Arthur-Kan opened this issue Dec 3, 2022 · 5 comments

@Arthur-Kan

Arthur-Kan commented Dec 3, 2022

Greetings. I am working on a child language project and would like to use the CHILDES Corpus Reader package to analyze children's language data. However, the reader's methods return empty results. I am testing with the XML version of the Valian corpus, which can be downloaded from https://childes.talkbank.org/data-xml/Eng-NA/.

Here's the code I used:

import nltk
from nltk.corpus.reader import CHILDESCorpusReader
valian = CHILDESCorpusReader('./Valian', '.*.xml')
valian.fileids()

#print words
valian.words('./Valian/01a.xml')

#print sentences
valian.sents('./Valian/01a.xml')

#print MLU
valian.MLU('./Valian/01a.xml')

Here is what the output for words, sentences and MLU looks like:

>>> valian.words('/01a.xml')
[]
>>> valian.sents('/01a.xml')
[]
>>> valian.MLU('/01a.xml')
[0]
>>> 

Thank you very much for your help!!

@tomaarsen
Member

Hello @Arthur-Kan,

I'm struggling a bit to reproduce these issues, and I think the reason is that the fileids that you must specify to valian.words, valian.sents or valian.MLU are very specific, i.e. they must correspond exactly with the output of valian.fileids().

To debug that, I'd like to ask whether this script works for you:

from nltk.corpus.reader import CHILDESCorpusReader
valian = CHILDESCorpusReader('./Valian', '.*.xml')

fileids = valian.fileids()

#print words
print(valian.words(fileids[0]))

#print sentences
print(valian.sents(fileids[0]))

#print MLU
print(valian.MLU(fileids[0]))

If not, could you also specify your NLTK version, so I can figure out if it's version specific? I mention this, because somewhat recently we had some issues with the CHILDES corpus not parsing correctly (#2997).

  • Tom Aarsen
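
Tom's point about fileids can be checked without touching the corpus itself. The sketch below is a hypothetical helper (`match_fileid` is not part of NLTK) that compares a path such as './Valian/01a.xml' against a list like the one returned by valian.fileids(). It assumes fileids are relative to the corpus root and use forward slashes. It is only a diagnostic aid; a matching fileid does not rule out other causes of empty output.

```python
import posixpath


def match_fileid(requested, fileids):
    """Return the entry of `fileids` that `requested` refers to, or None.

    Hypothetical helper, not an NLTK function. `fileids` is assumed to be
    a list of root-relative paths such as the output of reader.fileids().
    """
    # An exact match needs no further work.
    if requested in fileids:
        return requested

    # Drop a leading './' or '/' so that './Valian/01a.xml' and '/01a.xml'
    # can be compared with root-relative fileids.
    wanted = posixpath.normpath(requested).lstrip("/")

    # Accept a fileid that equals the request or forms its trailing components.
    candidates = [
        fileid
        for fileid in fileids
        if wanted == fileid or wanted.endswith("/" + fileid)
    ]

    # Refuse to guess when the request is ambiguous or unknown.
    return candidates[0] if len(candidates) == 1 else None


if __name__ == "__main__":
    known = ["01a.xml", "01b.xml"]
    for path in ["./Valian/01a.xml", "/01a.xml", "02a.xml"]:
        print(path, "->", match_fileid(path, known))
```

Passing the returned value, rather than the original path, to valian.words, valian.sents or valian.MLU keeps the argument in the same form as the entries of valian.fileids().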

@Arthur-Kan
Author

Hello Tom,

Thank you for your response. My NLTK version is 3.7. I have just tried the code you suggested, and the output is still the same. I tried other methods as well, such as tagged_words() and tagged_sents(); they all return an empty list, except for MLU, which returns a zero. I have asked my friends to try the same code, as well as the script you suggested, and neither works for them either. Could you please take a look? Thank you so much for your help!

>>> print(valian.words(fileids[0]))
[]
>>> print(valian.sents(fileids[0]))
[]
>>> print(valian.MLU(fileids[0]))
[0]

Best,
Arthur

@tomaarsen
Member

Hey @Arthur-Kan,

NLTK 3.7 is the most recent release, although we've been working on various fixes since then on the develop branch. As it turns out, the fix for #2997 was introduced after NLTK 3.7 was published. In other words, this bug was introduced just before 3.7, and its fix hasn't made it into a full release yet.

We plan to create a new release some time in the next week, and until then you can use the develop branch, for example with

pip install git+https://github.com/nltk/nltk.git

  • Tom Aarsen

@Arthur-Kan
Author

Hello Tom,

Thank you very much for the information. I think I can wait for the new release next week. May I ask whether the release will be announced on the NLTK website, or how else I can tell when it is out?

Looking forward to the updated version, thank you for your help!

Best wishes,
Arthur

@tomaarsen
Member

Hello @Arthur-Kan,

The newest update, NLTK 3.8, is out now! See the Release Notes on the website or the ChangeLog on the repo for more information on the release. I think I can close this now, as it should be fixed. Let us know if you still experience issues!

  • Tom Aarsen
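
For anyone landing here later, a quick way to tell whether an installed NLTK predates the release mentioned above is to compare version strings. This is a minimal sketch: `version_tuple` and `has_childes_fix` are hypothetical helper names, and the 3.8 threshold is an assumption taken from this thread, which says the issue "should be fixed" in that release.

```python
import re


def version_tuple(version):
    """Turn a string such as '3.8.1' into (3, 8, 1).

    Parsing stops at the first component that does not start with a digit,
    and any non-numeric suffix on a component is ignored.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_childes_fix(version, fixed_in="3.8"):
    """Return True if `version` is at least `fixed_in`.

    The default threshold is an assumption based on this thread, not on
    an authoritative changelog lookup.
    """
    return version_tuple(version) >= version_tuple(fixed_in)


if __name__ == "__main__":
    for candidate in ["3.7", "3.8", "3.8.1"]:
        print(candidate, has_childes_fix(candidate))
```

With NLTK installed, the string to pass in is nltk.__version__; if the check returns False, upgrading with pip install --upgrade nltk is the usual next step.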
