issues related to CHILDES Corpus Reader #3079

Arthur-Kan · 2022-12-03T18:07:58Z

Greetings. I am working on a child language project and would like to use the CHILDES Corpus Reader package to analyze children's language data. However, the methods do not output anything. I am trying with the Valian Corpus in the XML version (the XML version of the Valian corpus can be downloaded from https://childes.talkbank.org/data-xml/Eng-NA/).

Here's the code I used:

import nltk
from nltk.corpus.reader import CHILDESCorpusReader
valian = CHILDESCorpusReader('./Valian', '.*.xml')
valian.fileids()

#print words
valian.words('./Valian/01a.xml')

#print sentences
valian.sents('./Valian/01a.xml')

#print MLU
valian.MLU('./Valian/01a.xml')

Here is what the output for words, sentences and MLU looks like:

>>> valian.words('/01a.xml')
[]
>>> valian.sents('/01a.xml')
[]
>>> valian.MLU('/01a.xml')
[0]
>>>

Thank you very much for your help!!


tomaarsen · 2022-12-05T09:07:17Z

Hello @Arthur-Kan,

I'm struggling a bit to reproduce these issues, and I think the reason is that the fileids that you must specify to valian.words, valian.sents or valian.MLU are very specific, i.e. they must correspond exactly with the output of valian.fileids().

To debug that, I'd like to ask whether this script works for you?

from nltk.corpus.reader import CHILDESCorpusReader
valian = CHILDESCorpusReader('./Valian', '.*.xml')

fileids = valian.fileids()

#print words
print(valian.words(fileids[0]))

#print sentences
print(valian.sents(fileids[0]))

#print MLU
print(valian.MLU(fileids[0]))

If not, could you also specify your NLTK version, so I can figure out if it's version specific? I mention this, because somewhat recently we had some issues with the CHILDES corpus not parsing correctly (#2997).
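On the fileid point above, here is a minimal sketch of how a requested path could be checked against the list returned by valian.fileids(). It assumes that the fileids are paths relative to the corpus root. The helper name match_fileid and the sample list are hypothetical and are not part of NLTK. This only addresses the fileid mismatch and does nothing about the parsing issue from #2997.

```python
import posixpath


def match_fileid(requested, known_fileids):
    """Return the entry of known_fileids that `requested` refers to, or None.

    Hypothetical helper, not part of NLTK. It assumes fileids are
    paths relative to the corpus root, using forward slashes.
    """
    # Normalise separators, drop "./" segments and any leading slash.
    cleaned = posixpath.normpath(requested.replace("\\", "/")).lstrip("/")
    if cleaned in known_fileids:
        return cleaned
    # Otherwise accept a fileid only if it is the unique path suffix,
    # e.g. when the corpus root directory was included by mistake.
    candidates = [f for f in known_fileids if cleaned.endswith("/" + f)]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical stand-in for the output of valian.fileids().
sample_fileids = ["01a.xml", "01b.xml"]

for path in ("./Valian/01a.xml", "/01a.xml", "02a.xml"):
    print(path, "->", match_fileid(path, sample_fileids))
```

With a real reader, the second argument would be valian.fileids(), and a result of None would mean the path does not correspond to any file the reader knows about.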

Tom Aarsen

Arthur-Kan · 2022-12-08T00:37:46Z

Hello Tom,

Thank you for your response. My NLTK version is 3.7. I have just tried the code you suggested, and the output is still the same. I tried other methods as well, such as tagged_words() and tagged_sents(); they all return an empty list, except for MLU, which returns [0]. I have asked my friends to try the same code, as well as the script you suggested, and none of it seems to work for them either. Could you please take a look? Thank you so much for your help!

>>> print(valian.words(fileids[0]))
[]
>>> print(valian.sents(fileids[0]))
[]
>>> print(valian.MLU(fileids[0]))
[0]

Best,
Arthur

tomaarsen · 2022-12-08T09:09:08Z

Hey @Arthur-Kan,

NLTK 3.7 is the most recent release, although we've been working on various fixes since then on the develop branch. As it turns out, the fix for #2997 was introduced after NLTK 3.7 was published. In other words, this is a bug that was introduced just before 3.7, and the fix hasn't been pushed to a full release yet.

We plan to create a new release some time in the next week, and until then you can use the develop branch, for example with

pip install git+https://github.com/nltk/nltk.git

Tom Aarsen

Arthur-Kan · 2022-12-10T23:31:39Z

Hello Tom,

Thank you very much for the information. I think I can wait until the new release next week. May I ask whether the release will be announced on the NLTK website, or how else I could tell when it is out?

Looking forward to the updated version, thank you for your help!

Best wishes,
Arthur

tomaarsen · 2022-12-12T18:01:33Z

Hello @Arthur-Kan,

The newest update, NLTK 3.8, is out now! See the Release Notes on the website or the ChangeLog on the repo for more information on the release. I think I can close this now, as it should be fixed. Let us know if you still experience issues!
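For anyone who wants to confirm which version they ended up with after upgrading, here is a small sketch that reads the installed NLTK version and compares it against 3.8. It uses only the standard library. The helper names version_tuple and has_childes_fix are hypothetical, and the comparison looks only at the leading numeric part of the version string, so a pre-release such as 3.8rc1 would count as 3.8.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Return the leading numeric components, e.g. '3.8.1' -> (3, 8, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def has_childes_fix(version):
    """Hypothetical check: True if the version is 3.8 or later."""
    return version_tuple(version) >= (3, 8)


try:
    installed = metadata.version("nltk")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("NLTK is not installed in this environment")
elif has_childes_fix(installed):
    print(f"NLTK {installed} is 3.8 or later")
else:
    print(f"NLTK {installed} predates 3.8; consider upgrading")
```

An install from the develop branch may still report an older version number, so this check is mainly useful for installs from a regular release.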

Tom Aarsen

tomaarsen added the corpus label Dec 5, 2022

tomaarsen closed this as completed Dec 12, 2022
