Open Access Reference corpus #756

Open
jmccrae opened this issue Sep 16, 2021 · 2 comments
Labels
documentation More documentation is needed or there are errors in the documentation help wanted Extra attention is needed

Comments

@jmccrae
Member

jmccrae commented Sep 16, 2021

The current guidelines for new synsets state that the lemma must have at least 100 occurrences in Sketch Engine's TenTen corpus.

https://github.com/globalwordnet/english-wordnet/blob/master/NEW_SYNSETS.md

This corpus is only accessible to paying Sketch Engine customers, so it does not really fit with our open-source goals. We should update the guideline to use an open-access corpus, such as the American National Corpus.

Any suggestions?

@jmccrae jmccrae added help wanted Extra attention is needed documentation More documentation is needed or there are errors in the documentation labels Sep 16, 2021
@jmccrae jmccrae added this to the 2022 Release milestone Sep 16, 2021
@jmccrae jmccrae changed the title Reference corpus Open Access Reference corpus Sep 16, 2021
@jmccrae jmccrae removed this from the 2022 Release milestone Aug 16, 2022
@arademaker
Member

  1. EWT and OntoNotes from https://github.com/propbank/propbank-release
  2. English UD corpora

@jmccrae
Member Author

jmccrae commented May 26, 2023

Thanks @arademaker. Both of those corpora are quite small, and I don't think they would suit our needs.

@fcbond has suggested using the COCA corpus, and I think it seems quite suitable.
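
For illustration only, the frequency criterion discussed in this thread (at least 100 occurrences of the lemma) could be checked against any openly available corpus roughly as follows. This is a minimal sketch, not the project's tooling: it assumes the corpus is available in CoNLL-U format (as the English UD corpora are), that counting is done on the LEMMA column, and that matching is case-insensitive. The names `count_lemmas`, `meets_threshold`, and `MIN_OCCURRENCES` are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Threshold quoted in the issue; everything else here is an assumption.
MIN_OCCURRENCES = 100


def count_lemmas(conllu_lines: Iterable[str]) -> Counter:
    """Count lemma occurrences in CoNLL-U formatted lines."""
    counts: Counter = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A CoNLL-U token line has exactly 10 tab-separated fields.
        if len(fields) != 10:
            continue
        token_id, lemma = fields[0], fields[2]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        # Assumption: lemmas are compared case-insensitively.
        counts[lemma.lower()] += 1
    return counts


def meets_threshold(lemma: str, counts: Counter,
                    minimum: int = MIN_OCCURRENCES) -> bool:
    """Return True if the lemma occurs at least `minimum` times."""
    return counts[lemma.lower()] >= minimum
```

With a real corpus, `count_lemmas` could be given an open file handle, since it only iterates over lines. Whether a corpus is large enough for a 100-occurrence cutoff to be meaningful is a separate question, as noted above for the smaller corpora.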


2 participants