NLP-Datasets

A list of datasets I have come across for various NLP tasks. It is a work in progress and not exhaustive.

Single Turn Datasets

(MIT Movie Corpus) The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format. [Dataset]
(MIT Restaurant Corpus) The MIT Restaurant Corpus is a semantically tagged training and test corpus in BIO format. [Dataset]
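
Both corpora above are tagged in BIO format, where `B-X` marks the first token of an entity of type `X`, `I-X` marks a continuation token, and `O` marks a token outside any entity. The following is a minimal sketch of how such tags could be grouped into labeled spans. The function name `bio_to_spans` and the label names in the usage example (`ACTOR`, `YEAR`) are hypothetical, and reading the tokens and tags out of the actual corpus files is left out because the file layout is not described here.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans.

    Hypothetical helper for illustration; not part of either corpus.
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the span collected so far, if any.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            # A stray I- tag with a different label is treated as a new span.
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" (or any unrecognized tag) closes the current span.
            flush()
            current_label = None
            current_tokens = []

    flush()
    return spans


# Usage with made-up tokens and labels:
tokens = ["show", "me", "films", "with", "drew", "barrymore", "from", "the", "1980s"]
tags = ["O", "O", "O", "O", "B-ACTOR", "I-ACTOR", "O", "O", "B-YEAR"]
print(bio_to_spans(tokens, tags))
```

Treating a stray `I-` tag as the start of a new span is one possible design choice; a stricter parser might raise an error instead.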

Multi-Turn Dialog Datasets

(Ubuntu Dialogue Corpus) The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [Paper] [Data]
(Frames) Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems [Paper] [Data]
(DSTC 2 & 3) Dialog State Tracking Challenge 2 & 3 [Paper] [Data]
(Cambridge CamRest Dataset) Conditional Generation and Snapshot Learning in Neural Dialogue Systems [Paper & Data]
(Stanford Multi-Turn Multi-Task Dataset) Task-oriented dataset covering three domains: weather information, point-of-interest (POI) navigation, and calendar scheduling [Paper & Data]
(Customer Support Tweets and Replies) A large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact. [Data]
(MultiWOZ Dataset) Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully labeled collection of human-human written conversations spanning multiple domains and topics. At 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. [Data MultiWOZ 1.0] [Data MultiWOZ 2.0]
(DailyDialog) The dialogues in this dataset reflect everyday communication and cover a variety of daily-life topics, including ordinary life, politics, finance, health, work, tourism, and relationships. The dataset also contains utterances annotated with emotion labels. [Paper] [Dataset]
(Cornell Movie Dataset) This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts. [Dataset]

Coreference Datasets

(OntoNotes Dataset 5.0) OntoNotes Release 5.0 [LDC Data] [CoNLL-2012 Data]

Miscellaneous

(Multimodal comprehension) RecipeQA [Webpage] [Paper]

sanjanalreddy/NLP-Datasets

Folders and files

Latest commit

History

README.md

README.md

Repository files navigation

NLP-Datasets

Single Turn Datasets

Multi-Turn Dialog Datasets

CoReference Datasets

Miscellaneous

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages