sanjanalreddy/NLP-Datasets

NLP-Datasets

A list of datasets I have come across for various NLP tasks. It is a work in progress and not exhaustive.

Single Turn Datasets

  1. (MIT Movie Corpus) The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format. [Dataset]
  2. (MIT Restaurant Corpus) The MIT Restaurant Corpus is a semantically tagged training and test corpus in BIO format. [Dataset]
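
Since both corpora above use BIO tags, here is a minimal sketch of how BIO-tagged tokens can be grouped into labeled spans. It assumes the tokens and tags have already been read into two parallel lists; the function name `bio_to_spans`, the example sentence, and the `DIRECTOR` label are illustrative assumptions, not taken from the corpus files.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans.

    Hypothetical helper: "B-X" opens a span of type X, "I-X" continues it,
    and "O" marks a token outside any span.
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Store the span collected so far, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # A "B-" tag, or an "I-" tag whose label does not match the
            # open span, begins a new span.
            flush()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            flush()
            current_label = None
            current_tokens = []

    flush()
    return spans


# Made-up example sentence; the label name is an assumption.
tokens = ["show", "me", "films", "directed", "by", "steven", "spielberg"]
tags = ["O", "O", "O", "O", "O", "B-DIRECTOR", "I-DIRECTOR"]
example_spans = bio_to_spans(tokens, tags)
```

Treating a stray `I-` tag as the start of a new span is one lenient design choice; a stricter reader could raise an error instead.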

Multi-Turn Dialog Datasets

  1. (Ubuntu Dialogue Corpus) The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [Paper] [Data]
  2. (Frames) Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems [Paper] [Data]
  3. (DSTC 2 & 3) Dialog State Tracking Challenge 2 & 3 [Paper] [Data]
  4. (Cambridge CamRest Dataset) Conditional Generation and Snapshot Learning in Neural Dialogue Systems [Paper & Data]
  5. (Stanford Multi-Turn Multi-Task Dataset) Task-oriented dataset covering three domains: weather information, point-of-interest (POI) navigation, and calendar scheduling [Paper & Data]
  6. (Customer Support Tweets and Replies) A large, modern corpus of tweets and replies to aid innovation in natural language understanding and conversational models, and for study of modern customer support practices and impact. [Data]
  7. (MultiWOZ Dataset) The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is a fully labeled collection of human-human written conversations spanning multiple domains and topics. At 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. [Data MultiWOZ 1.0] [Data MultiWOZ 2.0]
  8. (DailyDialog) The dialogues in this dataset reflect everyday communication and cover a range of daily-life topics, including ordinary life, politics, finance, health, work, tourism, and relationships. The dataset also contains utterances annotated with emotions. [Paper] [Dataset]
  9. (Cornell Movie Dataset) This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts. [Dataset]

Coreference Datasets

  1. (OntoNotes Dataset 5.0) OntoNotes Release 5.0 [LDC Data] [CONLL 2012 DATA]

Miscellaneous

  1. (Multimodal comprehension) RecipeQA [Webpage] [Paper]
