
Classify Coherence

An attempt to corrupt and classify sentences from the Penn Discourse Treebank as either coherent or incoherent.

Getting Started

To get started, you will need access to the Penn Discourse Treebank (PDTB); the CLaC lab has the data. Once you have it, add a PDTB relations-XX-XX-XX-{dev | train | test}.json file to the /data directory, and update the value of relations_json in generate_sentences.py (declared around line 10) to the name of that file.
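For reference, that assignment looks something like this (the path below is a placeholder; keep the actual XX-XX-XX portion from your file):

```python
# generate_sentences.py, around line 10
relations_json = "data/relations-XX-XX-XX-train.json"  # your PDTB file
```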

To train with the Google News word2vec embeddings, you will need to download them (available here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM) and unzip them into the /data/model directory.
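If you want to sanity-check the download, the embeddings can be loaded with gensim. This is purely illustrative; gensim is not necessarily a dependency of this project:

```python
# Quick sanity check of the unzipped Google News embeddings.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "data/model/GoogleNews-vectors-negative300.bin", binary=True
)
print(vectors["coherent"].shape)  # (300,)
```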

Functionality breakdown

Below are brief explanations of what each Python file does and how it interacts with the other files.

generate_sentences.py

This script takes a PDTB .json file from the /data directory and creates the coherent and incoherent datasets for our model from it. It first reads the .json file and extracts the relevant values, then creates the various datasets (identified in the comments) in .json format.
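As a rough sketch of the extraction step: the relations files contain one JSON object per line (field names below follow the CoNLL shared-task format that the relations-XX-XX-XX naming suggests), and the incoherence strategy shown, pairing an Arg1 with an Arg2 from a different relation, is just one illustrative choice:

```python
import json
import random

def load_relations(path):
    """Read a PDTB relations file (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

relations = load_relations("data/relations-XX-XX-XX-train.json")

# Coherent examples: Arg1/Arg2 pairs as they occur in the corpus.
coherent = [(r["Arg1"]["RawText"], r["Arg2"]["RawText"]) for r in relations]

# Incoherent examples (one possible strategy): pair each Arg1 with an
# Arg2 drawn from a different, randomly chosen relation.
shuffled = random.sample(relations, len(relations))
incoherent = [(r["Arg1"]["RawText"], s["Arg2"]["RawText"])
              for r, s in zip(relations, shuffled)]
```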

preprocess.py

This file takes the resulting .json files from generate_sentences.py and converts them to .txt. It also calculates corpus-wide statistics, such as the number of terms, the dictionary, and the maximum sentence length.
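The corpus statistics it reports can be computed along these lines (a minimal sketch; the real script also writes the .txt datasets):

```python
from collections import Counter

def corpus_stats(sentences):
    """Compute the statistics preprocess.py records: number of terms,
    the dictionary (vocabulary), and the maximum sentence length."""
    tokens = [s.split() for s in sentences]
    dictionary = Counter(t for toks in tokens for t in toks)
    return {
        "num_terms": sum(dictionary.values()),
        "vocabulary_size": len(dictionary),
        "max_sentence_length": max(len(toks) for toks in tokens),
    }
```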

train.py

This file is where we run our convolutional neural network. It defines flags for the various hyperparameters, loads the data from the specified files, and transforms it into the format needed by the network. It then runs the training loop, saving intermediate results and printing relevant information to the screen.
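Schematically, the flags-then-training-loop pattern looks like this. This is a sketch only: the flag names are illustrative, and train_data and train_step are hypothetical placeholders for the data loaded from /data/txt and one optimization step on the CNN:

```python
import argparse
import random

# Hyperparameter flags (names here are illustrative, not train.py's exact ones).
parser = argparse.ArgumentParser()
parser.add_argument("--embedding_dim", type=int, default=300)
parser.add_argument("--filter_sizes", type=str, default="3,4,5")
parser.add_argument("--num_filters", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--num_epochs", type=int, default=10)
args = parser.parse_args()

def batches(data, batch_size):
    """Yield shuffled mini-batches from a list of examples."""
    data = random.sample(data, len(data))
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# train_data and train_step are hypothetical placeholders.
for epoch in range(args.num_epochs):
    for batch in batches(train_data, args.batch_size):
        loss = train_step(batch)
    print(f"epoch {epoch}: loss {loss:.4f}")  # checkpoints land in /runs
```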

cnn.py

This class implements the underlying logic of our convolutional neural network: it creates the actual network, connects the layers, implements the convolutions, etc.
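For orientation, the standard sentence-classification CNN architecture (embedding, parallel convolutions over several n-gram widths, max-pooling, softmax) can be sketched in modern tf.keras. The repository's own TensorFlow code predates this API, so treat the sketch as illustrative rather than a transcription of cnn.py:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_text_cnn(vocab_size, max_len, embedding_dim=300,
                   filter_sizes=(3, 4, 5), num_filters=128, num_classes=2):
    """A Kim-style sentence CNN: embed, convolve with several filter
    widths in parallel, max-pool each, concatenate, classify."""
    inputs = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embedding_dim)(inputs)
    pooled = []
    for size in filter_sizes:
        conv = layers.Conv1D(num_filters, size, activation="relu")(x)
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    merged = layers.Concatenate()(pooled)
    merged = layers.Dropout(0.5)(merged)
    outputs = layers.Dense(num_classes, activation="softmax")(merged)
    return tf.keras.Model(inputs, outputs)
```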

randomize_words.py

This file was used to randomize the Arg2 of our training data, using several gamma values that specify the probability of each word being swapped with another.
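The corruption step can be pictured like this. The sketch assumes "swapped" means replacement with a random vocabulary word rather than a position exchange within the sentence:

```python
import random

def randomize(tokens, vocabulary, gamma):
    """Replace each token with a random vocabulary word with
    probability gamma (higher gamma -> more corruption)."""
    return [random.choice(vocabulary) if random.random() < gamma else t
            for t in tokens]

# e.g. randomize("the market fell sharply".split(), vocab, gamma=0.3)
```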

crowdflower_sampling.py

This file randomly chooses a fixed number of samples from each of our datasets to get them annotated on Crowdflower.
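The sampling itself is essentially the following (a sketch; the sample size is an arbitrary example, not the project's actual number):

```python
import random

SAMPLES_PER_DATASET = 100  # illustrative

def sample_for_annotation(datasets):
    """Pick the same fixed number of examples from each named dataset."""
    return {name: random.sample(examples, SAMPLES_PER_DATASET)
            for name, examples in datasets.items()}
```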

connective_middle_frequency.py

This file looks at which connectives in our data are more likely to appear between Arg1 and Arg2. It was not used for our experiments, but remains in the code because it would be useful for future experiments with unannotated data.
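Conceptually this is a frequency count over connective positions (a sketch; field names follow the CoNLL-style PDTB JSON assumed earlier):

```python
from collections import Counter

def middle_connective_counts(relations):
    """Count, per connective, how often its span falls between the end
    of Arg1 and the start of Arg2, judged by character offsets."""
    counts = Counter()
    for r in relations:
        spans = r["Connective"].get("CharacterSpanList", [])
        if not spans:  # implicit relations have no connective span
            continue
        conn_start = spans[0][0]
        arg1_end = r["Arg1"]["CharacterSpanList"][-1][1]
        arg2_start = r["Arg2"]["CharacterSpanList"][0][0]
        if arg1_end <= conn_start <= arg2_start:
            counts[r["Connective"]["RawText"].lower()] += 1
    return counts
```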

Folder structure breakdown

/ Main folder; contains the Python files, as well as the README and the vocabulary

/crowdflower_data Data uploaded to Crowdflower for manual annotations

/data Where we store the various datasets used in the project. The files directly in this folder are the PDTB files, along with data about the corpus (corpus_stats.txt, dictionary.txt)

/data/json .json files generated by generate_sentences.py

/data/model word2vec Google News model (3GB in size, available for download at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM)

/data/random Randomized datasets with different values of gamma, output of randomize_words.py

/data/rt-polaritydata Rotten Tomatoes reviews, used to evaluate our model

/data/txt Input data for train.py constructed from the PDTB data

/runs Model parameters saved by TensorFlow after each run

Detailed Report

A detailed report of this project can be found at https://www.overleaf.com/read/ngfcbdxkcgby
