mariaschuld/speaker-landscapes: Contains notebooks to create speaker landscapes, or spatial representations of speakers in a debate where distance measures the similarity of their contributions.

Speaker landscapes

This repository contains code to supplement the paper "Speaker Landscapes: Machine Learning Opens a Window on the Everyday Language of Opinion".

Speaker landscapes are a kind of user embedding: a spatial representation of speakers by vectors, such that the proximity between two speaker vectors measures the similarity of their speech samples in a data corpus. The basic trick for constructing a speaker landscape is to annotate each speech sample with a token representing the speaker and then train a word embedding on the annotated data.

The main folder shows how to create and analyse speaker landscapes, using dummy datasets to highlight the kind of data structures that serve as inputs and outputs of the code. The real datasets are available on request, subject to the data owner's permission.

The two word embeddings studied in the paper can be found in the "case_studies" folder. The speaker tokens are marked by the prefix agent_.

How to train and use speaker landscapes

Environment setup

Install all packages listed in the requirements.txt file into a fresh Python environment and activate the environment, so that the notebooks have access to the packages.

1. Providing the data

Save your raw data as a JSON Lines file (.jl) in the main folder, where the notebooks are located. The file has to be named raw_data.jl and have the following structure:

{"author": "<name_speaker1>", "text": "<quote1>"}
{"author": "<name_speaker2>", "text": "<quote2>"}
...

Each line represents one text quote by an author. Authors with multiple quotes in the text corpus will appear in multiple lines.

Each line contains a JSON object (read in Python as a dictionary) with at least two string keys, "author" and "text". The expressions in angle brackets <> stand for the specific data you are providing.
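As a minimal sketch of this file format, the following snippet writes a few records as JSON Lines and reads them back while checking the two required keys. The helper names (write_jsonl, read_jsonl) and the example records are hypothetical and not part of the repository.

```python
import json

# Hypothetical example records; replace them with your own speakers and quotes.
records = [
    {"author": "alice", "text": "We need to talk about #climate!"},
    {"author": "bob", "text": "@alice I disagree."},
    {"author": "alice", "text": "Here is a second quote."},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, checking that each line has string "author" and "text" keys."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            for key in ("author", "text"):
                if not isinstance(row.get(key), str):
                    raise ValueError(f"line {line_no}: missing or non-string key {key!r}")
            rows.append(row)
    return rows
```

Calling write_jsonl("raw_data.jl", records) would produce a file in the structure shown above; note that it overwrites any existing file of that name.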

2. Pre-processing the data

Open the clean_data.ipynb notebook and run all cells in consecutive order.

The main folder now contains a new file called clean_data.txt, which is a simple text file of the form

agent_<name_speaker1> <cleaned_quote1>
agent_<name_speaker2> <cleaned_quote2>
...

The quote text is cleaned as follows:

use lower casing,
remove punctuation except # and @,
merge words into expressions of up to 4-grams when they occur in the same order more than 70 times.
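The first two cleaning steps and the speaker annotation can be sketched as follows. This is an illustration under stated assumptions, not the code from clean_data.ipynb: the function names are hypothetical, the n-gram step is omitted, and the regular expression also drops emojis and other symbols, which the notebook may handle differently.

```python
import re

SPEAKER_PREFIX = "agent_"  # prefix named in this README


def clean_quote(text):
    """Lower-case the text and drop punctuation except # and @."""
    text = text.lower()
    # Keep word characters (letters, digits, underscore), whitespace, '#' and '@';
    # replace everything else by a space.
    text = re.sub(r"[^\w\s#@]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def annotate(author, text):
    """Build one line of the training file: speaker token, then the cleaned quote."""
    # Assumption: spaces in author names are replaced so the token stays one word.
    token = SPEAKER_PREFIX + author.strip().replace(" ", "_")
    return f"{token} {clean_quote(text)}"


print(annotate("alice", "We NEED to talk about #climate, @bob!"))
# agent_alice we need to talk about #climate @bob
```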

3. Training the word embedding

Open the train_embedding.ipynb notebook and run all cells in consecutive order.

The main folder now contains a new file called word_embedding.emb. This stores the embedding (i.e. a mapping from words to vectors) in gensim's KeyedVectors format.

Note: Since we use multiple workers for training, each training run may result in a slightly different embedding, even though the random seeds in the notebook are fixed.
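To illustrate the input side of this step, the sketch below reads a cleaned text file into a list of token lists, which is the sentence format that word embedding trainers such as gensim's Word2Vec accept. The function name read_sentences is hypothetical; the notebook may load the data differently.

```python
def read_sentences(path):
    """Return each non-empty line of the cleaned file as a list of tokens.

    The first token of every line is the speaker token (agent_<name>), so the
    speaker is trained like any other word in the quote.
    """
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                sentences.append(tokens)
    return sentences
```

A list (rather than a one-shot generator) is returned because training typically iterates over the corpus several times. Regarding the note above, the gensim documentation states that a fully reproducible run additionally requires a single worker thread and a fixed PYTHONHASHSEED environment variable.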

4. Extract information for the speaker landscape

The speaker landscape consists of all word-vector pairs in the trained embedding whose word is prefixed with "agent_". These words are the speaker tokens, while the vectors are their coordinates in a 250-dimensional space. To visualise the landscape, one can reduce the 250 dimensions to 2.
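The selection of speaker tokens can be sketched with a plain dictionary standing in for the trained embedding. The function name speaker_vectors and the toy vectors are hypothetical; the real vectors have 250 dimensions.

```python
def speaker_vectors(embedding, prefix="agent_"):
    """Return {speaker_name: vector} for every word that starts with the prefix."""
    return {
        word[len(prefix):]: vector
        for word, vector in embedding.items()
        if word.startswith(prefix)
    }


# Toy 3-dimensional stand-in for a trained embedding.
toy_embedding = {
    "agent_alice": [0.1, 0.4, -0.2],
    "agent_bob": [0.3, -0.1, 0.5],
    "#climate": [0.0, 0.2, 0.1],
}

landscape = speaker_vectors(toy_embedding)
print(sorted(landscape))  # ['alice', 'bob']
```

With a gensim 4 KeyedVectors object, the same filter could be applied to the words in its index_to_key list instead of the dictionary keys.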

Open the extract_landscape.ipynb notebook and run all cells in consecutive order.

This creates a new file, landscape_info.pkl, which stores a pandas DataFrame with each author, their quotes, their vector representation and their reduced vector representation.

Another file, annotations_info.pkl, contains the word, vector, and reduced vector representation of the desired annotation words, which are selected from the embedding's vocabulary.

5. Plot and analyse the speaker landscape

The analyse_landscape.ipynb notebook shows techniques used in the main paper to analyse and visualise the landscape, including:

Plotting an annotated landscape
Projecting onto a list of anger words
Counting the tweet length
Counting emojis
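How the notebook projects speakers onto the anger words is not spelled out here. One common way to do such a projection, shown purely as an assumption, is to score each speaker by the cosine similarity between the speaker vector and the mean vector of the word list. All names and toy vectors below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def anger_score(speaker_vec, anger_vecs):
    """Assumed projection: similarity between a speaker and the mean anger-word vector."""
    return cosine(speaker_vec, mean_vector(anger_vecs))


# Toy 2-dimensional vectors; in practice these come from the trained embedding.
anger_vecs = [[1.0, 0.0], [1.0, 0.0]]
aligned_speaker = [2.0, 0.0]     # points in the same direction as the anger words
orthogonal_speaker = [0.0, 3.0]  # points in an unrelated direction

print(anger_score(aligned_speaker, anger_vecs))     # 1.0
print(anger_score(orthogonal_speaker, anger_vecs))  # 0.0
```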
