Identifying Sound Correspondence Rules in Bilingual Wordlists

The project report can be found in the /doc folder.

Motivation

Identifying regular, systematic sound correspondences between the vocabularies of two or more languages is a key method of historical linguistics. Given a large set of such correspondences, it can be surmised that the languages share a common ancestor.

To establish these correspondences, we can first try to identify potential cognates -- words that share a common origin (cognates have often retained similar meanings, but not always). The next step is to establish regular correspondences between the potential cognates. The underlying sound changes may be conditioned by specific contexts only (i.e., specific neighbouring phonemes, the position of a phoneme within a word or syllable, etc.).

Example: (medial/final) /t/ : /s/ in Swedish and German

Gloss          Swedish   German
(to) eat       ɛːta      ɛsən
white          viːt      vaɪ̯s
(to) measure   mɛːta     mɛsən
(to) flow      flyːta    fliːsən
foot           fuːt      fuːs

Research questions (beyond the scope of just this one term project):

  • Can we identify sound correspondence rules between cognate languages in an unsupervised manner?
  • How well does this work? How complicated can the rule contexts be? How does such an unsupervised system compare to similar supervised methods or to non-computational approaches?

How to run the code

We adapted the structure of the imports to allow relative imports from preprocessing into tree. In order to run the scripts with this import structure, it is necessary to run them as modules from the parent directory (i.e., project-sound-correspondences), e.g.

python -m preprocessing.merge_lists deu data/deu.csv swe data/swe.csv data
python -m preprocessing.features data/deu-swe-all.csv data/ipa_numerical.csv 0.4 0.9
python -m tree.tree data/deu-swe-features.csv output
python -m test.ruletest -v
python -m evaluation.evaluation deu swe data/deu-swe-all.csv data/ipa_numerical.csv output

Method

Data:

  • see the Available data, tools, resources section below. We will pick languages from language families that we are familiar with (Germanic, Slavic, Romance).

Preprocessing:

  • construct bilingual word lists
  • encode IPA symbols as collections of values for phonetic features
    • maybe like this:
    • TSV file where more IPA characters and/or features can easily be added (first column: IPA character, subsequent columns: features)
    • class for sound instances where fields store the values for the features
      • the advantage of doing this instead of using feature dictionaries (e.g. manner['b'] = 'plosive') is that we can also use instances of this class during imputation & evaluation.
  • eliminate unlikely candidates for cognates
    • write a modified version of the Levenshtein distance that computes the replacement costs based on the sounds' phonetic features (a minimal sketch follows this list)
      • i.e., if we have n features, then each feature change costs 1/n
      • we could consider giving different weights to different features
      • for features whose values fall on a scale, we could even consider weighting individual value changes differently
    • write a method that takes list_of_pairs and threshold_value as arguments and returns the possible cognates
    • determine a good threshold (NED = 0.5?)
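
A minimal sketch of these two preprocessing ideas, assuming a hypothetical Sound class whose fields correspond to the columns of the feature TSV and a substitution cost of 1/n per differing feature. The class fields, the FEATURES list, and the function names below are illustrative, not the project's actual API:

# Illustrative sketch only: a feature-bundle class and a feature-weighted
# Levenshtein distance in which replacing one sound by another costs the
# fraction of phonetic features on which the two sounds differ.
from dataclasses import dataclass

@dataclass
class Sound:
    ipa: str
    manner: str = 'none'
    place: str = 'none'
    voiced: bool = False

FEATURES = ['manner', 'place', 'voiced']  # toy feature inventory

def feature_distance(a, b):
    """Substitution cost: each differing feature contributes 1/n."""
    diffs = sum(getattr(a, f) != getattr(b, f) for f in FEATURES)
    return diffs / len(FEATURES)

def weighted_levenshtein(source, target):
    """Edit distance over Sound sequences, normalized by the longer word."""
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = feature_distance(source[i - 1], target[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[n][m] / max(n, m)

Word pairs whose normalized distance exceeds the chosen threshold (e.g. 0.5) would then be discarded as unlikely cognates.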

The first 10 potential cognates determined by utils.get_cognates(file='data/deu-swe-all.csv', threshold=0.5):

  • # is the initial word boundary; * is the empty string (used for insertion/deletion).
  • The format is (German, Swedish, edit distance).
(['#', 'a', 'ʊ̯', 'ɡ', 'ə'], ['#', 'øː', '*', 'ɡ', 'a'], 0.29)
(['#', 'oː', '*', 'ɐ̯'], ['#', 'œː', 'r', 'a'], 0.36)
(['#', 'n', 'aː', 'z', 'ə'], ['#', 'n', 'ɛː', 's', 'a'], 0.09)
(['#', 'm', 'ʊ', 'n', 't'], ['#', 'm', 'ɵ', 'n', '*'], 0.24)
(['#', 't͡s', 'aː', 'n', '*'], ['#', 't', 'a', 'nː', 'd'], 0.27)
(['#', 't͡s', 'ʊ', 'ŋ', 'ə'], ['#', 't', 'ɵ', 'ŋː', 'a'], 0.13)
(['#', 'l', 'ɪ', 'p', 'ə'], ['#', 'l', 'ɛ', 'pː', '*'], 0.27)
(['#', 'v', 'a', 'ŋ', 'ə'], ['#', 'ɕ', 'ɪ', 'nː', 'd'], 0.33)
(['#', 'ɡ', 'ə', 'z', '*', 'ɪ', 'ç', 't', '*'], ['#', '*', 'a', 'nː', 's', 'ɪ', 'k', 't', 'ə'], 0.41)
(['#', 'h', 'aː', 'ɐ̯'], ['#', 'h', 'oː', 'r'], 0.31)

The same for utils.get_cognates(file='data/rus-ukr-all.csv', threshold=0.5):

  • (Russian, Ukrainian, edit distance)
(['#', '*', 'uˑ', 'x', 'ə'], ['#', 'w', 'u', 'x', 'ɔ'], 0.27)
(['#', 'n', 'ɔˑ', 's'], ['#', 'nʲ', 'i', 's'], 0.11)
(['#', 'r', 'ɔˑ', 't'], ['#', 'r', 'ɔ', 't'], 0.03)
(['#', 'z', 'uˑ', 'p'], ['#', 'z', 'u', 'b'], 0.06)
(['#', 'j', 'ɐ', 'z', 'ɨˑ', 'k'], ['#', 'j', 'ɑ', 'z', 'ɪ', 'k'], 0.09)
(['#', 'ɡ', 'u', 'b', 'aˑ'], ['#', 'ɦ', 'u', 'b', 'ɑ'], 0.09)
(['#', 'ʃʲː', '*', 'ɪ', 'k', 'aˑ'], ['#', 'ʂ', 'ʈ͡ʂ', 'ɔ', 'k', 'ɑ'], 0.3)
(['#', 'ɫ', 'ɔˑ', 'p'], ['#', 'l', 'ɔ', 'b'], 0.11)
(['#', 'v', 'ɔˑ', 'ɫ', 'ə', 's', '*'], ['#', 'w', 'ɔ', 'l', 'ɔ', 'sʲː', 'ɑ'], 0.3)
(['#', 'v', 'ɔˑ', 'ɫ', 'ə', 's', 'ɨ'], ['#', 'w', 'ɔ', 'l', 'ɔ', 'sʲː', 'ɑ'], 0.19)

Method (feature selection based on Wettig et al. 2012; see also the slides in the doc folder):

  • align word pairs on a symbol level
    • use the phone distances as alignment costs (instead of vanilla Needleman-Wunsch with uniform costs)
    • encode empty strings as phones
    • prefix a special word-boundary phone to each word
  • create feature- and level-based decision trees for the aligned symbols (input: source sound, output: target sound)
    • position features: identify previous vowel/consonant etc. for each symbol
    • for each phone in a word, determine corresponding phones for each position and create feature sets (e.g. sourceLang_itself_voiced=true, targetLang_prevConsonant_manner=plosive, etc.)
    • transform the (string) features into integer features that the decision tree packages can work with; this way, we can also take advantage of the implicit scales that some of the features describe (e.g. vowel height, place of consonant articulation)
    • write a method generate_instances(levels, positions, features, n_samples) that returns a matrix of size (n_samples x n_features) and a parallel list[str] containing all the feature types
    • for each level (i.e. source, target) and feature combination (e.g. target_manner), create a set of labelled feature sets
    • build one tree for each level-feature combination (see below in the Available data, tools, resources section for package options)
    • export the trees
    • export the rules
      • Perform a tree traversal to get the rules (see the sketch after this list).
      • Merge rules that describe sibling leaf nodes that predict the same class.
      • Switch back from numerical values to categorical ones.
    • Possible improvements:
      • trim the trees: play around with values for min_samples_split/min_samples_leaf, min_impurity_decrease
        • Unfortunately, it appears that sklearn does not allow us to prevent splits that result in sibling nodes predicting the same category.
      • exclude prevOrSelf_ in certain cases
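
A minimal sketch of the tree-building and rule-extraction steps, assuming sklearn's DecisionTreeClassifier on integer-encoded features; the feature names, class codes, and toy data below are invented stand-ins for one level-feature combination:

# Illustrative sketch: fit one decision tree for a single level-feature
# combination (here: predicting an integer-encoded target class) and walk
# the fitted tree to extract human-readable rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

feature_names = ['source_itself_voiced', 'source_prevConsonant_manner']  # invented
X = np.array([[1, 0], [1, 2], [0, 0], [0, 2]])  # integer-encoded feature matrix
y = np.array([3, 3, 1, 1])                      # integer-encoded target classes

clf = DecisionTreeClassifier(min_samples_leaf=1).fit(X, y)

def extract_rules(clf, node=0, conditions=()):
    """Yield (conditions, predicted class) for every leaf of the fitted tree."""
    t = clf.tree_
    if t.children_left[node] == -1:              # leaf node
        yield conditions, clf.classes_[np.argmax(t.value[node])]
        return
    name, thresh = feature_names[t.feature[node]], t.threshold[node]
    yield from extract_rules(clf, t.children_left[node],
                             conditions + (f'{name} <= {thresh:.1f}',))
    yield from extract_rules(clf, t.children_right[node],
                             conditions + (f'{name} > {thresh:.1f}',))

for conds, cls in extract_rules(clf):
    print(' AND '.join(conds) or 'TRUE', '->', cls)

Merging sibling leaves that predict the same class and mapping the integer codes back to categorical feature values would then be applied on top of these raw rules.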

Evaluation:

  • imputation and normalized edit distance: impute each target word from its source word via the induced rules and score the result against the attested target word using the modified NED/Levenshtein distance from the preprocessing step (see the sketch below)
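
A hedged sketch of that evaluation loop; apply_rules and weighted_levenshtein stand in for the project's actual imputation and distance functions and are simply passed in here:

# Illustrative sketch: mean normalized edit distance between imputed and
# attested target words over all evaluated pairs (lower is better).
def mean_imputation_ned(word_pairs, rules, apply_rules, weighted_levenshtein):
    """word_pairs: list of (source_word, target_word) sound sequences."""
    distances = [weighted_levenshtein(apply_rules(source, rules), target)
                 for source, target in word_pairs]
    return sum(distances) / len(distances)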

The following would be interesting, but too time-consuming:

  • human-readable version: transform imputed sounds into IPA symbols
    • /!\ some feature combinations are impossible and cannot be represented by IPA symbols
    • use some sort of error symbol to mark impossible feature combinations?
    • alternative: pick the closest IPA symbol (within reason; see the sketch after this list)
  • compute precision values for the generated rules (lit research!)
  • doing the literature research necessary for calculating recall/F1-score would likely be extremely time-consuming
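
A small sketch of the "closest IPA symbol" idea; the distance argument is meant to be a feature-based phone distance like the one sketched in the preprocessing section, and the cutoff value is an arbitrary placeholder:

# Illustrative sketch: map an imputed feature bundle to the nearest IPA symbol
# in a known inventory, or to an error symbol if nothing is close enough.
def closest_ipa_symbol(imputed, inventory, distance, cutoff=0.25):
    """inventory: dict mapping IPA strings to Sound objects."""
    symbol, best = min(((ipa, distance(imputed, sound))
                        for ipa, sound in inventory.items()),
                       key=lambda pair: pair[1])
    return symbol if best <= cutoff else '?'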

Notes

  • We originally considered working with PMI-based one-to-one alignments of symbols, but switched to the decision-tree-based induction of correspondence rules based on phonetic features and contextual information (similar to Wettig et al. 2012) because we think that this might help us capture more complex correspondences.
  • The encoding of phonetic features in Wettig et al. is somewhat particular to Uralic languages.
    • 2-3 different vowel lengths should be enough (3 because the Russian words include half-long sounds).
    • +/- nasalization feature for vowels (Romance languages)
    • Including stress could be informative, but that information is not included in the NorthEuraLex dataset.
    • We expanded the "secondary features of consonantal articulation" category (palatalization).
    • Time permitting, we could make the features configurable so that it would be possible to add additional features (tone, airstream etc.).
  • Our method deviates from Wettig et al.'s method in one significant way: They start with random alignments, and improve the alignments and trees in an expectation-maximization-like algorithm. We start with phonetically motivated assignments, which already yield satisfactory results, and build the trees just once.
    • We need to include a phonetic distance measure anyway to determine potential cognates within our data sets, since we do not have a cognate-only word list.
  • The positional features only capture a symbol's left-hand context. In future projects, including the right-hand context would be interesting.

Relevant literature

A decision-tree algorithm using contexts and phonological features:

A decision tree algorithm taking into account each segment's left and right contexts:

Unsupervised cognate identification (incl. identification of sound correspondences) using PMI:

This paper uses multiple characteristics for identifying cognates ("recurrent sound correspondences, phonetic similarity, and semantic affinity") and gives an overview of several existing approaches:

The following paper uses phonetic information and a Levenshtein-like algorithm for transforming a word into a translation that is a (potential) cognate:


Available data, tools, resources

Data:

  • NorthEuraLex: Johannes Dellert and Gerhard Jäger (eds.). 2017. NorthEuraLex (version 0.9).
    • a database containing wordlists for many languages spoken in Northern Eurasia. The wordlists consist of translations of more than 1,000 concepts. The words have been transcribed into IPA.

Packages for decision tree learning:

We chose the sklearn package since it allowed for the easiest integration with our other Python code. Unfortunately, none of these decision tree classifier packages allow for rule extraction, so we implemented that by hand.

Project members
