- Website
- Latest Release: v0.8 (2022-10-02)
- License: MIT License
Polyleven is a Pythonic Levenshtein distance library that:
- Is fast regardless of input type, so it can be used for both short inputs (like English words) and long inputs (like DNA sequences).
- Is not covered by a restrictive license such as the GPL, so it can be used freely in private code.
- Supports Python 3.x.
The official package is available on PyPI:
$ pip install polyleven
Polyleven provides a single interface function, levenshtein(). You can use this function to measure the similarity of two strings.
>>> from polyleven import levenshtein
>>> levenshtein('aaa', 'ccc')
3
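If you only care about distances under a certain threshold, you can pass the maximum threshold as the third argument. When the actual distance exceeds the threshold, the result is capped, as in the second call below, which returns 2 rather than 3.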
>>> levenshtein('acc', 'ccc', 1)
1
>>> levenshtein('aaa', 'ccc', 1)
2
In general, you can gain a noticeable speed boost with threshold k < 3.
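The threshold behaviour in the examples above can be sketched in pure Python. This is only an illustrative reference, not Polyleven's implementation: the name `bounded_levenshtein` is hypothetical, the "cap at k + 1" rule is inferred from the example output, and this version gains no speed from the threshold.

```python
def bounded_levenshtein(a, b, k=None):
    """Reference edit distance; capped at k + 1 when a threshold k is given.

    Hypothetical helper for illustration only. It mirrors the outputs shown
    in the examples above but is far slower than an optimised library.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    dist = prev[-1]
    if k is not None and dist > k:
        return k + 1  # assumed capping rule, inferred from the example output
    return dist


print(bounded_levenshtein('aaa', 'ccc'))     # 3
print(bounded_levenshtein('acc', 'ccc', 1))  # 1
print(bounded_levenshtein('aaa', 'ccc', 1))  # 2
```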
To compare Polyleven with other Pythonic edit distance libraries, a million word pairs were generated from SCOWL.
Each library was timed on how long it takes to evaluate all of these pairs. The following table summarises the results:
Function Name | TIME[sec] | SPEED[pairs/s] |
---|---|---|
edlib | | |
editdistance | | |
jellyfish.levenshtein_distance | | |
distance.levenshtein | | |
Levenshtein.distance | | |
polyleven.levenshtein | | |
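A measurement of this kind could be scripted roughly as follows. This is a minimal sketch, not the script used for the table: `benchmark` and `placeholder_distance` are hypothetical names, and the tiny hard-coded pair list stands in for the SCOWL word pairs.

```python
import time


def benchmark(func, pairs):
    """Return (elapsed seconds, pairs per second) for func over all pairs."""
    start = time.perf_counter()
    for a, b in pairs:
        func(a, b)
    elapsed = time.perf_counter() - start
    speed = len(pairs) / elapsed if elapsed > 0 else float("inf")
    return elapsed, speed


def placeholder_distance(a, b):
    # Stand-in callable so the sketch runs without third-party packages.
    # Swap in a real function, e.g. polyleven.levenshtein, to measure it.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Assumed sample data; the original benchmark used a million SCOWL pairs.
pairs = [("apple", "ample"), ("kitten", "sitting"), ("flaw", "lawn")] * 1000

elapsed, speed = benchmark(placeholder_distance, pairs)
print(f"TIME[sec]={elapsed:.3f}  SPEED[pairs/s]={speed:,.0f}")
```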
To evaluate the efficiency for longer inputs, I created 5000 pairs of random strings of size 16, 32, 64, 128, 256, 512 and 1024.
Each library was timed on how fast it can process these entries. [1]
Library | N=16 | N=32 | N=64 | N=128 | N=256 | N=512 | N=1024 |
---|---|---|---|---|---|---|---|
edlib | 0.040 | 0.063 | 0.094 | 0.205 | 0.432 | 0.908 | |
editdistance | 0.027 | 0.049 | 0.086 | 0.178 | 0.336 | 0.740 | 58.139 |
jellyfish | 0.009 | 0.032 | 0.118 | 0.470 | 1.874 | 8.877 | 42.848 |
distance | 0.007 | 0.029 | 0.109 | 0.431 | 1.726 | 6.950 | 27.998 |
Levenshtein | 0.006 | 0.022 | 0.085 | 0.336 | 1.328 | 5.286 | 21.097 |
polyleven | 0.003 | 0.005 | 0.010 | 0.043 | 0.149 | 0.550 | |
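Test data of the shape described above (5000 pairs of random strings per size) might be generated along these lines. The helper name `random_pairs`, the lowercase ASCII alphabet, and the fixed seed are all assumptions; the original does not say how the strings were drawn.

```python
import random
import string

# Sizes taken from the table above.
SIZES = [16, 32, 64, 128, 256, 512, 1024]


def random_pairs(n, count=5000, alphabet=string.ascii_lowercase, seed=0):
    """Return `count` pairs of random strings, each of length n.

    Hypothetical helper: the alphabet and seed are assumptions chosen so
    that repeated runs produce the same data.
    """
    rng = random.Random(seed)

    def rand_str():
        return "".join(rng.choices(alphabet, k=n))

    return [(rand_str(), rand_str()) for _ in range(count)]


# One data set per size; each can then be fed to the library under test.
datasets = {n: random_pairs(n) for n in SIZES}
print({n: len(p) for n, p in datasets.items()})
```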
Library | Version | URL |
---|---|---|
edlib | v1.2.1 | https://github.com/Martinsos/edlib |
editdistance | v0.4 | https://github.com/aflc/editdistance |
jellyfish | v0.5.6 | https://github.com/jamesturk/jellyfish |
distance | v0.1.3 | https://github.com/doukremt/distance |
Levenshtein | v0.12 | https://github.com/ztane/python-Levenshtein |
polyleven | v0.3 | https://github.com/fujimotos/polyleven |
[1] Measured using Python 3.5.3 on Debian Jessie with an Intel Core i3-4010U (1.70GHz).