FM-Index - Compressed full-text Index

A simple C++ based FM-Index [1] implementation using RRR [4] wavelet trees [5] which allows to build a full-text index over a given text T of size n supporting the following operations:

count(P,m) : count the number of occurences of pattern P of size m in T.
locate(P,m) : locate the text positions of all occurences of P of size m in T.
extract(A,B) : extract T[A,B] from the index.
recover() : recover T from the index.

The constructed index uses nH_k + o(n log sigma) bits of space [3] which is roughly the size of the compressed representation of T and can perform the above operations without the need to store T. An empirical evaluation of the index is sown in the Benchmark section below.

Drawbacks of the FM-Index is long construction time and high memory requirements during construction.

Usage

Compiling the Index

make

Building an Index

./fmbuild alice29.txt alice29.txt.fm

Builds and writes the FM-Index alice29.txt.fm.

Running count() queries

./fmcount -i alice29.txt.fm alice.qry

The queries are stored in a new line seperated file:

the
house
keep
Alice
and

The index returns the number of occurrences for each query:

./fmcount -i alice29.txt.fm alice.qry
the : 2101
house : 20
keep : 11
Alice : 395
and : 880

Running locate() queries

./fmlocate -i alice29.txt.fm alice.qry

The index returns a sorted list of the locations of all occurences for each query:

./fmlocate -i alice29.txt.fm alice.qry
Read 3 queries
keep (11) : 46385 51125 69491 74680 81562 83046 104830 105180 133621 149966 151623
poison (3) : 8151 8619 8731
tomorrow (1) : 63637

Running extract() queries

./fmextract -i alice29.txt.fm alice.extract

The queries are stored in a new line seperated file:

118 147
1213 1245
24 55

The index returns the extracted text snippet for each query:

./fmextract -i alice29.txt.fm alice.extract
118 - 147 : 'THE MILLENNIUM FULCRUM EDITION'
1213 - 1245 : 'TOOK A WATCH OUT OF ITS WAISTCOAT'
24 - 55 : 'ALICE'S ADVENTURES IN WONDERLAND'

Recover the original text from the index

./fmrecover -i alice29.txt.fm

The index outputs the original text to stdout.

Verbose output

The -v command line parameter enables verbose messages:

./fmbuild -v alice29.txt
building index.
- remapping alphabet.
- creating cumulative counts C[].
- performing bwt.
- sample SA locations.
- creating bwt output.
- create RRR wavelet tree over bwt.
build FM-Index done. (0.101 sec)
space usage:
- remap_reverse: 75 bytes (0.07%)
- C: 1028 bytes (0.90%)
- Suffixes: 9508 bytes (8.31%)
- Positions: 9512 bytes (8.31%)
- Sampled: 7948 bytes (6.95%)
- T_bwt: 86088 bytes (75.23%)
input Size n = 152090 bytes
index Size = 114431 bytes (0.75 n)
writing FM Index to file 'alice29.txt.fm'

Using the FM-Index to provide full-text search on a given text `T`

A small code example illustrating the use of the FM class:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "FM.h"

int
main(int argc,char** argv) {
	uint8_t* T = /* read text */
	uint32_t n = /* sizeof T */
	FM* = new FM(T,n);
	if(FM) {
		/* count the occurences of 'house' in T */
		uint32_t cnt = FM->count("house",strlen("house"));
		
		/* get all locations offsets of 'house' in T */
		uint32_t matches; /* number of matches */
		uint32_t* locs; /* list of location offsets */
		locs = FM->locate("house",strlen("house"),&matches);
		
		/* extract text snippet from T */
		uint8_t* snippet = FM->extract(5,251);
		
		/* recover T from index */
		uint32_t nnew; /* size of Tnew */
		uint8_t* Tnew = FM->reconstructText(&nnew);
		
		delete FM;
	}
}

Compile with g++ -o test main.cpp FM.cpp util.c libcds.a libdivsufsort.a

Benchmarks

Experiments are similar to the ones described in [1]. All experiments were run on a Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz using 4GB of RAM. The default samplerate s=64 was used in all experiments.

Test data

Test data was taken from the The Pizza&Chili Site and TREC.

File	Description	Alphabet size	Entropy (bps)
wsj	English Text taken from the TREC wsj collection	90	4.60
src	Concatenated source code (.c,.h,.C,.java) of linux-2.6.11.6 and gcc-4.0.0	230	5.47
proteins	Sequence of newline-separated protein sequences	25	4.20
dna	Gene DNA sequences	4	1.97
xml	XML that provides bibliographic info on compsci pubs(dblp)	96	5.26

Construction

Peak memory usage was measured using the valgrind --tool=massif tool. Running time was measured using the gettimeofday() system call.

File	Size [MB]	Time [sec]	Memory [MB]
wsj	3	2.157	20.02
wsj	6	4.487	39.73
wsj	12	9.293	79.15
wsj	25	19.028	158.0
wsj	50	38.699	315.7
wsj	100	78.287	631.2
wsj	200	158.341	1233
src	50	39.909	315.7
src	100	80.489	631.2
src	200	163.072	1233
proteins	50	35.374	315.7
proteins	100	72.824	631.2
proteins	200	146.173	1233
dna	50	25.941	315.7
dna	100	53.287	631.2
dna	200	109.449	1233
xml	50	36.601	315.7
xml	100	74.104	631.2
xml	200	150.050	1233

Construction time increases linearly O(n) with size n of T. Memory requirement is roughly 6n independent of the file type.

File	Size [MB]	Index Size [MB]	Index Size [relative to file size]
wsj	25	15	0.6
wsj	50	29	0.58
wsj	100	57	0.57
wsj	200	112	0.56
src	50	30	0.77
src	100	59	0.77
src	200	117	0.75
proteins	50	36	0.72
proteins	100	71	0.71
proteins	200	137	0.685
dna	50	24	0.48
dna	100	48	0.48
dna	200	95	0.475
xml	50	25	0.50
xml	100	50	0.50
xml	200	99	0.495

Index size depends on the compressability of T. For each specific file type, as n increases, the size of the auxillary data becomes less significant which leads to better overall compression ratio.

Count

50000 random patterns for each length m=5 to 50 were extracted from each 200MB file.

The graph shows the time in seconds it took the FM-Index to answer all 50000 queries on different file types. Note that the query time grows linearly O(m) with the pattern length. Files with smaller alphabet show faster query times overall.

Also note that the query time (m=20) does not depend on the size n of the text T:

File	Size [MB]	Query Time [sec]
wsj	3	13.37
wsj	6	11.96
wsj	12	12.74
wsj	25	12.74
wsj	50	12.07
wsj	100	14.59
wsj	200	14.34

Overall, searching for 50000 patterns in T using a FM-Index is fast but requieres considerable amount of upfront work (time+space).

Locate

k queries pf length 5 are randomly selected from T so they roughly amount to a certain number of total occurences over all queries.

Note the running time increases linearly with the number of occurrences. Similar to count(), files with small alphabet size perform better most likely due to decreased height of the wavelet tree. Comparing the running times of both count() and locate() we observe that locate() is significantly slower than count() as patterns (especially short ones), on large text collections, tend to have many occurrences.

The samplerate s (default = 64) determines how often the underlying suffix array is sampled to enable faster locate() calls. For different samplerates the index achieves different time and space trade-offs during the locate() operation:

The graph above shows the time required to retrieve the locations of 50000 occurrences compared to the size of the FM-Index for different suffix samplerates. Smaller samplerate decrease the locate() running time but the space used by the index increases. For s=8 the index is 1.5 times the sizes of the original text. For s=512, that is only every 512th position in the underlying suffix array is sampled, the index is less than 50% of the original text. The above graph shows that the best time-space trade-off is achieved for samplerates between 64 and 128.

Extract

Extracting snippets of T from the index shows the same properties as the locate() operation. Like locate(), the running time of extract() grows linearly with the size of the size of the snippet. The running time is also influenced by the samplerate s which provides the same time and space trade-offs described above.

Libraries

The following libraries are needed to use the index. Sourcecode for both libaries is included.

libcds -- succinct low level data structures
libdivsufsort -- fast suffix sorting (requires cmake to build)

Licence

GPL v3 (see LICENCE file)

References

Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
Francisco Claude and Gonzalo Navarro. Practical Rank/Select Queries over Arbitrary Sequences. SPIRE'08 176-187, 2008.
Veli M"akinen and Gonzalo Navarro. Implicit Compression Boosting with Applications to Self-Indexing. SPIRE'07 214-226.
R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. SODA'02, 233-242.
R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. SODA'03, 841-850.
The Pizza&Chili Site.

Author

Matthias Petri Matthias.Petri@gmail.com see github.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
benchmark		benchmark
libcds		libcds
libdivsufsort		libdivsufsort
FM.cpp		FM.cpp
FM.h		FM.h
LICENCE		LICENCE
Makefile		Makefile
README.md		README.md
fmbuild.cpp		fmbuild.cpp
fmcount.cpp		fmcount.cpp
fmextract.cpp		fmextract.cpp
fmlocate.cpp		fmlocate.cpp
fmrecover.cpp		fmrecover.cpp
util.c		util.c
util.h		util.h

License

mpetri/FM-Index

Folders and files

Latest commit

History

Repository files navigation