
Full text search

01es edited this page Jun 26, 2019 · 8 revisions

Full Text Search

Relational databases are very good at what they are designed to do, but poor at handling data that does not fit the relational model. Full text search is one such problematic area. Relational database systems provide only rudimentary support for text search, limited to wildcards and primitive normalisation (e.g. ignoring letter case during a search), which for most RDBMSs results in a full table scan. That is simply unacceptable for producing search results in a responsive manner.
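The full table scan can be observed directly. The following is a minimal sketch using SQLite from the Python standard library; the `inventory` table, its index and the sample rows are made up for illustration, and other RDBMSs report their query plans differently.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the plan description SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of every plan row.
    return " | ".join(str(row[-1]) for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_inventory_name ON inventory (name)")
conn.executemany(
    "INSERT INTO inventory (name) VALUES (?)",
    [("steel bolt M8",), ("copper pipe",), ("bolt cutter",)],
)

# An exact match can use the index to jump straight to the row.
exact_plan = query_plan(
    conn, "SELECT id FROM inventory WHERE name = ?", ("copper pipe",)
)

# A wildcard on both sides forces every row to be inspected.
wildcard_plan = query_plan(
    conn, "SELECT id FROM inventory WHERE name LIKE ?", ("%bolt%",)
)
wildcard_hits = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM inventory WHERE name LIKE ? ORDER BY id", ("%bolt%",)
    )
]

print(exact_plan)
print(wildcard_plan)
print(wildcard_hits)
```

SQLite describes the first plan as a `SEARCH` and the second as a `SCAN`. The wildcard query still returns the right rows, but only by visiting all of them.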

Therefore, a more adequate solution is needed. In addition to performing full text search quickly, such a solution must be able to integrate with an existing relational database.

Basic notations/nomenclature and approaches

A full text search mechanism performs relevancy ranking of text based on the specified search criteria. The indexed data is usually stored in an inverted index for fast retrieval. Basically, the search mechanism indexes the data, which in our case is stored in a relational database, and stores the result in a specialised data structure designed specifically for very fast lookups that correspond to text search queries.
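To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. It only maps terms to the documents that contain them and does no relevancy ranking, so it is a simplification of what an engine such as Lucene does. The function names and the sample documents are illustrative assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

documents = {
    1: "Steel bolt M8",
    2: "Copper pipe 15mm",
    3: "Bolt cutter, steel handle",
}
index = build_inverted_index(documents)

print(search(index, "steel bolt"))  # documents 1 and 3 contain both terms
print(search(index, "copper"))
print(search(index, "nickel"))
```

A lookup touches only the posting sets of the query terms, never the documents themselves, which is what makes this structure fast for text queries.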

Naturally, there should be a background process that would perform such indexing of the data from a designated relational database. This means that there would always be a discrepancy (potentially insignificant) between the actually persisted data and the searchable data.
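The discrepancy between persisted and searchable data can be sketched as follows. This is an illustration only: it assumes a hypothetical `inventory` table with an `updated_at` column that the application maintains, uses an in-memory SQLite database, and stands in a plain dictionary for the search index. Deleted rows are not handled.

```python
import sqlite3

def index_changes(conn, searchable, last_indexed_at):
    """Copy rows changed since the last run into the searchable store.

    Returns the new high-water mark to pass to the next run.
    """
    rows = conn.execute(
        "SELECT id, name, updated_at FROM inventory "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_indexed_at,),
    ).fetchall()
    for row_id, name, updated_at in rows:
        searchable[row_id] = name
        last_indexed_at = max(last_indexed_at, updated_at)
    return last_indexed_at

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES (1, 'steel bolt M8', 100)")

searchable = {}
watermark = index_changes(conn, searchable, 0)  # first indexing pass

# A new row is persisted, but the background process has not run yet.
conn.execute("INSERT INTO inventory VALUES (2, 'copper pipe', 150)")
stale_view = dict(searchable)  # row 2 is not searchable at this point

watermark = index_changes(conn, searchable, watermark)  # next pass catches up
```

Between the two passes `stale_view` lacks the second row even though it is already persisted. How often the background process runs determines how large that window is.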

These two concepts are clearly partitioned from a system design perspective, with the search mechanism complementing the relational data. For example, inventory management is done via the relational database, but inventory searching is done using a full text search mechanism. Because of the ACID properties of an RDBMS, actions such as the creation of a purchase order or a new inventory entry are processed through the relational database, while flexible searches are handled by a full text search mechanism.

Available technologies

There are many really cool and advanced algorithms to perform text relevancy ranking. However, instead of implementing something from scratch, at least at the beginning, an existing, well-proven technology such as Lucene Search should be used as the basis for incorporating full text search into the platform.

Lucene is a set of Java libraries that implement a variety of search algorithms, all listed on the referenced site. There are already products that are built on top of Lucene and provide additional services, including integration with relational databases.

Apache Solr is one such product based on the Lucene engine. From the indexing perspective, the communication link between Solr and the RDBMS can be handled in two ways: by using Solr’s DIH (Data Import Handler) delta-import feature, with a UNIX cron job periodically invoking a Solr + DIH URL to index any changes, or by pushing changes directly into Solr by POSTing updates.
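The push approach could look roughly like the sketch below, which builds (but does not send) a JSON update request for Solr's `/update` handler. The host, the core name `inventory`, the document fields and the helper name `build_solr_update` are assumptions made for illustration; a real deployment would use its own core and schema.

```python
import json
import urllib.request

def build_solr_update(base_url, core, documents, commit=True):
    """Build a POST request that adds or replaces documents in a Solr core."""
    url = f"{base_url.rstrip('/')}/solr/{core}/update"
    if commit:
        # Ask Solr to make the changes searchable immediately.
        url += "?commit=true"
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical inventory entry that has just been saved to the RDBMS.
request = build_solr_update(
    "http://localhost:8983",
    "inventory",
    [{"id": "42", "name": "steel bolt M8"}],
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Against a running Solr instance the request would be sent with `urllib.request.urlopen(request)`. Committing on every update is convenient for a sketch but is usually too expensive under load, where batched or automatic commits are the more common choice.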

To better understand the difference between Lucene (the engine) and Solr (the car), see the comprehensive discussion on this site.

Lucene capabilities

Lucene provides a comprehensive query language to specify searches. This includes Wildcard Searches, Regular Expression Searches, Fuzzy Searches and more.
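To give a feel for what these query types mean, the sketch below emulates them in plain Python over a small list of terms. It does not use Lucene and is not how Lucene implements them: wildcards are approximated with `fnmatch`, regular expressions use Python's `re` dialect rather than Lucene's, and fuzzy matching uses plain Levenshtein distance, whereas Lucene's fuzzy queries are based on Damerau-Levenshtein distance. The query strings in the comments follow Lucene's classic query parser syntax.

```python
import fnmatch
import re

def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]

def wildcard_match(terms, pattern):
    """'?' matches one character, '*' matches any run of characters."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]

def regex_match(terms, pattern):
    """Keep the terms that the pattern matches in full."""
    return [t for t in terms if re.fullmatch(pattern, t)]

def fuzzy_match(terms, word, max_edits=2):
    """Keep the terms within max_edits single-character edits of word."""
    return [t for t in terms if edit_distance(t, word) <= max_edits]

terms = ["test", "text", "tests", "boat", "moat", "roam", "foam", "roams"]

print(wildcard_match(terms, "te?t"))   # Lucene query: te?t
print(wildcard_match(terms, "test*"))  # Lucene query: test*
print(regex_match(terms, "[mb]oat"))   # Lucene query: /[mb]oat/
print(fuzzy_match(terms, "roam", 1))   # Lucene query: roam~1
```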
