
GSoC 2022 Apache Lucene Search

Oliver Kopp edited this page Mar 26, 2022 · 1 revision

This page summarizes the main points of the GSoC 2022 project "Apache Lucene Search" (https://github.com/JabRef/www.jabref.org/blob/main/GSoC2022.md#apache-lucene-search)

  • Description: JabRef offers an extensive search function that is based on a custom search syntax. The goal is to replace the custom search syntax and grammar with Apache Lucene's search syntax. It should offer the same functionality as the existing search.
  • Skills required: Java, JavaFX (experience with Lucene is a plus)
  • Expected outcome: A functioning search that supports the same functionality as the old search. More information can be found in PR #8206.
  • Possible mentors: @koppor, @Siedlerchr, @calixtus
  • Project size: 175h (medium)

Currently,

  1. the search is slow for a large database (more than 10k entries)
  2. the search syntax is specific to JabRef rather than a widely used one such as Lucene's
  3. there are many open issues in the search itself. All issues relevant for this project can be found using the following query: https://github.com/JabRef/jabref/issues?q=is%3Aopen+label%3A%22project%3A+GSoC%22+label%3A%22search%22

The main goal of the project is to offer a fast and good search based on Lucene. One factor for that is support of non-ASCII characters: For instance, if a user searches for "Breitenbucher", "Breitenbücher" should also be matched. Be aware that Breitenbücher can be encoded in the library as a) Breitenbücher (UTF-8), b) Breitenb\"ucher (LaTeX), and maybe c) Breitenbuecher (see JabRef/jabref#6815 for details). Matching the last variant is not that easy and could be out of scope for GSoC.
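The non-ASCII requirement can be illustrated with a small sketch that does not depend on Lucene. It assumes that matching happens on a folded key: LaTeX umlaut commands (written `\"u` in LaTeX) are reduced to their base letter, the text is decomposed with Unicode NFD, and combining marks are stripped. The class `DiacriticFolding` and its method `fold` are hypothetical names for illustration, not JabRef or Lucene API; in a Lucene-based implementation, an analyzer using a folding filter such as Lucene's `ASCIIFoldingFilter` would likely take this role.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of JabRef or Lucene.
final class DiacriticFolding {

    /** Reduces a name to a lower-case key without diacritics. */
    public static String fold(String input) {
        // Step 1: replace LaTeX umlaut commands by the bare base letter.
        // Covers the spellings \"u, \"{u}, {\"u} and {\"{u}}.
        String s = input.replaceAll("\\{?\\\\\"\\{?([A-Za-z])\\}?\\}?", "$1");
        // Step 2: decompose accented characters into base letter + combining mark.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Step 3: drop all combining marks, then lower-case for comparison.
        return s.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // U+00FC is the letter u with diaeresis, escaped to stay encoding-safe.
        System.out.println(fold("Breitenb\u00fccher")); // breitenbucher
        System.out.println(fold("Breitenb\\\"ucher"));  // breitenbucher
        // The "ue" transliteration is NOT handled by this sketch.
        System.out.println(fold("Breitenbuecher"));     // breitenbuecher
    }
}
```

For this to work, the query string and the indexed field values would both have to pass through the same folding. The transliterated variant stays unmatched here, in line with the remark that it may be out of scope.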

Thus, the steps are:

  1. Develop a concept for using the Lucene search index as the index for bib entries. This could involve updating the Lucene index after a bib entry is added or changed. One also has to consider the cases where a) the index is not available at startup and b) the bib file was changed outside of JabRef (timestamp-based check?). Sub-step: Dive into the existing implementation, starting at https://github.com/JabRef/jabref/pull/8206. One can build on that code.
  2. Implement the concept so that JabRef's search is based on the search index.
  3. Go through the issues listed at https://github.com/JabRef/jabref/issues?q=is%3Aopen+label%3A%22project%3A+GSoC%22+label%3A%22search%22 and check whether they are fixed. If an issue is not fixed, work on fixing it.
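The two startup cases from step 1 can be made concrete with a minimal sketch of a timestamp-based check. It assumes that the time of the last index build is available (here taken from the modification time of a hypothetical marker file) and compares it with the modification time of the bib file. The class `IndexFreshness`, the enum `Action`, and the marker file are assumptions for illustration; they do not come from JabRef or from PR #8206.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.Optional;

// Hypothetical decision helper for illustration only; not JabRef code.
final class IndexFreshness {

    public enum Action { BUILD_FROM_SCRATCH, REBUILD, REUSE }

    /** Pure decision logic: compares the last index build with the last bib file change. */
    public static Action decide(Optional<Instant> indexBuiltAt, Instant bibModifiedAt) {
        if (!indexBuiltAt.isPresent()) {
            return Action.BUILD_FROM_SCRATCH; // case a): no index available at startup
        }
        // case b): the bib file was changed after the index was built
        return bibModifiedAt.isAfter(indexBuiltAt.get()) ? Action.REBUILD : Action.REUSE;
    }

    /** File-based variant: reads both timestamps from the file system. */
    public static Action decide(Path bibFile, Path indexMarker) throws IOException {
        Optional<Instant> builtAt = Optional.empty();
        if (Files.exists(indexMarker)) {
            builtAt = Optional.of(Files.getLastModifiedTime(indexMarker).toInstant());
        }
        return decide(builtAt, Files.getLastModifiedTime(bibFile).toInstant());
    }

    public static void main(String[] args) {
        Instant built = Instant.parse("2022-03-26T10:00:00Z");
        System.out.println(decide(Optional.<Instant>empty(), built));          // BUILD_FROM_SCRATCH
        System.out.println(decide(Optional.of(built), built.plusSeconds(60))); // REBUILD
        System.out.println(decide(Optional.of(built), built.minusSeconds(60))); // REUSE
    }
}
```

A timestamp comparison is cheap but coarse: it triggers a full rebuild even if only one entry changed, and it depends on reliable file system timestamps. Changes made inside JabRef would instead update the index incrementally when an entry is added or changed.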

Just for information: