LLM Powered Search Engine #3503

Open · t83714 (Contributor) opened this issue Feb 16, 2024 · 0 comments

This epic is about adding an LLM (large language model) powered search engine to the open-source Magda codebase, in addition to the existing keyword-based search engine.

As an epic, this ticket provides an overview of the problem we are trying to solve.

1. Motivation

We need a vector store / search engine to facilitate LLM embedding-based indexing & searching.

2. Indexing Strategy

  • We should be able to locate the most relevant information for building context for the LLM, without the LLM being involved in the search process itself.
  • We will need a flexible indexing framework that can support various data sources/formats:
    • indexing not only text-based metadata fields but also relevant data files
    • For text-based data files (e.g. PDF, Word documents), we can index the content as large chunks of text.
      • We need to decide how to include the metadata:
        • option 1: include it in the text, e.g. document author name etc.
        • option 2: use extra metadata fields to recover the full context from the text chunk, e.g. chunk position (see the sketch after this list)
      • We also need to design a protocol for handling non-text content within text-based documents:
        • Scenario 1: graphic items, e.g. charts, graphs etc. We can index:
          • the name (e.g. fig1) & a short description of the graphic item (usually found underneath it)
          • additionally, N chunks of text that mention the graphic item's name (e.g. fig1)
    • For non-text-based data files, we need to design an indexing strategy for each data format:
      • e.g. for tabular data such as CSV, we should at least index the list of column names
        • if any data dictionary information is available, we should index it as well
        • where data dictionary information is missing, the indexing module should also try to infer the column data types (if it has that capability)

3. Vector Store

We will use the OpenSearch 2.x knn_vector field type. Why? Magda already runs an internal OpenSearch instance (see section 4 below), so reusing it for vector search avoids introducing a separate vector database.
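
For illustration, here is a sketch using the official OpenSearch JavaScript client to create an index with a knn_vector field and run a nearest-neighbour query. The index name, field names and dimension (768) are placeholder assumptions:

```typescript
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "http://localhost:9200" });

async function main() {
    // Create an index with a knn_vector field (OpenSearch 2.x k-NN plugin).
    await client.indices.create({
        index: "magda-embeddings", // placeholder index name
        body: {
            settings: {
                index: { knn: true } // enable k-NN search on this index
            },
            mappings: {
                properties: {
                    text: { type: "text" },
                    embedding: {
                        type: "knn_vector",
                        dimension: 768, // must match the embedding model's output size
                        method: {
                            name: "hnsw",              // approximate nearest-neighbour algorithm
                            space_type: "cosinesimil", // cosine similarity
                            engine: "nmslib"
                        }
                    }
                }
            }
        }
    });

    // Query: find the 5 chunks closest to a query embedding.
    const queryVector: number[] = [/* produced by the same embedding model */];
    const result = await client.search({
        index: "magda-embeddings",
        body: {
            size: 5,
            query: {
                knn: {
                    embedding: { vector: queryVector, k: 5 }
                }
            }
        }
    });
    console.log(result.body.hits.hits);
}

main().catch(console.error);
```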

4. Indexing Module / Microservice

We need to introduce a new module to our platform based on Magda's minion framework.

  • How it works:
    • Magda's registry metadata store can notify the minion of any metadata changes.
    • The minion should wake up and perform indexing tasks based on those changes.
      • The minion module should have an extendable code base so that we can adopt changes to the indexing strategy (see section 2 above) as it evolves.
      • The minion can be written in TypeScript or any other language, but it requires the capability of calling modules written in other languages as forked processes.
      • The minion should store the indexing results in our internal OpenSearch instance, according to our OpenSearch index design.
    • The minion should support a recrawl interface to redo the index globally (see the sketch after this list).
      • This gives us the option to rebuild the index after the index design changes.
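
For orientation, here is a rough sketch of what such a minion might look like in TypeScript, assuming the usual @magda/minion-sdk shape. The minion id, watched aspects and the two indexing helpers are hypothetical, and exact SDK option names should be checked against the SDK docs:

```typescript
import minion, { commonYargs } from "@magda/minion-sdk";

const MINION_ID = "minion-embedding-indexer"; // hypothetical id
const argv = commonYargs(6311, "http://localhost:6311");

// Hypothetical stubs standing in for the real indexing pipeline:
// extract/split per section 2, embed, and write to the knn index of section 3.
async function buildIndexDocuments(record: any): Promise<object[]> {
    return []; // e.g. TextChunkDoc / GraphicItemDoc / TabularFileDoc values
}
async function storeInOpenSearch(docs: object[]): Promise<void> {
    // bulk-write docs into the OpenSearch embedding index
}

// Called by the registry whenever a watched record changes.
async function onRecordFound(record: any) {
    const docs = await buildIndexDocuments(record);
    await storeInOpenSearch(docs);
}

minion({
    argv,
    id: MINION_ID,
    aspects: ["dcat-dataset-strings"], // placeholder: aspects to watch
    optionalAspects: [],
    writeAspectDefs: [],
    onRecordFound
}).catch((e: Error) => {
    console.error("Minion failed to start:", e);
    process.exit(1);
});
```

Under these assumptions, a global recrawl amounts to re-running onRecordFound over every record, which is what rebuilding the index after a design change would rely on.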

Some of this design has already been covered by tickets I created for the AI4M data-sharing platform:

However, this ticket is for more generic use cases and will become the common base/facility for all Magda-based projects.
