Design Doc

This design doc outlines the potential design of the application. It is not documentation of an existing system.

People

Just me for now, Simon Weiß.

Glossary

  • RAG = Retrieval Augmented Generation
  • LLM = Large Language Model
  • HF = HuggingFace

Overview Idea

Build an LLM RAG demonstrator in Rust. This could be a nice starting point for a blog article, webinar, or even a project template for future projects.

Context

tbd

Goals and Non-Goals

Goals

  • Local first! This should be easy to install and run on a laptop. I have a Mac, so that's my primary target; we'll see about Windows afterwards
  • Evaluate Rust for applications at the intersection of macOS/Windows software and AI
  • Gain experience towards edge-computing cases, e.g., could this run on a car or a factory robot?
  • Demonstrate the capability to run privacy-first OSS models on one's own data

Non-Goals

  • Production-ready software. That said, I'll explore some Rust libraries for logging, testing, etc.
  • A one-size-fits-all RAG application. Depending on the problem domain, additional filtering, search, and ranking might be necessary (paper with nice graphic, HN comments)

Milestones

  • Get acquainted with Ollama
  • Select and get acquainted with vector db
  • Write script to ingest chunks / docs into vector db (see the chunking sketch after this list)
  • Find interesting use case / data
  • Ingest data into vector db
  • Build CLI chatbot
  • Build UI
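
As a first cut at the ingestion milestone, here is a minimal chunking sketch (pure std, no crates; the `chunk` helper and its parameters are placeholders of mine, and a real script would more likely split on sentence or paragraph boundaries):

```rust
/// Fixed-size character chunker with overlap, so neighboring chunks
/// share a bit of context. Purely illustrative.
fn chunk(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start = end - overlap; // step back to create the overlap
    }
    chunks
}
```

Each chunk would then be embedded and written to the vector db together with its source document metadata.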

Existing Solutions

tbd

Proposed Solution

Components

  • Ingestion pipeline to chunk and embed documents into vector db.
  • Vector db, probably pgvector
  • LLM abstraction, i.e., Ollama, to be able to use different LLMs without much hassle
  • CLI application / backend in Rust that takes a plain user prompt, retrieves relevant documents from the vector db, talks to the LLM via Ollama, and returns the assistant answer (LLM output); see the sketch after this list
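
A minimal sketch of that query loop, assuming Ollama is running locally on its default port (11434), reqwest (blocking + json features) and serde_json as dependencies, and a hypothetical `retrieve` helper backed by the vector db:

```rust
use serde_json::json;

// Hypothetical: embed the user prompt, query the vector db, and return the
// top-k matching chunks. Depends on the vector db choice discussed below.
fn retrieve(_prompt: &str, _k: usize) -> Vec<String> {
    todo!("embed prompt and query vector db")
}

fn answer(user_prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
    let context = retrieve(user_prompt, 3).join("\n---\n");
    let prompt = format!(
        "Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {user_prompt}"
    );
    // Ollama's HTTP API: POST /api/generate; `stream: false` returns a single JSON object.
    let body: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:11434/api/generate")
        .json(&json!({ "model": "llama2", "prompt": prompt, "stream": false }))
        .send()?
        .json()?;
    Ok(body["response"].as_str().unwrap_or_default().to_string())
}
```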

Testability, Monitoring, and Alerting

tbd

Open Questions

Which LLM framework

I started with Ollama but I think I'll switch to llamafile at some point (https://simonwillison.net/2023/Nov/29/llamafile/).

Wins for llamafile: No server, no Go, single file

Which use-case for the Demonstrator?

tbd

How to embed documents and prompt locally?

Model

tbd

Library

I'll go with fastembed-rs for now. See notes below.

Python:

  • The best option seems to be the sentence-transformers library, which is built on top of HF transformers and PyTorch
    • Supports fine-tuning embedding models, which is quite important if there is specific jargon, acronyms, etc.

Rust:

  • I need a way to embed the prompt in the Rust application
  • The easiest would be Ollama, but it doesn't support good embedding models yet (GH issue for enhancement)
  • Via ONNX: https://github.com/Anush008/fastembed-rs (see the sketch after this list)
  • Via Candle: https://github.com/huggingface/text-embeddings-inference . This looks super nice and supports many models. However, it is meant to run as a separate service, doesn't have a Rust client library, and its default mode pulls models from the Hugging Face model hub. That could still be super cool: if the server is efficient and maybe even supports fine-tuned embedding models, it could be a very general solution to deploy for many different projects. For local "on-my-laptop" setups I'm hesitant to use something like this, though ...
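
A minimal embedding sketch with fastembed-rs, following its README at the time of writing (exact type names may differ between versions; anyhow assumed for error handling):

```rust
use fastembed::TextEmbedding;

fn main() -> anyhow::Result<()> {
    // Default model per the fastembed-rs README; the ONNX weights are
    // downloaded once and cached, after that everything runs locally.
    let model = TextEmbedding::try_new(Default::default())?;

    // One vector per input text; `None` keeps the default batch size.
    let embeddings = model.embed(vec!["first chunk", "second chunk"], None)?;
    println!("{} vectors of dim {}", embeddings.len(), embeddings[0].len());
    Ok(())
}
```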

Both via ONNX:

  • Probably quite a bit of work, but it might be best to create the embeddings in a Python pipeline via the sentence-transformers library, then export the model to ONNX and use that in the Rust app
  • But then I'd have to distribute the ONNX runtime with the app ... Probably, for a real deployment, it would be best to go "full Rust", get rid of the Ollama dependency, and just use Candle or Burn

Which Vector DB?

Update 2024-03-08

LanceDB is the way to go for this project ... I think ;). No, seriously: it looks like LanceDB is what I've been looking for, basically the SQLite of vector databases. It's also written in Rust. The only drawback is that the Rust SDK is still quite experimental (they focus on the Python and JS SDKs first).

Update 2024-02-16

To fulfill the "local first" goal, I should maybe use faiss or annoy on disk, run SQLite with the sqlite-vss extension, or check out marqo.

LanceDB

sqlite-vss

Marqo

  • Wants to be the all-in-one solution, not just the vector db part
  • Seems to be built on top of Vespa, ONNX, SBERT, etc., stitching together the text splitting, embedding, and related work
  • The Docker container is 4.7 GB! https://hub.docker.com/r/marqoai/marqo/tags
  • Their whole communication seems quite suspicious
  • There are quite a few open issues regarding compatibility problems with macOS, etc.

My initial Feeling

  • For smaller projects (< 100M vectors): pgvector
  • For projects with high customization needs (additional search capabilities): OpenSearch / Elasticsearch
  • If you need low latency and high throughput: a specialized vector DB. My favorite would be Qdrant (Pinecone if fully managed), but not Chroma (who builds a DB in Python?!)

Resources