This is a design doc outlining the potential design of the application; it is not documentation of an existing system.
Just me for now, Simon Weiß.
- RAG = Retrieval Augmented Generation
- LLM = Large Language Model
- HF = HuggingFace
Build an LLM RAG demonstrator in Rust. This could be a nice starting point for a blog article, a webinar, or even a project template for future projects.
tbd
- Local first! This should be easily installable and runnable on a laptop. I have a Mac, so that's my primary target; we'll see about Windows afterwards
- Evaluate Rust for applications at the intersection of macOS/Windows software and AI
- Gain experience towards edge computing cases, e.g. can this run on a car or factory robot?
- Demonstrate capability to run privacy-first OSS models on own data
- Production-ready software, although I'll explore some Rust libraries for logging, testing, etc.
- A one-size-fits-all RAG application. Depending on the problem domain, additional filtering, search, and ranking might be necessary (paper with nice graphic, HN comments)
- Get acquainted with Ollama
- Select a vector DB and get acquainted with it
- Write a script to ingest chunks/docs into the vector DB
- Find interesting use case / data
- Ingest data into vector db
- Build CLI chatbot
- Build UI
tbd
- Ingestion pipeline to chunk and embed documents into the vector DB
- Vector DB, probably pgvector
- LLM abstraction, i.e., Ollama, to be able to use different LLMs without much hassle
- CLI application / backend in Rust that takes a plain user prompt, fetches relevant documents from the vector DB, talks to the Ollama LLM, and returns the assistant answer (LLM output)
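The last bullet's prompt-assembly step can be sketched as a pure function; the retrieval and Ollama calls are elided, and all names here are my placeholders, not a committed API:

```rust
/// Sketch of the final step in the pipeline above: retrieved chunks are
/// stitched into a context block and combined with the user's question
/// into the prompt sent to the LLM (e.g. via Ollama).
fn build_prompt(chunks: &[&str], question: &str) -> String {
    // Separate chunks so the model can tell individual documents apart.
    let context = chunks.join("\n---\n");
    format!("Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:")
}

fn main() {
    let chunks = ["Rust is a systems language.", "Ollama runs LLMs locally."];
    println!("{}", build_prompt(&chunks, "What does Ollama do?"));
}
```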
tbd
I started with Ollama, but I think I'll switch to llamafile at some point (https://simonwillison.net/2023/Nov/29/llamafile/).
Wins for llamafile: no server, no Go, a single file.
tbd
- SBERT is probably a good way to start, see this X post
- But there are better models, see the HF leaderboard
- multilingual-e5-large seems to work well for German (reddit-1, reddit-2)
- I'll go with fastembed-rs for now. See notes below.
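One detail worth remembering with multilingual-e5-large: per its model card, every input needs a task prefix ("query: " for the search query, "passage: " for documents), otherwise retrieval quality degrades. Trivial helpers (the function names are mine):

```rust
/// multilingual-e5-* models expect a task prefix on every input:
/// "query: " for search queries, "passage: " for stored documents.
fn e5_query(text: &str) -> String {
    format!("query: {text}")
}

fn e5_passage(text: &str) -> String {
    format!("passage: {text}")
}

fn main() {
    println!("{}", e5_query("Wie installiere ich Rust?"));
    println!("{}", e5_passage("Rust wird mit rustup installiert."));
}
```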
Python:
- The best option seems to be the sentence-transformers library, which is built on top of HF Transformers and PyTorch
- It supports fine-tuning embedding models, which is quite important if there is specific jargon, acronyms, etc.
Rust:
- I need a way to embed the prompt in the Rust application
- Easiest would be Ollama, but it doesn't support good embedding models yet (GH issue for enhancement)
- Via ONNX: https://github.com/Anush008/fastembed-rs
- Via Candle: https://github.com/huggingface/text-embeddings-inference . This looks super nice and supports many models. However, it is meant to run as a separate service and doesn't have a Rust client library. Its default mode also pulls models from the Hugging Face model hub and servers. On the other hand, this could be super cool: if the server is efficient and maybe even supports fine-tuned embedding models, it could be a very general solution to deploy across many different projects. For local "on-my-laptop" setups I'm hesitant to use something like this, though ...
Both via ONNX:
- Probably quite a bit of work, but it might be best to create embeddings in the Python pipeline via the sentence-transformers library, export the model to ONNX, and use that in the Rust app
- But then I'd have to distribute the ONNX runtime with the app. For a real deployment it would probably be best to go "full Rust", drop the Ollama dependency, and just use Candle or Burn
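Whichever embedding route I end up with, the retrieval step itself reduces to comparing the embedded prompt against the stored document vectors, typically via cosine similarity. A minimal sketch (names are mine):

```rust
/// Cosine similarity between two embedding vectors.
/// Returns a value in [-1, 1]; higher means more similar.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    let a = [1.0, 0.0];
    let b = [0.0, 1.0];
    println!("{}", cosine_similarity(&a, &a)); // identical direction -> 1
    println!("{}", cosine_similarity(&a, &b)); // orthogonal -> 0
}
```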
LanceDB is the way to go for this project ... I think ;). No, seriously, it looks like LanceDB is what I've been looking for: basically the SQLite of vector databases. It's also written in Rust. The only drawback is that the Rust SDK is still quite experimental (they focus on the Python and JS SDKs first).
To fulfill the "local first" goal, I should maybe use Faiss or Annoy on disk, run SQLite with the sqlite-vss extension, or check out Marqo.
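Another "local first" option to keep in mind: at demo scale, a brute-force scan over in-memory embeddings needs no database at all. A hypothetical sketch scoring by dot product (which equals cosine similarity when the vectors are normalized, as most embedding models' outputs are):

```rust
/// Brute-force nearest-neighbour search: score every stored vector against
/// the query and return the indices of the top-k matches. Fine for the
/// document counts a laptop demo will see; a vector DB only pays off later.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| {
            let score: f32 = query.iter().zip(d).map(|(x, y)| x * y).sum();
            (i, score)
        })
        .collect();
    // Sort by score, descending.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    let docs = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.7, 0.7]];
    let hits = top_k(&[1.0, 0.0], &docs, 2);
    println!("{:?}", hits); // [0, 2]
}
```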
LanceDB:
- Looks promising; could be a real option and probably the better alternative to sqlite-vss
- https://github.com/lancedb/lancedb
Sqlite-vss:
- The author gives an honest overview: https://observablehq.com/@asg017/introducing-sqlite-vss
Marqo:
- Wants to be the all-in-one solution, not just the vectordb part
- Seems to be built on top of Vespa, ONNX, SBERT, etc., stitching together the text splitting, embedding, and related work
- The Docker container is 4.7 GB! https://hub.docker.com/r/marqoai/marqo/tags
- Their whole communication seems quite suspicious
- There are quite a few open issues regarding compatibility problems with macOS, etc.
- For smaller projects (<100 MM vectors): pgvector
- For projects with high customization needs (additional search capabilities): OpenSearch / Elasticsearch
- If you need low latency and high throughput: a specialized vector DB. My favorite would be Qdrant (Pinecone if fully managed), but not Chroma (who builds a DB in Python?!)
- https://news.ycombinator.com/item?id=36943318
- https://ann-benchmarks.com/glove-100-angular_10_angular.html
- https://docs.google.com/spreadsheets/d/1oAeF4Q7ILxxfInGJ8vTsBck3-2U9VV8idDf3hJOozNw/edit?pli=1#gid=0
- https://www.sicara.fr/blog-technique/how-to-choose-your-vector-database-in-2023
- https://maven.com/blog/embeddings