Reading Time: ~10 minutes
Art-Deco Bot Github Repo: https://github.com/Jet-Engine/rag_art_deco
Large Language Models (LLMs) have significantly advanced, improving their ability to answer a broad array of questions. However, they still encounter challenges, particularly with specific or recent information, often resulting in inaccuracies or "hallucinations." To address these issues, the Retrieval Augmented Generation (RAG) approach integrates a document retrieval step into the response generation process. This approach uses a corpus of documents and employs vector databases for efficient retrieval, enhancing the accuracy and reliability of LLM responses through three key steps:
- Segmenting documents into manageable chunks.
- Generating embeddings for both the query and document chunks to measure their relevance through similarity scores.
- Retrieving the most relevant chunks and using them as context to generate well-informed answers.
Vector databases facilitate quick similarity searches and efficient data management, making RAG a powerful solution for enhancing LLM capabilities.
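The three steps above can be sketched in miniature with plain Python. Everything here is a toy stand-in: the bag-of-words "embedding" replaces a real model such as nomic-embed-text, and the linear scan replaces a vector database.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 45) -> list[str]:
    """Step 1: segment a document into fixed-size chunks (real systems
    usually split on sentences or paragraphs, often with overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Step 2: similarity score between query and chunk embeddings."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Step 3: return the k most relevant chunks to use as LLM context."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = "The Chrysler Building is an Art-Deco skyscraper. Bauhaus favored minimalism."
chunks = chunk(doc)
print(retrieve("Which building is Art-Deco?", chunks))
```

In a real RAG pipeline the retrieved chunks would then be prepended to the prompt as context for the LLM.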
The Art-Deco era, spanning the roaring 1920s to the 1940s, left a dazzling legacy in architecture.
Despite the capabilities of models like Meta's Llama3, their responses can be unreliable, especially
for nuanced or detailed queries specific to Art-Deco. Our goal with the Art-Deco Bot is to use RAG to improve
the quality of responses about Art-Deco architecture, comparing these with those generated by traditional LLMs
in both quality and time efficiency.
By designing the Art-Deco Bot, we also aim to show how a complex RAG system can be built. You can access the whole code at the Art-Deco Bot GitHub repository. By examining the code and reading this blog post, you will learn:
- How to scrape documents from Wikipedia and store them in a structured format.
- How to index these documents in a vector database for efficient retrieval.
- How to use LiteLLM to query different LLMs easily.
- How to install Ollama and download models for it.
- How to get API keys from OpenAI and Groq.
- How to write a RAG system that would chunk documents, generate embeddings, and retrieve relevant chunks.
We based our RAG project on Matt Williams' Build RAG with Python project. The code taken from there has been heavily modified and extended. Before reading this blog post and diving into our project code, we recommend checking out Matt's project and the related YouTube video.
Ollama is a program that makes it easy to run LLMs on local machines.
- Install Ollama on your local machine by following instructions on the Ollama website.
- Download the required models for the Art-Deco Bot project.
ollama pull llama3
(the LLM that will be used for RAG)
ollama pull nomic-embed-text
(the embedding model that will be used for RAG)
- You can chat with these models in your terminal after they are downloaded (e.g., with ollama run llama3), but having conversations with them there is not a prerequisite for this project.
In this project, we aim not only to show how RAG can be implemented but also to compare and benchmark RAG results against queries to different LLMs. Some of these LLMs, such as GPT-4, cannot be run locally. Others can be run locally but are compute-heavy, so we choose to run them in the cloud, such as Llama3:70b on Groq.
In short, we need to query different LLMs that each come with different Python libraries. One of the problems LiteLLM solves is providing a unified interface for querying different LLMs. Although LiteLLM has many features, we use it in our project for this purpose alone, which keeps our code cleaner and more readable.
Familiarity with the LiteLLM Python library is not a prerequisite for this project, but checking it out is recommended.
Get your API keys from OpenAI and Groq to use them in the project. Beware that you may be billed for using these services: the Groq API can currently be used for free, but the OpenAI API is not free.
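In this project the keys are placed in config.yaml, but note that LiteLLM can also pick them up from the standard environment variables; a quick sketch with placeholder values:

```python
import os

# Placeholder values shown; never hard-code or commit real keys.
os.environ["OPENAI_API_KEY"] = "sk-..."   # key from the OpenAI dashboard
os.environ["GROQ_API_KEY"] = "gsk_..."    # key from the Groq console
```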
ChromaDB is a vector database that enables efficient storage and retrieval of document embeddings. To set up ChromaDB, follow these steps:
- Install ChromaDB by running:
pip install chromadb
- Start the ChromaDB server with:
chroma run --host localhost --port 8000 --path INDEX_PATH
Replace INDEX_PATH with the path where you want ChromaDB to store its index data.
Kick things off by installing all necessary dependencies:
pip install -r requirements.txt
The config.yaml file serves as the central configuration hub for the Art-Deco Bot project. It allows you to tailor various aspects of the project setup, from API keys to model choices and file storage paths. Below you'll find a detailed breakdown of each section within the config.yaml file and instructions on how to modify it according to your project needs.
api_keys:
  openai_key: "x" # Replace "x" with your OpenAI API key
  groq_key: "y" # Replace "y" with your Groq API key
- openai_key: This is your API key for OpenAI services, used primarily for interfacing with OpenAI's models.
- groq_key: This key is used to access Groq's computational resources. Ensure you replace "x" and "y" with your actual API keys to authenticate requests properly.
models:
  main_model: "llama3" # Primary LLM used for retrieval-augmented tasks
  embed_model: "nomic-embed-text" # Model for generating embeddings
- main_model: Specifies the main language model used in the project, in this case "llama3".
- embed_model: Indicates the model used for generating embeddings, essential for the RAG functionality; in our case this is "nomic-embed-text".
Note that llama3 and llama3:8b point to the same model on Ollama (the 8B variant is the default tag).
chromadb:
  chroma_host: "localhost" # Host where the ChromaDB server is running
  chroma_port: 8000 # Port on which the ChromaDB server listens
  chroma_collection_name: "wiki-art-deco-embeddings" # Collection name for storing embeddings
- chroma_host : The hostname for the ChromaDB server (usually "localhost" if running locally).
- chroma_port : The port number where ChromaDB listens for connections.
- chroma_collection_name : The name of the collection within ChromaDB where document embeddings are stored.
paths:
  rag_files_path: "rag_files/" # Directory where scraped articles are stored
  questions_file_path: "evaluation/questions.csv" # Path to the CSV file containing evaluation questions
  evaluation_path: "evaluation/" # Directory where evaluation results are stored
- rag_files_path : The directory path where articles fetched by the wiki-bot are stored. This can be adjusted if you prefer a different directory structure.
- questions_file_path : Location of the CSV file with questions used to evaluate the model's performance.
- evaluation_path : Specifies the directory for storing output files from the evaluation scripts.
To modify any of these settings:
- Open the config.yaml file in a text editor.
- Replace the default values with your desired configurations.
- Save the changes and ensure the project's scripts are directed to use this updated configuration.
By properly configuring your config.yaml, you can streamline the operation of the Art-Deco Bot to better fit your infrastructure and project goals.
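A sketch of how scripts might load this file with PyYAML (the repository's actual loading code may differ; `load_config` is a hypothetical helper name):

```python
import yaml  # pip install pyyaml

def load_config(path: str = "config.yaml") -> dict:
    """Parse the YAML config into a plain dict for the rest of the scripts."""
    with open(path) as f:
        return yaml.safe_load(f)

# Example usage:
#   config = load_config()
#   main_model = config["models"]["main_model"]
```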
Running the Art-Deco Bot involves several steps, including collecting documents, indexing them in a vector database, and querying the RAG model. Here's a detailed guide to help you navigate through the process.
This step is optional since the content files of all scraped articles are available in the rag_files
directory, so there is no need to repeat the scraping process.
Our initial step involves gathering knowledge about Art-Deco architecture. We focus on U.S. structures,
given their prominence in the Art-Deco movement. The wiki-bot.py
script automates the collection of
relevant Wikipedia articles, organizing them into a structured directory for ease of access.
Run the bot using:
python wiki-bot.py
When you run wiki-bot.py with an empty rag_files directory, it saves the contents of the scraped Wikipedia articles in a sub-folder named text under rag_files. The bot also creates various sub-folders to organize different types of data such as article URLs, references, etc. Since our current focus is only on the contents of the Wikipedia articles, to reduce clutter, we transferred the contents from the text sub-folder to the main rag_files directory and removed all other sub-folders.
Thus, if you want to run the bot yourself—which is unnecessary since the scraped documents are already in the rag_files directory—you would need to either copy all files from the text sub-folder to the rag_files directory and then delete all sub-folders within rag_files, or simply change the rag_files_path in config.yaml to rag_files/text.
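For anyone who does want to reproduce that cleanup, the move-and-delete step can be scripted with the standard library alone (`flatten_rag_files` is our own hypothetical helper name, not part of the project code):

```python
import shutil
from pathlib import Path

def flatten_rag_files(root: str = "rag_files") -> None:
    """Move every file from root/text up into root, then remove all
    sub-folders, leaving a flat directory of article text files."""
    root_path = Path(root)
    text_dir = root_path / "text"
    if text_dir.is_dir():
        for f in text_dir.iterdir():
            if f.is_file():
                shutil.move(str(f), root_path / f.name)
    # Delete every remaining sub-folder (urls, references, the now-empty text, ...).
    for sub in root_path.iterdir():
        if sub.is_dir():
            shutil.rmtree(sub)
```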
Index the documents by running:
python indexing.py
Make sure ChromaDB
is running before executing this script.
Before running chat.py, ensure the ChromaDB server is active and the config.yaml settings are correct, including the API keys for OpenAI and Groq.
Customize the queries by editing questions.csv (the path is set by questions_file_path in config.yaml). To initiate the Art-Deco Bot, run:
python chat.py
The bot outputs its inference and benchmark data in various formats—including HTML, Markdown, JSON, and CSV—to the directory specified by the evaluation_path in the config file. This allows you to assess and compare the response quality between RAG (Retrieval-Augmented Generation) and LLMs (Large Language Models).
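As an illustration of producing such multi-format output with pandas (a sketch only; the actual chat.py may differ in structure and column names, and `export_results` is a hypothetical helper):

```python
import pandas as pd

def export_results(rows: list[dict], out_prefix: str) -> None:
    """Write benchmark rows (e.g. question, model, answer, inference_time)
    side by side as CSV, JSON, and HTML for later comparison."""
    df = pd.DataFrame(rows)
    df.to_csv(f"{out_prefix}.csv", index=False)
    df.to_json(f"{out_prefix}.json", orient="records", indent=2)
    df.to_html(f"{out_prefix}.html", index=False)
```

Keeping all formats derived from one DataFrame guarantees the CSV, JSON, and HTML views never disagree.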
The config.yaml file includes an evaluation path field, which specifies the directory for storing outputs from the LLMs and RAG. These outputs are generated based on queries in the questions.csv file and are saved as JSON, CSV, and HTML files for thorough analysis. The data in the CSV and HTML files is presented in tabular form, facilitating an organized review of the results from the chat.py script.
The evaluation folder houses files generated by executing the chat.py script with questions from the questions.csv file, using a Mac Mini M2 Pro. If you run the chat.py script without altering the questions.csv file, the files produced should be similar in content for the LLM and RAG inference columns. The contents may, however, differ greatly for the inference-time columns.
One of the aims of the Art-Deco Bot project is to compare the responses generated by RAG with those from traditional LLMs. By querying different models, we can evaluate their performance in terms of accuracy, relevance, and time efficiency. Evaluating the quality of responses is not easy; it is a subjective task. The quality of RAG responses is also highly correlated with the quality of the indexed documents.
Since we aim to experiment with different embedding models and chunking techniques in the future, we skip a thorough evaluation of response quality in this blog post.
If you are interested, you can compare the results yourself by examining the generated tables of responses and the document set indexed for RAG. Interestingly, we can outsource this task to LLMs such as GPT-4: we gave our generated CSV files to GPT-4 and asked it to compare the responses of RAG and the LLMs. The results are below:
Analyzing the responses from the ollama_rag model compared to other LLMs (like GPT-4 and ollama-llama3) in your benchmark, we can make several observations regarding correctness, succinctness, and potential for hallucination.
- Correctness:
  - The ollama_rag model generally provides accurate answers similar to other models. For example, for the question about the opening of Radio City Music Hall, it correctly identifies the opening date as December 27, 1932, which matches the answers from GPT-4 and groq-llama3-70b.
  - However, there are instances where ollama_rag gives an incorrect or less accurate answer, such as the height of Rand Tower Hotel, where it provides an answer that lacks a specific figure, in contrast to the correct height given by groq-llama3-70b.
- Succinctness:
  - The ollama_rag responses tend to be more verbose compared to GPT-4. The model provides additional contextual information that might not be necessary to directly answer the question but enriches the user's understanding. For example, in describing the use of Mark Hellinger Theatre in its first decade, ollama_rag includes a detailed list of different uses, which is informative but more detailed than necessary for direct inquiries.
  - This verbosity can be seen as a double-edged sword: it enhances detail at the cost of brevity, which may not always align with user expectations for succinctness.
- Hallucination:
  - The ollama_rag model seems to have issues with fabricating details or providing irrelevant historical context. For example, it mentioned details about different decades and events that were not strictly relevant to the direct use of the Mark Hellinger Theatre in its first decade. This suggests a tendency towards confabulation under certain conditions.
  - For questions where very specific or less well-known knowledge is required, such as the architectural details of Lamar High School, ollama_rag provides a blend of correct and possibly confabulated or less relevant details, which might mislead users who need precise information.
- Comparative Performance:
  - Against GPT-4 and other LLMs like groq-llama3-70b and ollama-llama3, ollama_rag holds up reasonably well in terms of factual accuracy but may lag in directness and clarity due to its verbose and occasionally less focused answers.
  - The ollama_rag responses suggest that while it integrates knowledge well, its application might be best suited for scenarios where detailed explorations of topics are more valuable than concise answers.
In summary, the ollama_rag model demonstrates a robust capability to generate detailed and contextually rich answers, but it may benefit from improvements in precision and adherence to the specific demands of queries to better align with user expectations for direct and succinct information.
- Inference for llama3 on RAG tasks takes longer than inference for llama3 on single-question tasks. This is expected, since inference time increases with the number of tokens in the query.
- Indexing the document set takes considerable time. For example, our Art-Deco document set contains 2109 plain text files totaling around 10MB. Indexing this document set with ChromaDB takes around 10 minutes on a Mac Mini M2 Pro. The long indexing time of a large document set may be a setback for RAG projects.
- Creating embeddings for queries and running similarity searches on the vector database take negligible time compared to LLM inference.
If you would like to use different LLMs in the project for question querying, you can modify the following part of the chat.py file:
all_models = {
    "gpt-4": "gpt-4",
    "groq-llama3-8b": "groq/llama3-8b-8192",
    "groq-llama3-70b": "groq/llama3-70b-8192",
    "ollama-llama3": "ollama/llama3",
    "ollama-llama3-70b": "ollama/llama3:70b",
}

selected_models = ["gpt-4", "ollama-llama3", "groq-llama3-70b"]
Note that you need to learn how the LLMs you would like to integrate are named internally in LiteLLM; these internal names go in the values of the all_models dictionary. Then add the friendly names (the keys of all_models) of the models you want the bot to query to the selected_models list.
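For example, to add another Groq-hosted model (the key "groq-mixtral" and the LiteLLM identifier below are illustrative; check LiteLLM's provider documentation for the exact model string):

```python
# Abbreviated copy of the dictionary from chat.py.
all_models = {
    "gpt-4": "gpt-4",
    "groq-llama3-70b": "groq/llama3-70b-8192",
    "ollama-llama3": "ollama/llama3",
}
selected_models = ["gpt-4", "ollama-llama3", "groq-llama3-70b"]

# 1) Register the new model: LiteLLM's internal name goes in the value.
all_models["groq-mixtral"] = "groq/mixtral-8x7b-32768"
# 2) Add the friendly key to the selection list so the bot queries it.
selected_models.append("groq-mixtral")
```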
In future blog posts, we plan to delve deeper into the Art-Deco Bot project.
- We would like to benchmark performance of different vector databases.
- We would like to add more questions to our question set.
- We would like to migrate the project to PulseJet, JetEngine's state-of-the-art vector database, to make our bot more performant and scalable.
- We would like to explore different techniques and parameters for chunking and embedding similarity measurements.
- We would like to expand this project into different domains and LLMs with minimal code change.
- We would like to add GUI to the project to make it more user-friendly.
Stay tuned for upcoming posts that will give you new insights into the exciting world of RAG while you appreciate the beauty and elegance of Art-Deco architecture.
Author: Güvenç USANMAZ