
Architecting Data: Building the Art-Deco Bot with RAG

Reading Time: ~10 minutes

Art-Deco Bot Github Repo: https://github.com/Jet-Engine/rag_art_deco

This blog post can also be read at the following links:

  1. JetEngine's Blog Post
  2. JetEngine's Medium Blog Post

Introduction to RAG

Large Language Models (LLMs) have significantly advanced, improving their ability to answer a broad array of questions. However, they still encounter challenges, particularly with specific or recent information, often resulting in inaccuracies or "hallucinations." To address these issues, the Retrieval Augmented Generation (RAG) approach integrates a document retrieval step into the response generation process. This approach uses a corpus of documents and employs vector databases for efficient retrieval, enhancing the accuracy and reliability of LLM responses through three key steps:

  1. Segmenting documents into manageable chunks.
  2. Generating embeddings for both the query and document chunks to measure their relevance through similarity scores.
  3. Retrieving the most relevant chunks and using them as context to generate well-informed answers.

Vector databases facilitate quick similarity searches and efficient data management, making RAG a powerful solution for enhancing LLM capabilities.
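To make these three steps concrete, below is a minimal, self-contained sketch of the retrieval step in plain Python. The embed function here is only a stand-in for a real embedding model; the actual project uses nomic-embed-text served by Ollama and stores embeddings in ChromaDB, as described later in this post.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; swap in a real embedding model in practice.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

def chunk(document: str, chunk_size: int = 100) -> list[str]:
    # Step 1: segment the document into manageable chunks.
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    # Step 2: embed the query and each chunk, score them by cosine similarity.
    q = embed(query)
    scored = []
    for c in chunks:
        v = embed(c)
        scored.append((float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))), c))
    # Step 3: return the most relevant chunks to use as context for the LLM.
    scored.sort(reverse=True)
    return [c for _, c in scored[:top_k]]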

Purpose of the Art-Deco Bot

The Art-Deco era, spanning the roaring 1920s to the 1940s, left a dazzling legacy in architecture.
Despite the capabilities of models like Meta's Llama3, their responses can be unreliable, especially for nuanced or detailed queries specific to Art-Deco. Our goal with the Art-Deco Bot is to use RAG to improve the quality of responses about Art-Deco architecture and to compare these responses with those generated by plain LLM queries, in both quality and time efficiency.

By designing the Art-Deco Bot, we also aim to show how a complex RAG system can be built. You can access the whole code at the Art-Deco Bot GitHub repository. By examining the code and reading this blog post, you will learn:

  • How to scrape documents from Wikipedia and store them in a structured format.
  • How to index these documents in a vector database for efficient retrieval.
  • How to use LiteLLM to query different LLMs easily.
  • How to install Ollama and download models for it.
  • How to get API keys from OpenAI and Groq.
  • How to write a RAG system that chunks documents, generates embeddings, and retrieves relevant chunks.

Setting Up the Art-Deco Bot

Prerequisites

Checking Matt Williams' RAG Project

We based our RAG project on Matt Williams' Build RAG with Python project. The code taken from it has been heavily modified and extended. Before reading this blog post and diving into our project code, we recommend checking out Matt's project and the related YouTube video.

Ollama

Ollama is a tool that makes it easy to run LLMs on a local machine.

  • Install Ollama on your local machine by following instructions on the Ollama website.
  • Download the required models for the Art-Deco Bot project.
    • ollama pull llama3 (LLM that would be used for RAG)
    • ollama pull nomic-embed-text (embedding model that would be used for RAG)
  • You can run these models in your terminal once they are downloaded (for example, ollama run llama3), but having conversations with them in the terminal is not a prerequisite for this project.
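As a quick sanity check, you can also call the pulled models from Python with the ollama package. The snippet below is a minimal sketch (not part of the project code) and assumes the ollama package is installed with pip install ollama:

import ollama

# Ask the LLM for a short completion.
reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": "Define Art Deco in one sentence."}])
print(reply["message"]["content"])

# Generate an embedding with the embedding model.
embedding = ollama.embeddings(model="nomic-embed-text", prompt="Chrysler Building")["embedding"]
print(len(embedding))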

LiteLLM

In this project we aim not only to show how RAG can be done but also to compare and benchmark RAG results against queries to different LLMs. Some of these LLMs, such as GPT-4, cannot be run locally. Others can be run locally but are compute-heavy, so we choose to run them in the cloud, for example Llama3:70b on Groq.

In short, we need to query different LLMs that each come with different Python libraries. One of the problems LiteLLM strives to solve is providing a unified interface for querying different LLMs. Although LiteLLM has many features, we use it in our project only for this purpose, which keeps our code cleaner and more readable.

Checking the LiteLLM Python library is not a prerequisite for this project, but it is recommended.
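The snippet below is a minimal sketch of what this unified interface looks like. The model identifiers follow LiteLLM's naming scheme, and the keys for the hosted providers are read from environment variables (the key values here are placeholders):

import os
from litellm import completion

os.environ["OPENAI_API_KEY"] = "your-openai-key"   # needed for gpt-4
os.environ["GROQ_API_KEY"] = "your-groq-key"       # needed for groq/... models

messages = [{"role": "user", "content": "Name one Art-Deco landmark in New York."}]

# The same call works for a local Ollama model, OpenAI, and Groq.
for model in ["ollama/llama3", "gpt-4", "groq/llama3-70b-8192"]:
    response = completion(model=model, messages=messages)
    print(model, "->", response.choices[0].message.content)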

API Keys

Get your API keys from OpenAI and Groq to use in the project. Be aware that you may be billed for using these services: at the time of writing, the Groq API offers a free tier, while the OpenAI API is paid.

Setting Up ChromaDB

ChromaDB is a vector database that enables efficient storage and retrieval of document embeddings. To set up ChromaDB, follow these steps:

  1. Install ChromaDB by running: pip install chromadb
  2. Start the ChromaDB server with: chroma run --host localhost --port 8000 --path INDEX_PATH

Replace INDEX_PATH with the path where you want ChromaDB to store its index data.
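Once the server is running, scripts can connect to it over HTTP. The following is a minimal sketch (not project code) of connecting to the server and to the collection used later in this post:

import chromadb

# Connect to the ChromaDB server started above.
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="wiki-art-deco-embeddings")

# Store one document with a precomputed embedding, then query by embedding.
collection.add(ids=["doc-1"], embeddings=[[0.1, 0.2, 0.3]], documents=["Example chunk"])
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=1)
print(results["documents"])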

Installing Dependencies

Kick things off by installing all necessary dependencies:

pip install -r requirements.txt

Configuration with config.yaml

Overview

The config.yaml file serves as the central configuration hub for the Art-Deco Bot project. It allows you to tailor various aspects of the project setup, from API keys to model choices and file storage paths. Below you'll find a detailed breakdown of each section within the config.yaml file and instructions on how to modify them according to your project needs.

Configuration Details
API Keys
#api_keys:
openai_key: "x"  # Replace "x" with your OpenAI API key
groq_key:   "y"  # Replace "y" with your Groq API key
  • openai_key : This is your API key for OpenAI services, used primarily for interfacing with OpenAI's models.
  • groq_key : This key is used to access Groq's computational resources. Ensure you replace "x" and "y" with your actual API keys to authenticate requests properly.
Models
#models:
main_model: "llama3"            # Primary LLM used for retrieval-augmented tasks
embed_model: "nomic-embed-text" # Model for generating embeddings
  • main_model : Specifies the main language model used in the project, which in this case is "llama3".
  • embed_model : Indicates the model used for generating embeddings, essential for the RAG functionality. The embedding model is "nomic-embed-text" in our case.

Note that on Ollama, llama3 and llama3:8b point to the same model.

ChromaDB Configuration
#chromadb:
chroma_host: "localhost"                 # Host where ChromaDB server is running
chroma_port: 8000                        # Port on which ChromaDB server listens
chroma_collection_name: "wiki-art-deco-embeddings" # Collection name for storing embeddings
  • chroma_host : The hostname for the ChromaDB server (usually "localhost" if running locally).
  • chroma_port : The port number where ChromaDB listens for connections.
  • chroma_collection_name : The name of the collection within ChromaDB where document embeddings are stored.
File Paths
#paths:
rag_files_path: "rag_files/"             # Directory where scraped articles are stored
questions_file_path: "evaluation/questions.csv"  # Path to the CSV file containing evaluation questions
evaluation_path: "evaluation/"           # Directory where evaluation results are stored
  • rag_files_path : The directory path where articles fetched by the wiki-bot are stored. This can be adjusted if you prefer a different directory structure.
  • questions_file_path : Location of the CSV file with questions used to evaluate the model's performance.
  • evaluation_path : Specifies the directory for storing output files from the evaluation scripts.
Modifying the Configuration

To modify any of these settings:

  1. Open the config.yaml file in a text editor.
  2. Replace the default values with your desired configurations.
  3. Save the changes and ensure the project's scripts are directed to use this updated configuration.

By properly configuring your config.yaml, you can streamline the operation of the Art-Deco Bot to better fit your infrastructure and project goals.
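As an illustration, the scripts can read these settings with PyYAML (covered by requirements.txt). This is a minimal sketch assuming the flat key layout shown above, where section names are comments; the actual loading code in the repository may differ:

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

openai_key = config["openai_key"]
main_model = config["main_model"]
embed_model = config["embed_model"]
chroma_host = config["chroma_host"]
chroma_port = config["chroma_port"]
rag_files_path = config["rag_files_path"]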

Running the Art-Deco Bot

Running the Art-Deco Bot involves several steps, including collecting documents, indexing them in a vector database, and querying the RAG model. Here's a detailed guide to help you navigate through the process.

(OPTIONAL) Collecting Documents with wiki-bot.py

This step is optional since the content files of all scraped articles are available in the rag_files directory, so there is no need to repeat the scraping process.

Our initial step involves gathering knowledge about Art-Deco architecture. We focus on U.S. structures, given their prominence in the Art-Deco movement. The wiki-bot.py script automates the collection of relevant Wikipedia articles, organizing them into a structured directory for ease of access.

Run the bot using:

python wiki-bot.py

When you run wiki-bot.py with an empty rag_files directory, it saves the contents of the scraped Wikipedia articles in a sub-folder named text under rag_files. The bot also creates various sub-folders to organize different types of data such as article URLs, references, etc. Since our current focus is only on the contents of the Wikipedia articles, to reduce clutter, we transferred the contents from the text sub-folder to the main rag_files directory and removed all other sub-folders.

Thus, if you want to run the bot yourself (which is unnecessary, since the scraped documents are already in the rag_files directory), you need to either copy all files from the text sub-folder to the rag_files directory and delete all other sub-folders within rag_files, or simply change rag_files_path in config.yaml to rag_files/text.
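To give a feel for what this scraping step involves, here is a minimal sketch of fetching one article's plain text with the wikipedia package. It is illustrative only; wiki-bot.py may use a different library and crawling strategy:

import os
import wikipedia

os.makedirs("rag_files", exist_ok=True)

# Fetch one article and save its body as plain text, mirroring the
# files found in the rag_files directory.
page = wikipedia.page("Chrysler Building")
with open(os.path.join("rag_files", "Chrysler_Building.txt"), "w", encoding="utf-8") as f:
    f.write(page.content)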

Indexing Documents for Vector Database with indexing.py

Index the documents by running:

python indexing.py

Make sure ChromaDB is running before executing this script.
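Conceptually, the indexing step reads each article, splits it into chunks, embeds the chunks, and stores them in ChromaDB. The sketch below is a simplified illustration of that flow; the chunk size, ID naming, and embedding call are assumptions, not the exact code in indexing.py:

import os
import chromadb
import ollama

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="wiki-art-deco-embeddings")

def chunk_text(text, size=200):
    # Split the article into fixed-size word chunks.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

for filename in os.listdir("rag_files"):
    path = os.path.join("rag_files", filename)
    if not os.path.isfile(path):
        continue
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for i, chunk in enumerate(chunk_text(text)):
        # Embed each chunk with nomic-embed-text and store it in the collection.
        embedding = ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
        collection.add(ids=[f"{filename}-{i}"], embeddings=[embedding], documents=[chunk])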

Doing LLM Inference and RAG with chat.py

Before running chat.py, ensure the ChromaDB server is active and the config.yaml settings are correct, including API keys for OpenAI and Groq.

Customize the queries by editing questions.csv. To initiate the Art-Deco Bot, run:

python chat.py

The bot outputs its inference and benchmark data in various formats—including HTML, Markdown, JSON, and CSV—to the directory specified by the evaluation_path in the config file. This allows you to assess and compare the response quality between RAG (Retrieval-Augmented Generation) and LLMs (Large Language Models).
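At its core, each RAG query in this script follows the retrieve-then-generate pattern described in the introduction. The sketch below shows that pattern in isolation; the prompt wording and the number of retrieved chunks are illustrative, not the exact values used in chat.py:

import chromadb
import ollama
from litellm import completion

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="wiki-art-deco-embeddings")

question = "When did Radio City Music Hall open?"

# Embed the question and retrieve the most relevant chunks as context.
query_embedding = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
hits = collection.query(query_embeddings=[query_embedding], n_results=5)
context = "\n\n".join(hits["documents"][0])

# Ask the LLM to answer using only the retrieved context.
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
response = completion(model="ollama/llama3", messages=[{"role": "user", "content": prompt}])
print(response.choices[0].message.content)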

Output Generation by chat.py

The config.yaml file includes an evaluation path field, which specifies the directory for storing outputs from the LLMs and RAG. These outputs are generated based on queries in the questions.csv file and are saved as JSON, CSV, and HTML files for thorough analysis. The data in the CSV and HTML files is presented in tabular form, facilitating the review of results from the chat.py script in an organized manner.

The evaluation folder contains files generated by running the chat.py script on the questions in questions.csv, using a Mac Mini M2 Pro. If you run chat.py without altering questions.csv, the files you produce should be similar in content for the LLM and RAG response columns, while the inference-time columns may differ considerably depending on your hardware.

Comparison of Responses

One of the aims of the Art-Deco Bot project is to compare the responses generated by RAG with those from plain LLM queries. By querying different models, we can evaluate their performance in terms of accuracy, relevance, and time efficiency. Evaluating response quality is not easy and is inherently subjective. The quality of RAG responses is also highly correlated with the quality of the indexed documents.

Since we aim to experiment with different embedding models and chunking techniques in the future, we skip a thorough evaluation of response quality in this blog post.

If you are interested, you can compare the results yourself by checking the generated tables, which include the responses, against the document set indexed for RAG. Interestingly, we can also outsource this task to LLMs such as GPT-4: we gave our generated CSV files to GPT-4 and asked it to compare the RAG and LLM responses. The results are below:

GPT-4 Analysis of RAG and LLM Responses

Analyzing the responses from the ollama_rag model compared to other LLMs (like GPT-4 and ollama-llama3) in your benchmark, we can make several observations regarding correctness, succinctness, and potential for hallucination.

  1. Correctness:
  • The ollama_rag model generally provides accurate answers similar to other models. For example, for the question about the opening of Radio City Music Hall, it correctly identifies the opening date as December 27, 1932, which matches the answers from GPT-4 and groq-llama3-70b.
  • However, there are instances where ollama_rag gives an incorrect or less accurate answer, such as the height of Rand Tower Hotel, where it provides an answer that lacks a specific figure, in contrast to the correct height given by groq-llama3-70b.
  2. Succinctness:
  • The ollama_rag responses tend to be more verbose compared to GPT-4. This model provides additional contextual information that might not be necessary to directly answer the question but enriches the user's understanding. For example, in describing the use of Mark Hellinger Theatre in its first decade, ollama_rag includes a detailed list of different uses, which is informative but more detailed than necessary for direct inquiries.
  • This verbosity can be seen as a double-edged sword—it enhances detail at the cost of brevity, which may not always align with user expectations for succinctness.
  3. Hallucination:
  • The ollama_rag model seems to have issues with fabricating details or providing irrelevant historical context. For example, it mentioned details about different decades and events that were not strictly relevant to the direct use of the Mark Hellinger Theatre in its first decade. This suggests a tendency towards confabulation under certain conditions.
  • For questions where very specific or less well-known knowledge is required, such as the architectural details of Lamar High School, ollama_rag provides a blend of correct and possibly confabulated or less relevant details, which might mislead users who need precise information.
  4. Comparative Performance:
  • Against GPT-4 and other LLMs like groq-llama3-70b and ollama-llama3, ollama_rag holds up reasonably well in terms of factual accuracy but may lag in directness and clarity due to its verbose and occasionally less focused answers.
  • The ollama_rag responses suggest that while it integrates knowledge well, its application might be best suited for scenarios where detailed explorations of topics are more valuable than concise answers.

In summary, the ollama_rag model demonstrates a robust capability to generate detailed and contextually rich answers, but it may benefit from improvements in precision and adherence to the specific demands of queries to better align with user expectations for direct and succinct information.

Comparison of Response Times

  • Inference with Llama3 on RAG queries takes longer than inference on a single standalone question. This is expected: as the number of tokens in a query increases, inference time increases.
  • Indexing the document set takes considerable time. For example, our Art-Deco document set contains 2109 plain-text files totaling around 10 MB. Indexing this document set with ChromaDB takes around 10 minutes on a Mac Mini M2 Pro. The long indexing time for a large document set may be a drawback for RAG projects.
  • Creating embeddings for queries and running similarity search on the vector database take negligible time compared to LLM inference.

Modifying chat.py for Different LLMs

If you would like to use different LLMs for question querying in the project, you can modify the following part of chat.py:

all_models = {
    "gpt-4": "gpt-4",
    "groq-llama3-8b": "groq/llama3-8b-8192",
    "groq-llama3-70b": "groq/llama3-70b-8192",
    "ollama-llama3": "ollama/llama3",
    "ollama-llama3-70b": "ollama/llama3:70b",
}

selected_models = ["gpt-4", "ollama-llama3", "groq-llama3-70b"]

Note that you need to find out how the LLMs you want to integrate are named internally in LiteLLM; these names go into the values of the all_models dictionary. Then add the keys of the models you want the bot to use to the selected_models list.
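For example, to also benchmark the smaller Llama3 model served via Groq, which is already defined in all_models above, extend selected_models like this:

selected_models = ["gpt-4", "ollama-llama3", "groq-llama3-70b", "groq-llama3-8b"]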

Roadmap for the Art-Deco Bot

In future blog posts, we plan to delve deeper into the Art-Deco Bot project.

  • We would like to benchmark performance of different vector databases.
  • We would like to add more questions to our question set.
  • We would like to migrate the project to JetEngine's state-of-the-art vector database, PulseJet, to make our bot more performant and scalable.
  • We would like to explore different techniques and parameters for chunking and embedding similarity measurements.
  • We would like to expand this project into different domains and LLMs with minimal code change.
  • We would like to add a GUI to the project to make it more user-friendly.

Stay tuned for upcoming posts to gain new insights into the exciting world of RAG while appreciating the beauty and elegance of Art-Deco architecture.

Author: Güvenç USANMAZ