🌸 Nagato

Nagato is a framework that streamlines the creation of fine-tuned embedding and language models tailored to question answering over a corpus of data.

Quick Start Guide • Features • Key benefits • How it works

Quick Start Guide

Full documentation of all methods in the nagato library will be posted soon.

Rename .env.example to .env and populate the environment variables.
Install the nagato-ai package using either pip or Poetry:

For pip:

pip install nagato-ai

For Poetry:

poetry add nagato-ai
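
Before calling the library, it can help to confirm that every variable in .env has actually been filled in. The following is a minimal sketch using only the standard library. It assumes the file contains simple KEY=VALUE lines, and the helper names `read_env_file` and `empty_keys` are hypothetical (they are not part of Nagato). No Nagato-specific variable names are assumed.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def empty_keys(values):
    """Return the names of variables that are still unset, sorted by name."""
    return sorted(key for key, value in values.items() if not value)


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = empty_keys(read_env_file(env_path))
        print("Unpopulated variables:", missing or "none")
    else:
        print("No .env file found in the current directory")
```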

Create and store embeddings

from nagato import create_vector_embeddings

results = create_vector_embeddings(
  type="PDF",
  filter_id="MY_DOCUMENT_ID",
  url="https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q2-2023-Update.pdf",
)

Create fine-tuned model

from nagato import create_finetuned_model

results = create_finetuned_model(
  url="https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q2-2023-Update.pdf",
  type="PDF",
  base_model="LLAMA2_7B_CHAT",
  provider="REPLICATE",
  webhook_url="https://webhook.site/ebe803b9-1e34-4b20-a6ca-d06356961cd1",
)
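
The payload that Nagato or the provider sends to webhook_url is not documented here, so the sketch below makes no assumption about its fields. It is a minimal local receiver built on the standard library that decodes whatever is POSTed and prints it, which can be useful for inspecting the callback during development. The names `decode_body`, `WebhookHandler`, and `serve` are hypothetical helpers, not part of Nagato.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_body(raw: bytes):
    """Decode a POST body as JSON, falling back to plain text."""
    try:
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return raw.decode("utf-8", errors="replace")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        payload = decode_body(self.rfile.read(length))
        print("Received webhook:", payload)
        # Acknowledge receipt with an empty 200 response.
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Listen on all interfaces until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the receiver on port 8000. To receive callbacks from an external provider, the port would need to be reachable from the internet, for example through a tunnel.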

Features

Data ingestion from various formats such as JSON, CSV, TXT, PDF, etc.
Data embedding using pre-trained or fine-tuned models
Storage of embedded vectors
Automatic generation of question/answer pairs for model fine-tuning
Built-in code interpreter
API concurrency for scalability and performance
Workflow management for ingestion pipelines
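
As an illustration of concurrent ingestion from the caller's side, the sketch below runs one ingestion call per document in a thread pool. It does not show how Nagato implements concurrency internally. `ingest_many` is a hypothetical helper, and `ingest_one` is any callable you supply. Whether Nagato's own functions are safe to call from multiple threads is an assumption you would need to verify.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_many(documents, ingest_one, max_workers=4):
    """Run ingest_one(doc) for each document in a thread pool.

    Returns a dict mapping each document's index to its result,
    or to the exception raised while processing it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ingest_one, doc): i for i, doc in enumerate(documents)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:  # keep going if one document fails
                results[index] = exc
    return results
```

With the Quick Start call, each document could be a dict of keyword arguments and `ingest_one` could be `lambda doc: create_vector_embeddings(**doc)`.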

Key benefits

Faster inference: General-purpose models carry computational overhead from their broad training. In contrast, our fine-tuned models are optimized for a specific domain, which enables faster inference and more timely results.
Lower costs: A model fine-tuned on a specific corpus needs fewer tokens to understand a query and generate an accurate response. Fewer tokens mean less computation and lower operating costs.
Better results: Fine-tuned models outperform generic, all-purpose models on specialized tasks. Whether you are generating embeddings or answering complex queries, you can expect more accurate and contextually relevant results.

How it works

Nagato utilizes distinct strategies to process structured and unstructured data, aiming to produce fine-tuned models for both types. Below is a breakdown of how this is accomplished:

Unstructured data:

Selection of Embedding Model: The first step involves a careful analysis of the textual content to select an appropriate text-based embedding model. Based on various characteristics of the corpus such as vocabulary, context, and domain-specific jargon, Nagato picks the most suitable pre-trained text-based model for embedding.
Fine-Tuning the Embedding Model: Once the initial text-based model is selected, it is then fine-tuned to align more closely with the specific domain or subject matter of the corpus. This ensures that the embeddings generated are as accurate and relevant as possible.
Fine-Tuning the Language Model: After generating and storing embeddings, Nagato creates question-answer pairs for the purpose of fine-tuning a GPT-based language model. This yields a language model that is highly specialized in understanding and generating text within the domain of the corpus.
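
The last step above turns question-answer pairs into training data. A common interchange format for fine-tuning data is JSON Lines, with one record per pair. The sketch below shows that serialization only. The record keys (`prompt`, `completion`) and the helper name `qa_pairs_to_jsonl` are assumptions for illustration, not Nagato's actual format, and the pairs themselves would come from Nagato's generation step.

```python
import json


def qa_pairs_to_jsonl(pairs, prompt_key="prompt", completion_key="completion"):
    """Serialize (question, answer) tuples as one JSON object per line."""
    lines = []
    for question, answer in pairs:
        question, answer = question.strip(), answer.strip()
        if not question or not answer:
            continue  # skip incomplete pairs
        record = {prompt_key: question, completion_key: answer}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Placeholder pairs; real ones would be generated from the corpus.
example_pairs = [
    ("What is the document about?", "A placeholder answer drawn from the corpus."),
    ("  ", "This pair has no question and is skipped."),
]
print(qa_pairs_to_jsonl(example_pairs))
```

The key names are parameters because providers differ in the schema they expect.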

Structured data:

Sandboxed REPL: Nagato features a secure, sandboxed Read-Eval-Print Loop (REPL) environment to execute code snippets against the structured text data. This facilitates flexible and dynamic processing of structured data formats like JSON, CSV or XML.
Evaluation/Prediction Using a Code Interpreter: After initial processing, a code interpreter evaluates code snippets within the sandboxed environment to produce predictions or analyses based on the structured text data. This allows the extraction of specialized insights tailored to the domain or subject matter.
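
To make the idea of running a snippet against structured data concrete, here is a minimal sketch that executes a code string in a separate Python process with a timeout and passes the data file path as an argument. This is not Nagato's sandbox, and a child process with a timeout is not a security boundary: real sandboxing needs OS-level isolation such as containers or restricted users. `run_snippet` is a hypothetical helper.

```python
import subprocess
import sys


def run_snippet(code, data_path, timeout=5):
    """Run `code` in a separate Python process.

    The data file path is available to the snippet as sys.argv[1].
    Returns (exit_code, stdout, stderr); raises subprocess.TimeoutExpired
    if the snippet runs longer than `timeout` seconds.
    """
    completed = subprocess.run(
        # -I runs Python in isolated mode (ignores environment variables
        # such as PYTHONPATH and the user site-packages directory).
        [sys.executable, "-I", "-c", code, data_path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout, completed.stderr


# Example snippet: count the rows of a CSV file.
COUNT_ROWS = """
import csv, sys
with open(sys.argv[1], newline="") as f:
    rows = list(csv.DictReader(f))
print(len(rows))
"""
```

Calling `run_snippet(COUNT_ROWS, "data.csv")` returns the exit code together with the captured output, so the caller can treat a non-zero exit code or a timeout as a failed evaluation.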

Citation

If you use Nagato in your research, please cite it as follows:

@misc{nagato,
  author = {Ismail Pelaseyed},
  title = {Nagato: The open framework for Q&A finetuning LLMs on private data},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/homanp/nagato}},
}
