cwenner/retrospective-llm-eval

Code and datasets for retroactive evaluation

Setup

Step 1. Copy .env.template to .env and enter your keys; notably OPENAI_API_KEY.
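For example, if only OpenAI-hosted models will be evaluated, the resulting .env can be as small as the sketch below (the value is a placeholder for your own key):

OPENAI_API_KEY=<your OpenAI API key>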

Step 2. Install Python requirements:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt

Use

python evaluate_dataset.py <optional arguments>

By default, this will run an evaluation with gpt-3.5-turbo on all 100 samples in the Misconceptions category of TruthfulQA.

To specify the model, supply --model <name>, using LiteLLM's model naming.

To make a shorter test run, supply --num-samples <count>.

To see the generated answer to every question, add --verbose.

Example:

python evaluate_dataset.py --model gpt-4-1106-preview --num-samples 3 --verbose

More options: python evaluate_dataset.py --help.

Evaluation

Individual evaluations can be run as follows:

python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model davinci-002
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model gpt-3.5-turbo
python evaluate_dataset.py --dataset-file 'datasets/generated_dataset_unfiltered.csv' --model davinci-002

Alternatively, all combinations can be executed at once with the test runner:

python run_evaluations.py

To combine the results of these runs into a single CSV file, collected_results.csv, run

python collect_results.py
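
The combined file is an ordinary CSV, so it can be opened in a spreadsheet or, if pandas is installed, inspected with a short script such as the sketch below (the column names depend on which runs were collected):

import pandas as pd

# Load the combined results and show the first few rows.
results = pd.read_csv("collected_results.csv")
print(results.head())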

Running HuggingFace models

Evaluating models through hosted APIs such as those of OpenAI and Anthropic requires only an API key.

It is also possible to use a dedicated inference endpoint by adding it to the model_name2endpoint data structure.
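
Assuming model_name2endpoint is a plain mapping from model names to endpoint URLs (check the source for its exact shape and location), an added entry could look like the following sketch, where the name and URL are placeholders:

# Hypothetical entry; name and URL are illustrative only.
model_name2endpoint = {
    "my-hosted-model": "https://my-inference-endpoint.example.com",
}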

To evaluate a self-hosted Hugging Face model, however, the model must be served separately, for example with Oobabooga's text-generation-webui.
