Step 1. Copy .env.template to .env and enter your keys, notably OPENAI_API_KEY.
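A minimal .env might look like the following. OPENAI_API_KEY is the key the default gpt-3.5-turbo run needs; the Anthropic line is only a placeholder for whatever other providers you use:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...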
Step 2. Install Python requirements:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
Step 3. Run the evaluation:
python evaluate_dataset.py <optional arguments>
By default, this will run an evaluation with gpt-3.5-turbo on all 100 samples in the Misconceptions category of TruthfulQA.
To specify a model, supply --model <name>, using LiteLLM naming.
To make a shorter test run, supply --num-samples <count>.
To see the generated answer to every question, add --verbose.
Example:
python evaluate_dataset.py --model gpt-4-1106-preview --num-samples 3 --verbose
More options: python evaluate_dataset.py --help
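Under the hood, the evaluation boils down to sending each question to the model via LiteLLM and scoring the reply. A rough sketch of that loop, assuming samples with question and best_answer fields (the field names and the naive exact-match scoring here are illustrative, not the script's actual schema or grader):

import litellm

def evaluate(samples, model="gpt-3.5-turbo"):
    correct = 0
    for sample in samples:
        # LiteLLM routes the request to the right provider based on the model name.
        response = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": sample["question"]}],
        )
        answer = response.choices[0].message.content
        # Naive scoring for illustration: exact match against the reference answer.
        correct += int(answer.strip() == sample["best_answer"])
    return correct / len(samples)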
Tests can be run individually, for example:
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model davinci-002
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model gpt-3.5-turbo
python evaluate_dataset.py --dataset-file 'datasets/generated_dataset_unfiltered.csv' --model davinci-002
More simply, all combinations can be executed using the test runner:
python run_evaluations.py
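Conceptually, the runner just iterates over every dataset/model pair and shells out to the evaluation script. A sketch of that idea (the actual lists of datasets and models live in run_evaluations.py itself):

import itertools
import subprocess

datasets = [
    "datasets/crafted_dataset_unfiltered.jsonl",
    "datasets/generated_dataset_unfiltered.csv",
]
models = ["davinci-002", "gpt-3.5-turbo"]

# Run every dataset/model combination sequentially.
for dataset, model in itertools.product(datasets, models):
    subprocess.run(
        ["python", "evaluate_dataset.py", "--dataset-file", dataset, "--model", model],
        check=True,
    )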
To combine the results of runs into a single CSV, collected_results.csv, run:
python collect_results.py
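The collection step amounts to concatenating the per-run result files. A minimal sketch, assuming each run writes a CSV under a results/ directory (the directory name and per-run file format are assumptions, not the script's documented layout):

import glob
import pandas as pd

# Concatenate every per-run CSV into one combined file.
frames = [pd.read_csv(path) for path in glob.glob("results/*.csv")]
pd.concat(frames, ignore_index=True).to_csv("collected_results.csv", index=False)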
Evaluating through an API, like that of OpenAI or Anthropic, can be done with just an API key.
It is also possible to use a dedicated inference endpoint by updating the data structure model_name2endpoint.
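Judging by its name, the mapping pairs a model name with the URL of its endpoint; the entries below are placeholders for illustration, not shipped defaults:

# Maps a model name to a dedicated inference endpoint (example values only).
model_name2endpoint = {
    "gpt-3.5-turbo": "https://example-endpoint.openai.azure.com/",
    "my-finetuned-model": "https://my-inference-host.example.com/v1",
}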
To evaluate a self-hosted Hugging Face model, however, the model must be served separately, for example with Oobabooga.
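Once the model is being served (Oobabooga's text-generation-webui exposes an OpenAI-compatible API), LiteLLM can be pointed at it via its api_base parameter. A sketch, assuming a local server on port 5000; the model name and port depend on your setup:

import litellm

# Query a locally served model through its OpenAI-compatible endpoint.
# The "openai/" prefix tells LiteLLM to speak the OpenAI protocol.
response = litellm.completion(
    model="openai/local-model",
    api_base="http://127.0.0.1:5000/v1",
    api_key="dummy",  # local servers typically ignore the key
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)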