Step 1. Copy .env.template to .env and enter your keys, notably OPENAI_API_KEY.
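A minimal .env might look like the following. OPENAI_API_KEY is the key the default gpt-3.5-turbo run needs; the Anthropic line is only a placeholder for whatever other providers you use:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...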
Step 2. Install Python requirements:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
Step 3. Run the evaluation:
python evaluate_dataset.py <optional arguments>
By default, this will run an evaluation with gpt-3.5-turbo on all 100 samples in the Misconceptions category of TruthfulQA.
To specify a model, supply --model <name>, using LiteLLM naming.
To make a shorter test run, supply --num-samples <count>.
To see the generated answer to every question, add --verbose.
Example:
python evaluate_dataset.py --model gpt-4-1106-preview --num-samples 3 --verbose
More options: python evaluate_dataset.py --help
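Under the hood, the evaluation boils down to sending each question to the model via LiteLLM and scoring the reply. A rough sketch of that loop, assuming samples with question and best_answer fields (the field names and the naive exact-match scoring here are illustrative, not the script's actual schema or grader):

import litellm

def evaluate(samples, model="gpt-3.5-turbo"):
    correct = 0
    for sample in samples:
        # LiteLLM routes the request to the right provider based on the model name.
        response = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": sample["question"]}],
        )
        answer = response.choices[0].message.content
        # Naive scoring for illustration: exact match against the reference answer.
        correct += int(answer.strip() == sample["best_answer"])
    return correct / len(samples)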
Tests can be run individually, for example:
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model davinci-002
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model gpt-3.5-turbo
python evaluate_dataset.py --dataset-file 'datasets/generated_dataset_unfiltered.csv' --model davinci-002
More simply, all combinations can be executed using the test runner:
python run_evaluations.py
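Conceptually, the runner just iterates over every dataset/model pair and shells out to the evaluation script. A sketch of that idea (the actual lists of datasets and models live in run_evaluations.py itself):

import itertools
import subprocess

datasets = [
    "datasets/crafted_dataset_unfiltered.jsonl",
    "datasets/generated_dataset_unfiltered.csv",
]
models = ["davinci-002", "gpt-3.5-turbo"]

# Run every dataset/model combination sequentially.
for dataset, model in itertools.product(datasets, models):
    subprocess.run(
        ["python", "evaluate_dataset.py", "--dataset-file", dataset, "--model", model],
        check=True,
    )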
To combine the results of runs into a single CSV, collected_results.csv, run:
python collect_results.py
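The collection step amounts to concatenating the per-run result files. A minimal sketch, assuming each run writes a CSV under a results/ directory (the directory name and per-run file format are assumptions, not the script's documented layout):

import glob
import pandas as pd

# Concatenate every per-run CSV into one combined file.
frames = [pd.read_csv(path) for path in glob.glob("results/*.csv")]
pd.concat(frames, ignore_index=True).to_csv("collected_results.csv", index=False)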
Evaluating through an API, like that of OpenAI or Anthropic, can be done with just an API key.
It is also possible to use a dedicated inference endpoint by updating the data structure model_name2endpoint.
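Judging by its name, the mapping pairs a model name with the URL of its endpoint; the entries below are placeholders for illustration, not shipped defaults:

# Maps a model name to a dedicated inference endpoint (example values only).
model_name2endpoint = {
    "gpt-3.5-turbo": "https://example-endpoint.openai.azure.com/",
    "my-finetuned-model": "https://my-inference-host.example.com/v1",
}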
To evaluate a self-hosted Hugging Face model, however, the model must be served separately, for example with Oobabooga.
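Once the model is being served (Oobabooga's text-generation-webui exposes an OpenAI-compatible API), LiteLLM can be pointed at it via its api_base parameter. A sketch, assuming a local server on port 5000; the model name and port depend on your setup:

import litellm

# Query a locally served model through its OpenAI-compatible endpoint.
# The "openai/" prefix tells LiteLLM to speak the OpenAI protocol.
response = litellm.completion(
    model="openai/local-model",
    api_base="http://127.0.0.1:5000/v1",
    api_key="dummy",  # local servers typically ignore the key
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)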