Large Language Model Evaluation
Convert a chat to a solver function
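A solver turns each sample's input into model output; generate() wraps an 'ellmer' chat into such a function. A minimal sketch (the model name is a placeholder, any ellmer chat should work):

```r
library(vitals)
library(ellmer)

# generate() converts an ellmer Chat into a solver function that the
# task calls on every sample's `input`; the model name here is an
# assumption.
solver <- generate(chat_anthropic(model = "claude-3-7-sonnet-latest"))
```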
Scoring with string detection
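String-detection scorers grade a sample by simple text matching rather than by asking a model. A sketch with detect_includes(), which checks whether the target text appears in the solver's output:

```r
library(vitals)

# detect_includes() returns a scorer that marks a sample correct when
# the sample's `target` string appears in the solver's output.
scorer <- detect_includes()
```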
Model-based scoring
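Model-based scorers ask a second model to judge the solver's answer against the target, which suits open-ended questions that string detection can't handle. A sketch with model_graded_qa(); the partial_credit argument is an assumption mirroring the Inspect analogue:

```r
library(vitals)

# model_graded_qa() returns a scorer that prompts a grading model to
# compare the solver's answer with the sample's `target`.
# partial_credit = TRUE (assumed to mirror Inspect's option) allows a
# "partially correct" grade alongside correct/incorrect.
scorer <- model_graded_qa(partial_credit = TRUE)
```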
Creating and evaluating tasks
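A Task pairs a dataset (a data frame with input and target columns) with a solver and a scorer; calling $eval() runs the model, scores the results, and writes a log. A sketch along the lines of the package README (the model name is a placeholder):

```r
library(vitals)
library(ellmer)
library(tibble)

# The dataset needs `input` and `target` columns.
simple_addition <- tibble(
  input = c("What's 2+2?", "What's 2+3?"),
  target = c("4", "5")
)

tsk <- Task$new(
  dataset = simple_addition,
  solver = generate(chat_anthropic(model = "claude-3-7-sonnet-latest")),
  scorer = model_graded_qa()
)

# Runs the solver on each input, scores the outputs, and logs the run.
tsk$eval()
```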
Concatenate task samples for analysis
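vitals_bind() stacks the per-sample results of several evaluated tasks into one tibble so that, for example, two models can be compared side by side. A sketch assuming tsk_claude and tsk_gpt are hypothetical Tasks that have already been evaluated:

```r
library(vitals)

# The argument names (claude, gpt) label each task's rows in the
# combined tibble; tsk_claude and tsk_gpt are hypothetical evaluated
# Task objects.
results <- vitals_bind(claude = tsk_claude, gpt = tsk_gpt)
results
```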
Prepare logs for deployment
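vitals_bundle() converts a directory of evaluation logs into a standalone static bundle of the log viewer that can be deployed as a website. A sketch; the output path is hypothetical and the argument names reflect one reading of the docs:

```r
library(vitals)

# Bundle the logs in the default log directory into a self-contained
# static site under "site/" (a hypothetical path), suitable for
# deploying to e.g. GitHub Pages.
vitals_bundle(output_dir = "site", overwrite = TRUE)
```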
The log directory
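Evaluation logs land in the directory named by the VITALS_LOG_DIR environment variable, and vitals_log_dir() reports the current value. A sketch for pointing logs at a project-local folder:

```r
library(vitals)

# Report where logs are currently written.
vitals_log_dir()

# Redirect logs for this session; set VITALS_LOG_DIR in .Renviron to
# make the choice persistent.
Sys.setenv(VITALS_LOG_DIR = "logs")
```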
Interactively view local evaluation logs
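vitals_view() launches the interactive log viewer on local logs:

```r
library(vitals)

# Opens the interactive log viewer on the current log directory.
vitals_view()
```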
vitals: Large Language Model Evaluation
A port of 'Inspect', a widely adopted 'Python' framework for large language model evaluation. Specifically aimed at 'ellmer' users who want to measure the effectiveness of their large language model-based products, the package supports prompt engineering, tool usage, multi-turn dialog, and model-graded evaluations.