Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation
Deterministically alternate sample order in pairs
Live Anthropic (Claude) comparison for a single pair of samples
Create an Anthropic Message Batch
Download Anthropic Message Batch results (.jsonl)
Retrieve an Anthropic Message Batch by ID
Poll an Anthropic Message Batch until completion
Build Anthropic Message Batch requests from a tibble of pairs
Build Bradley–Terry comparison data from pairwise results
Build EloChoice comparison data from pairwise results
Build Gemini batch requests from a tibble of pairs
Build OpenAI batch JSONL lines for paired comparisons
Build a concrete LLM prompt from a template
Check configured API keys for LLM backends
Check positional bias and bootstrap consistency reliability
Compute consistency between forward and reverse pair comparisons
Internal: Google Gemini API key helper
Internal: parse a Gemini GenerateContentResponse into the standard tibble
Internal: Together.ai API key helper
Ensure only one Ollama model is loaded in memory
Fit a Bradley–Terry model with sirt, falling back to BradleyTerry2
Fit an EloChoice model to pairwise comparison data
Live Google Gemini comparison for a single pair of samples
Create a Gemini Batch job from request objects
Download Gemini Batch results to a JSONL file
Retrieve a Gemini Batch job by name
Poll a Gemini Batch job until completion
Retrieve a named prompt template
List available prompt templates
Backend-agnostic live comparison for a single pair of samples
Extract results from a pairwiseLLM batch object
Submit pairs to an LLM backend via batch API
Create all unordered pairs of writing samples
Live Ollama comparison for a single pair of samples
Live OpenAI comparison for a single pair of samples
Create an OpenAI batch from an uploaded file
Download the output file for a completed batch
Retrieve an OpenAI batch
Poll an OpenAI batch until it completes or fails
Upload a JSONL batch file to OpenAI
Parse Anthropic Message Batch output into a tibble
Parse Gemini batch JSONL output into a tibble of pairwise results
Parse an OpenAI Batch output JSONL file
Randomly assign samples to positions SAMPLE_1 and SAMPLE_2
Read writing samples from a data frame
Read writing samples from a directory of .txt files
Register a named prompt template
Remove a registered prompt template
Run an Anthropic batch pipeline for pairwise comparisons
Run a Gemini batch pipeline for pairwise comparisons
Run a full OpenAI batch pipeline for pairwise comparisons
Randomly sample pairs of writing samples
Sample reversed versions of a subset of pairs
Get or set a prompt template for pairwise comparisons
Live Anthropic (Claude) comparisons for a tibble of pairs
Live Google Gemini comparisons for a tibble of pairs
Backend-agnostic live comparisons for a tibble of pairs
Live Ollama comparisons for a tibble of pairs
Live OpenAI comparisons for a tibble of pairs
Live Together.ai comparisons for a tibble of pairs
Summarize a Bradley–Terry model fit
Live Together.ai comparison for a single pair of samples
Get a trait name and description for prompts
Write an OpenAI batch table to a JSONL file
Provides a unified framework for generating, submitting, and analyzing pairwise comparisons of writing quality using large language models (LLMs). The package supports both live and batch evaluation workflows across multiple providers ('OpenAI', 'Anthropic', 'Google Gemini', 'Together AI', and locally hosted 'Ollama' models), includes bias-tested prompt templates and a flexible template registry, and offers tools for constructing forward and reversed comparison sets to analyze consistency and positional bias. Results can be modeled using Bradley–Terry (1952) <doi:10.2307/2334029> or Elo rating methods to derive writing quality scores. For information on the method of pairwise comparisons, see Thurstone (1927) <doi:10.1037/h0070288> and Heldsinger & Humphry (2010) <doi:10.1007/BF03216919>. For information on Elo ratings, see Clark et al. (2018) <doi:10.1371/journal.pone.0190393>.
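To illustrate the modeling step the Description refers to, here is a minimal, language-agnostic sketch of Bradley–Terry ability estimation via the classic minorization-maximization update (each item's ability is its win count divided by a sum of pairwise comparison rates). This is not the package's implementation — the package delegates fitting to sirt/BradleyTerry2 in R — and the function name and pair encoding below are illustrative assumptions.

```python
from collections import Counter

def fit_bradley_terry(wins, n_items, iters=200):
    """Estimate Bradley-Terry abilities from (winner, loser) index pairs.

    wins    : list of (winner, loser) tuples, items indexed 0..n_items-1
    returns : list of abilities normalized to sum to 1
    """
    w = Counter(wins)                       # w[(i, j)] = times i beat j
    n = Counter()                           # n[(i, j)] = comparisons of i vs j
    for (i, j), c in w.items():
        n[(i, j)] += c
        n[(j, i)] += c
    total_wins = Counter(i for i, _ in wins)  # W_i = total wins for item i

    p = [1.0] * n_items                     # start from equal abilities
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(n[(i, j)] / (p[i] + p[j])
                        for j in range(n_items) if j != i and n[(i, j)])
            new_p.append(total_wins[i] / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]          # normalize each iteration
    return p
```

A usage sketch: if sample 0 beats sample 1 in 3 of 4 comparisons, sample 1 beats sample 2 in 3 of 4, and sample 0 beats sample 2 every time, the fitted abilities recover the ordering 0 > 1 > 2.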
Useful links