Evaluators
Evaluation helpers let you score generated data. Metrics can run inline as columns are produced or over the entire dataset after generation.
Available evaluation functions
Current helpers include exact match, semantic similarity, BLEU score, normalized edit distance and an LLM-as-a-judge metric.
Inline evaluation
Add evaluation functions directly to the dataset schema. Each row will include a score column.
import asyncio
from chatan import dataset, eval, sample
async def main():
ds = dataset({
"col1": sample.choice(["a", "a", "b"]),
"col2": "b",
"exact_match": eval.exact_match("col1", "col2")
})
df = await ds.generate(n=100)
print(df.head())
return df
df = asyncio.run(main())
Aggregate evaluation
Metrics can also be aggregated across the whole dataset using
Dataset.evaluate. Note: data must be generated first.
# After generating data
aggregate = ds.evaluate({
"exact_match": ds.eval.exact_match("col1", "col2"),
})
print(aggregate)
Comparing variations
Evaluate multiple columns at once to compare prompts or models.
import asyncio
from chatan import dataset, eval, sample
async def main():
ds = dataset({
"sample_1": sample.choice(["a", "a", "b"]),
"sample_2": sample.choice(["a", "b"]),
"ground_truth": "b",
})
df = await ds.generate(n=100)
results = ds.evaluate({
"sample_1_match": ds.eval.exact_match("sample_1", "ground_truth"),
"sample_2_match": ds.eval.exact_match("sample_2", "ground_truth"),
})
return df, results
df, results = asyncio.run(main())
Supported metrics
The evaluate module provides helpers such as exact match, semantic
similarity, BLEU score, edit distance and an LLM-as-a-judge metric. Access them
through ds.eval for aggregate evaluation or eval for inline use.