Evaluators

Evaluation helpers let you score generated data. Metrics can run inline as columns are produced or over the entire dataset after generation.

Available evaluation functions

Current helpers include exact match, semantic similarity, BLEU score, normalized edit distance and an LLM-as-a-judge metric.
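
To make two of these metrics concrete, here is a minimal standalone sketch of exact match and normalized edit distance. It is not chatan's implementation. The function names, the choice to normalize by the longer string's length, and the absence of any case or whitespace normalization are all assumptions made for illustration.

```python
def exact_match_score(prediction: str, target: str) -> float:
    """Return 1.0 when the two strings are identical, else 0.0 (hypothetical helper)."""
    return 1.0 if prediction == target else 0.0


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so the result lies in [0, 1].

    Assumption: normalization by the longer string; other conventions exist.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest
```

Under these assumptions, "kitten" versus "sitting" has an edit distance of 3 and a normalized distance of 3/7.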

Inline evaluation

Add evaluation functions directly to the dataset schema. Each generated row then includes a score column named after its schema key.

import asyncio
from chatan import dataset, eval, sample

async def main():
    ds = dataset({
        "col1": sample.choice(["a", "a", "b"]),
        "col2": "b",
        "exact_match": eval.exact_match("col1", "col2")
    })

    df = await ds.generate(n=100)
    print(df.head())
    return df

df = asyncio.run(main())

Aggregate evaluation

Metrics can also be aggregated across the whole dataset using Dataset.evaluate. Note: data must be generated first.

# Run inside main(), after `await ds.generate(n=100)`, while `ds` is in scope
aggregate = ds.evaluate({
    "exact_match": ds.eval.exact_match("col1", "col2"),
})
print(aggregate)
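
As a rough mental model of what an aggregate exact-match score represents, the sketch below averages per-row matches over plain Python dictionaries. That Dataset.evaluate reduces row scores with a mean is an assumption and is not confirmed here, and aggregate_exact_match is a hypothetical helper that is not part of chatan.

```python
from typing import Mapping, Sequence


def aggregate_exact_match(
    rows: Sequence[Mapping[str, str]], left: str, right: str
) -> float:
    """Fraction of rows whose `left` and `right` columns are identical.

    Hypothetical helper; assumes aggregation means averaging row scores.
    """
    if not rows:
        raise ValueError("generate data before evaluating: no rows to score")
    matches = sum(1 for row in rows if row[left] == row[right])
    return matches / len(rows)


# Four hand-written rows standing in for generated data
rows = [
    {"col1": "a", "col2": "b"},
    {"col1": "b", "col2": "b"},
    {"col1": "a", "col2": "b"},
    {"col1": "b", "col2": "b"},
]
score = aggregate_exact_match(rows, "col1", "col2")
print(score)
```

The empty-input check mirrors the requirement that data be generated before it is evaluated.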

Comparing variations

Evaluate multiple columns at once to compare prompts or models.

import asyncio
from chatan import dataset, eval, sample

async def main():
    ds = dataset({
        "sample_1": sample.choice(["a", "a", "b"]),
        "sample_2": sample.choice(["a", "b"]),
        "ground_truth": "b",
    })

    df = await ds.generate(n=100)

    # Score both candidate columns against the same ground truth
    results = ds.evaluate({
        "sample_1_match": ds.eval.exact_match("sample_1", "ground_truth"),
        "sample_2_match": ds.eval.exact_match("sample_2", "ground_truth"),
    })
    print(results)
    return df, results

df, results = asyncio.run(main())
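
When reading the results of the comparison above, it helps to know roughly what rates to expect. The sketch below computes the probability that each column equals the ground truth, assuming sample.choice draws uniformly from its list so that repeated entries act as weights. That sampling behavior is an assumption, and expected_match_rate is a hypothetical helper, not a chatan function.

```python
from fractions import Fraction
from typing import Sequence


def expected_match_rate(choices: Sequence[str], target: str) -> Fraction:
    """Probability that a uniform draw from `choices` equals `target`."""
    return Fraction(choices.count(target), len(choices))


# Same choice lists and ground truth as the comparison example
rate_1 = expected_match_rate(["a", "a", "b"], "b")
rate_2 = expected_match_rate(["a", "b"], "b")
print(rate_1, rate_2)
```

Under that assumption, sample_1_match should land near 1/3 and sample_2_match near 1/2 as n grows.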

Supported metrics

The evaluate module provides helpers such as exact match, semantic similarity, BLEU score, normalized edit distance and an LLM-as-a-judge metric. Access them through ds.eval for aggregate evaluation or through eval for inline use.