Evaluators

Evaluation helpers let you score generated data. Metrics can run inline as columns are produced or over the entire dataset after generation.

Available evaluation functions

Current helpers include exact match, semantic similarity, BLEU score, normalized edit distance and an LLM-as-a-judge metric.
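To make two of these metrics concrete, here is a minimal pure-Python sketch of exact match and a normalized edit distance. It is not chatan's implementation: the function names are hypothetical, and normalizing by the length of the longer string is an assumption (the library may normalize differently or report a similarity instead of a distance).

```python
def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when both strings are identical, otherwise 0.0."""
    return 1.0 if prediction == reference else 0.0


def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest
```

Under this assumed definition, 0.0 means the strings are identical and 1.0 means every position differs.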

Inline evaluation

Add an evaluation function to the dataset schema as its own column. Each generated row then carries the score under that column name.

import asyncio
from chatan import dataset, eval, sample

async def main():
    ds = dataset({
        "col1": sample.choice(["a", "a", "b"]),
        "col2": "b",
        # Inline metric: adds an exact_match score column comparing col1 with col2
        "exact_match": eval.exact_match("col1", "col2")
    })

    df = await ds.generate(n=100)
    print(df.head())
    return df

df = asyncio.run(main())

Aggregate evaluation

Metrics can also be aggregated across the whole dataset with Dataset.evaluate. The data must be generated before evaluate is called.

# After generating data: ds is a dataset whose generate() call has already completed
aggregate = ds.evaluate({
    "exact_match": ds.eval.exact_match("col1", "col2"),
})
print(aggregate)
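As a rough illustration of what an aggregate score represents, the sketch below computes exact match over a list of rows in plain Python. It assumes the aggregate is the mean of the per-row scores, which the text above does not confirm, and aggregate_exact_match is a hypothetical helper rather than part of chatan.

```python
from typing import Mapping, Sequence


def aggregate_exact_match(
    rows: Sequence[Mapping[str, str]], predicted: str, expected: str
) -> float:
    """Return the fraction of rows where the two columns hold equal values."""
    if not rows:
        return 0.0  # no rows: report 0.0 rather than dividing by zero
    matches = sum(1 for row in rows if row[predicted] == row[expected])
    return matches / len(rows)


# Hand-written rows standing in for generated data
rows = [
    {"col1": "a", "col2": "b"},
    {"col1": "b", "col2": "b"},
    {"col1": "a", "col2": "b"},
    {"col1": "b", "col2": "b"},
]
score = aggregate_exact_match(rows, "col1", "col2")
print(score)  # 0.5 (two of the four rows match)
```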

Comparing variations

Evaluate multiple columns at once to compare prompts or models.

import asyncio
from chatan import dataset, eval, sample

async def main():
    ds = dataset({
        "sample_1": sample.choice(["a", "a", "b"]),
        "sample_2": sample.choice(["a", "b"]),
        "ground_truth": "b",
    })

    df = await ds.generate(n=100)

    results = ds.evaluate({
        "sample_1_match": ds.eval.exact_match("sample_1", "ground_truth"),
        "sample_2_match": ds.eval.exact_match("sample_2", "ground_truth"),
    })
    return df, results

df, results = asyncio.run(main())
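For a sense of what this comparison should show, the sketch below simulates the two samplers with Python's standard random module. It assumes sample.choice draws uniformly from its list, so a repeated entry is proportionally more likely; under that assumption sample_1 should match the ground truth about one time in three and sample_2 about one time in two. simulated_match_rate is a hypothetical helper and does not call chatan.

```python
import random
from typing import Sequence


def simulated_match_rate(
    choices: Sequence[str], ground_truth: str, n: int, seed: int
) -> float:
    """Estimate how often a uniform draw from choices equals ground_truth."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    hits = sum(1 for _ in range(n) if rng.choice(choices) == ground_truth)
    return hits / n


rate_1 = simulated_match_rate(["a", "a", "b"], "b", n=10_000, seed=0)
rate_2 = simulated_match_rate(["a", "b"], "b", n=10_000, seed=0)

print(f"sample_1 match rate: {rate_1:.2f}")
print(f"sample_2 match rate: {rate_2:.2f}")
```

If the aggregated exact-match scores differ markedly from these rates, that would suggest the sampling or aggregation assumptions made here do not hold.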

Supported metrics

The evaluate module provides helpers such as exact match, semantic similarity, BLEU score, normalized edit distance and an LLM-as-a-judge metric. Access them through ds.eval for aggregate evaluation, or through the imported eval module for inline use in a schema.