Datasets and Generators
Chatan builds datasets from two simple concepts: generators, which call large
language models, and samplers, which create structured values. Together they form
the schema for a Dataset.
All generation is async by default, enabling concurrent API calls for faster dataset creation.
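To see why async generation speeds things up, here is a minimal standard-library sketch. It does not use Chatan at all: `fake_model_call` is a hypothetical stand-in for a network-bound model request, and the 0.1 second delay is an arbitrary assumption. The point is only that awaiting many calls together lets their waiting time overlap.

```python
import asyncio
import time


async def fake_model_call(prompt: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network-bound LLM request (not a Chatan API)."""
    await asyncio.sleep(delay)  # simulates waiting on the provider
    return f"response to: {prompt}"


async def generate_rows(prompts: list[str]) -> list[str]:
    # gather() schedules every call at once, so the waits overlap
    # instead of running one after another.
    return await asyncio.gather(*(fake_model_call(p) for p in prompts))


def run_demo(n: int = 10) -> tuple[list[str], float]:
    prompts = [f"question {i}" for i in range(n)]
    start = time.perf_counter()
    rows = asyncio.run(generate_rows(prompts))
    elapsed = time.perf_counter() - start
    return rows, elapsed


if __name__ == "__main__":
    rows, elapsed = run_demo()
    print(len(rows), f"{elapsed:.2f}s")
```

Run sequentially, ten calls of 0.1 seconds each would take about a second; run concurrently, the total stays close to the duration of a single call.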
Supported generator providers
Chatan includes built-in clients for a few common model sources:
openai - access GPT models via the OpenAI API
anthropic - use Claude models from Anthropic
transformers / huggingface - run local HuggingFace models with transformers
Basic QA Dataset
A minimal dataset uses a single generator to create questions and answers.
import asyncio
import chatan


async def main():
    gen = chatan.generator("openai", "YOUR_API_KEY")
    ds = chatan.dataset({
        "question": gen("write an example question from a 5th grade math test"),
        "answer": gen("answer: {question}")
    })
    df = await ds.generate(n=100)
    return df


df = asyncio.run(main())
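In the example above, the answer column refers to {question}, so the question has to exist before the answer prompt can be filled in. The sketch below illustrates one way such column dependencies could be resolved, using only the standard library. It is an illustration of the idea, not Chatan's actual implementation; `referenced_columns`, `resolution_order`, `fill_row`, and the `produce` callback are all hypothetical names.

```python
from graphlib import TopologicalSorter
from string import Formatter


def referenced_columns(template: str) -> set[str]:
    """Return the {placeholder} names used in a prompt template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def resolution_order(schema: dict[str, str]) -> list[str]:
    """Order columns so each one comes after the columns it references."""
    graph = {
        col: referenced_columns(tpl) & schema.keys()
        for col, tpl in schema.items()
    }
    return list(TopologicalSorter(graph).static_order())


def fill_row(schema: dict[str, str], produce) -> dict[str, str]:
    """Build one row; `produce` stands in for a model call (hypothetical)."""
    row: dict[str, str] = {}
    for col in resolution_order(schema):
        prompt = schema[col].format(**row)  # earlier columns fill placeholders
        row[col] = produce(prompt)
    return row


if __name__ == "__main__":
    schema = {
        "answer": "answer: {question}",
        "question": "write an example question from a 5th grade math test",
    }
    print(resolution_order(schema))
    print(fill_row(schema, lambda prompt: f"<model output for: {prompt}>"))
```

Even though "answer" is listed first in this schema, the ordering step places "question" ahead of it, because the answer template references it.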
Creating Data Mixes
Mix generators with samplers to diversify prompts.
import asyncio
from chatan import dataset, generator, sample


async def main():
    gen = generator("openai", "YOUR_API_KEY")
    mix = [
        "san antonio, tx",
        "marfa, tx",
        "paris, fr"
    ]
    ds = dataset({
        "id": sample.uuid(),
        "topic": sample.choice(mix),
        "prompt": gen("write an example question about the history of {topic}"),
        "response": gen("respond to: {prompt}"),
    })
    df = await ds.generate(n=100)
    return df


df = asyncio.run(main())
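To make the role of the samplers concrete, the following standard-library sketch builds the non-model columns of the mix above by hand. It assumes that a uuid sampler behaves like `uuid.uuid4()` and a choice sampler like a uniform random pick, which is a guess at the behaviour rather than a description of Chatan's internals. `sample_rows` and the `seed` parameter are hypothetical.

```python
import random
import uuid
from typing import Optional


def sample_rows(topics: list[str], n: int, seed: Optional[int] = None) -> list[dict]:
    """Draw n rows of structured values and fill the prompt template for each."""
    rng = random.Random(seed)  # seeding makes the topic picks repeatable
    template = "write an example question about the history of {topic}"
    rows = []
    for _ in range(n):
        topic = rng.choice(topics)  # uniform pick, one per row
        rows.append({
            "id": str(uuid.uuid4()),  # unique identifier per row
            "topic": topic,
            "prompt_input": template.format(topic=topic),  # text sent to the model
        })
    return rows


if __name__ == "__main__":
    mix = ["san antonio, tx", "marfa, tx", "paris, fr"]
    for row in sample_rows(mix, n=3, seed=0):
        print(row)
```

Each row gets its own topic, so the prompts sent to the model vary across the dataset even though the template is fixed.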
Dataset Augmentation
Pull rows from existing corpora and ask the model to create new variations.
import asyncio
from datasets import load_dataset
import chatan


async def main():
    gen = chatan.generator("openai", "YOUR_API_KEY")
    hf_data = load_dataset("some/dataset")
    ds = chatan.dataset({
        "original_prompt": chatan.sample.from_dataset(hf_data, "prompt"),
        "variation": gen("rewrite this prompt: {original_prompt}"),
        "response": gen("respond to: {variation}")
    })
    df = await ds.generate(n=100)
    return df


df = asyncio.run(main())
Saving Datasets
After generation, datasets can be saved or converted to other formats.
import asyncio


async def main():
    # ... define ds ...

    # Generate
    df = await ds.generate(n=1000)

    # Save to various formats
    ds.save("my_dataset.parquet")
    ds.save("my_dataset.csv", format="csv")

    # Convert to HuggingFace format
    hf_dataset = ds.to_huggingface()
    return df


df = asyncio.run(main())
Advanced Examples
The snippets below show more complex recipes and local model usage.
Dataset Triton
import asyncio
import pandas as pd
from datasets import load_dataset
from chatan import generator, dataset, sample


async def main():
    gen = generator("openai", "YOUR_API_KEY")
    kernelbook = load_dataset("GPUMODE/KernelBook")
    kernelbench = load_dataset("ScalingIntelligence/KernelBench")

    # Recipe 1: ask for a kernel given a KernelBench operation
    ds_1 = dataset({
        "operation": sample.from_dataset(kernelbench, "id"),
        "prompt": gen("write a prompt asking for a Triton kernel for: {operation}"),
        "response": gen("{prompt}")
    })

    # Recipe 2: ask for existing Python code to be rewritten as a kernel.
    # The prompt references {original_prompt} so the model sees the sampled code.
    ds_2 = dataset({
        "original_prompt": sample.from_dataset(kernelbook, "python_code"),
        "prompt": gen("write a question asking for this code to be written as a Triton kernel: {original_prompt}"),
        "response": gen("{prompt}")
    })

    df_1 = await ds_1.generate(n=500)
    df_2 = await ds_2.generate(n=500)

    # Stack both recipes into one frame with a fresh index
    combined_df = pd.concat([df_1, df_2], ignore_index=True)
    return combined_df


combined_df = asyncio.run(main())
Complex Mixes
import asyncio
from datasets import load_dataset
from chatan import generator, dataset, sample


async def main():
    gen = generator("openai", "YOUR_API_KEY")
    # Same source datasets as in the previous example
    kernelbook = load_dataset("GPUMODE/KernelBook")
    kernelbench = load_dataset("ScalingIntelligence/KernelBench")
    mixed_ds = dataset({
        "dataset_type": sample.choice(["kernelbench", "kernelbook"]),
        "operation": sample.from_dataset(kernelbench, "id"),
        "original_code": sample.from_dataset(kernelbook, "python_code"),
        "prompt": gen("""
        {%- if dataset_type == "kernelbench" -%}
        write a prompt asking for a Triton kernel for: {operation}
        {%- else -%}
        write a question asking for this code to be written as a Triton kernel: {original_code}
        {%- endif -%}
        """),
        "response": gen("{prompt}")
    })
    final_df = await mixed_ds.generate(n=1000)
    mixed_ds.save("triton_kernel_dataset.parquet")
    return final_df


final_df = asyncio.run(main())
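The conditional in the prompt template above picks one of two instructions depending on the sampled dataset_type. Written as plain Python, the same branching looks like the sketch below. This only illustrates what the conditional expresses; it says nothing about how Chatan evaluates templates, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(row: dict) -> str:
    """Choose the instruction text based on the row's dataset_type."""
    if row["dataset_type"] == "kernelbench":
        return f"write a prompt asking for a Triton kernel for: {row['operation']}"
    # Any other value falls through to the code-rewriting instruction.
    return (
        "write a question asking for this code to be written as a Triton kernel: "
        f"{row['original_code']}"
    )


if __name__ == "__main__":
    example = {
        "dataset_type": "kernelbook",
        "operation": "softmax",          # placeholder value for illustration
        "original_code": "y = x.relu()",  # placeholder value for illustration
    }
    print(build_prompt(example))
```

Both source columns are sampled for every row, but only the one matching dataset_type ends up in the prompt.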
Transformers Local Generation
import asyncio
from chatan import generator, dataset, sample


async def main():
    # Use a local HuggingFace model
    gen = generator("transformers", model="gpt2")
    ds = dataset({
        "topic": sample.choice(["space", "history", "science"]),
        "prompt": gen("Ask a short question about {topic}"),
        "response": gen("{prompt}")
    })
    df = await ds.generate(n=5)
    return df


df = asyncio.run(main())