How to Create Synthetic Data for RAG Evaluation
Learn how to use DeepEval to create test datasets.
Introduction
Evaluating your LLM app (for example, a RAG-based chatbot) is a crucial step that many teams overlook.
It’s important to know how well your RAG pipeline performs: how accurate is retrieval? Does it return the correct documents? If so, how good are the answers the LLM generates? Does the model receive too much context, or not enough?
You can’t answer those questions without evaluation.
But how do you evaluate a RAG pipeline when you’re just getting started and don’t yet have users?
You can onboard early users and collect feedback while you build, but that rarely provides a reliable baseline. You still need an initial dataset.
💡 Tip: If you don’t yet have production data, start with synthetic datasets. They give you a baseline before real users arrive.
A practical solution is to synthesize test data, a common approach when no evaluation dataset exists.
In this post we’ll explore how to generate an evaluation dataset with DeepEval using Python.
Create Synthesized Data
DeepEval is an open-source LLM evaluation framework (think of it as “pytest for LLMs”) that lets developers unit-test LLM outputs much like traditional software tests.
DeepEval can also synthesize evaluation data, which is what we’ll use here.
Let’s get started.
First, install the dependencies:
$ pip install deepeval langchain langchain-community langchain-text-splitters chromadb tiktoken
Then create example.txt with sample content (the exact text doesn’t matter):
An octopus has three hearts and its blood is blue because it uses a copper-based protein called hemocyanin to transport oxygen. In the animal kingdom, other unique biological traits exist, such as the cube-shaped feces produced by wombats, which helps them mark their territory by preventing the droppings from rolling away. Shifting from biology to history, the shortest war ever recorded occurred between Britain and Zanzibar on August 27, 1896, and lasted for 38 minutes. In the realm of technology, the first video cassette recorder, introduced in 1956, was the size of a piano. The world of botany also contains surprises; for example, bananas are technically classified as berries, while strawberries are not. Expanding our view to the cosmos, a day on Venus is longer than its entire year due to its slow rotation speed. Back on Earth, the human brain operates with efficiency, running on about 20 watts of power, which is less than many refrigerator light bulbs.
Next, create a Synthesizer instance and call generate_goldens_from_docs():
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer(model="gpt-4.1-nano")
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True
)
print(synthesizer.synthetic_goldens)
Here is one example of the output:
Golden(input='Explain why octopus blood is blue.', actual_output=None, expected_output='Octopus blood is blue because it uses hemocyanin, a copper-based protein, to transport oxygen, which gives it a blue color.', context=['An octopus has three hearts and its blood is blue because it uses a copper-based protein called hemocyanin to transport oxygen. In the animal kingdom, other unique biological traits exist, such as the cube-shaped feces produced by wombats, which helps them mark their territory by preventing the droppings from rolling away. Shifting from biology to history, the shortest war ever recorded occurred between Britain and Zanzibar on August 27, 1896, and lasted for 38 minutes. In the realm of technology, the first video cassette recorder, introduced in 1956, was the size of a piano. The world of botany also contains surprises; for example, bananas are technically classified as berries, while strawberries are not. Expanding our view to the cosmos, a day on Venus is longer than its entire year due to its slow rotation speed. Back on Earth, the human brain operates with efficiency, running on about 20 watts of power, which is less than many refrigerator light bulbs.'], retrieval_context=None, turns=None, additional_metadata={'evolutions': ['In-Breadth'], 'synthetic_input_quality': 1.0}, comments=None, tools_called=None, expected_tools=None, source_file='example.txt', name=None, custom_column_key_values=None)
⚠️ Watch out: The quality of synthetic data depends heavily on the source text you provide. Garbage in means garbage out.
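Note that the goldens only live in memory on the synthesizer object. If you want to reuse them across runs, here is a minimal sketch of writing them to disk yourself, using only the fields visible in the repr above (recent DeepEval versions also ship their own save helpers, so check the docs for a built-in export first):
import json

# Assumes `synthesizer` from the snippet above has already generated goldens.
records = [
    {
        "input": golden.input,
        "expected_output": golden.expected_output,
        "context": golden.context,
        "source_file": golden.source_file,
    }
    for golden in synthesizer.synthetic_goldens
]

with open("goldens.json", "w") as f:
    json.dump(records, f, indent=2)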
How It Works
So, how does generate_goldens_from_docs() work?
Document parsing: The provided documents are split into chunks and inserted into Chroma DB.
Context selection: Random chunks are selected and evaluated by an LLM to ensure they are understandable, structured, relevant, and detailed.
Context grouping: The tool groups similar chunks together using a similarity score.
Generation: The LLM generates the synthetic input and the expected output.
💡 Tip: For more control, pass a ContextConstructionConfig() instance to tweak how context is created.
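As a rough sketch of what that might look like: the parameter names below (chunk_size, chunk_overlap, max_contexts_per_document) and the context_construction_config argument are taken from the DeepEval docs as I understand them, so treat them as assumptions and verify them against your installed version.
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ContextConstructionConfig

# Parameter names below are assumptions; check your DeepEval version's docs.
context_config = ContextConstructionConfig(
    chunk_size=512,                 # size of each document chunk
    chunk_overlap=64,               # overlap between neighbouring chunks
    max_contexts_per_document=3     # how many contexts to build per document
)

synthesizer = Synthesizer(model="gpt-4.1-nano")
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True,
    context_construction_config=context_config
)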
Fine-tune Synthesized Data Generation
There are three ways to customize generation:
Filtration
Evolution
Styling
Filtration
Filtration controls how strict the critic model should be when evaluating the quality of synthetic inputs.
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import FiltrationConfig
filtration_config = FiltrationConfig(
    critic_model="gpt-4.1-nano",
    synthetic_input_quality_threshold=0.7,
    max_quality_retries=3
)
synthesizer = Synthesizer(filtration_config=filtration_config)
critic_model: Model used to evaluate quality
synthetic_input_quality_threshold: Minimum quality score required for synthetic inputs
max_quality_retries: Maximum number of times generation is retried when an input scores below the threshold
💡 Tip: Start with a lower threshold to quickly generate data, then tighten quality controls once your pipeline is stable.
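The critic’s score is also surfaced on each golden (see the synthetic_input_quality key in the example output earlier), so you can sanity-check what the filter let through. A small sketch, assuming the metadata layout shown in that output:
# Inspect the critic's quality scores after generation.
for golden in synthesizer.synthetic_goldens:
    quality = (golden.additional_metadata or {}).get("synthetic_input_quality")
    print(f"{quality}: {golden.input}")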
Evolution
Evolution lets you control how complex the generated inputs become by weighting different evolution types.
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import EvolutionConfig, Evolution
evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/7,
        Evolution.MULTICONTEXT: 1/7,
        Evolution.CONCRETIZING: 1/7,
        Evolution.CONSTRAINED: 1/7,
        Evolution.COMPARATIVE: 1/7,
        Evolution.HYPOTHETICAL: 1/7,
        Evolution.IN_BREADTH: 1/7,
    },
    num_evolutions=4
)
synthesizer = Synthesizer(evolution_config=evolution_config)
Here, num_evolutions controls how many evolution steps are applied to each generated input.
💡 Tip: Mix multiple evolution types to make tests more realistic and avoid overfitting to one style of question.
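Each golden also records which evolutions were applied in its additional_metadata (the example output earlier shows [‘In-Breadth’]), which is handy for checking that your weights produce the mix you expect. A small sketch under that assumption:
from collections import Counter

# Tally the evolution types that actually ended up in the generated goldens.
evolution_counts = Counter(
    evolution
    for golden in synthesizer.synthetic_goldens
    for evolution in (golden.additional_metadata or {}).get("evolutions", [])
)
print(evolution_counts)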
Styling
Styling customizes the format and tone of generated inputs and expected outputs, which is useful when you have specific requirements.
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig
styling_config = StylingConfig(
    input_format="Python code snippet as a string.",
    expected_output_format="Beginner-friendly explanation of what the code does.",
    task="Explain Python code snippets in plain English for novices.",
    scenario="Non-technical learners want to understand Python code examples step-by-step."
)
synthesizer = Synthesizer(styling_config=styling_config)
input_format: Desired format for generated inputs
expected_output_format: How expected outputs should be formatted
task: Purpose of the LLM application you’re evaluating
scenario: The setting in which the LLM will be used
⚠️ Watch out: If you skip styling, your synthetic data may not reflect the real-world use case you care about.
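The three configs are not mutually exclusive. The examples above pass one config at a time, but they should compose into a single Synthesizer; here is a sketch under that assumption, reusing the config objects defined earlier:
# Assumes the Synthesizer constructor accepts all three configs together.
synthesizer = Synthesizer(
    model="gpt-4.1-nano",
    filtration_config=filtration_config,
    evolution_config=evolution_config,
    styling_config=styling_config
)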
Conclusion
Getting evaluation data for a RAG app is hard, especially when you’re just getting started and don’t yet have users. With DeepEval, you get a practical starting point: synthesize your own evaluation dataset and use it to gauge how well your initial approaches perform.
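As a rough sketch of that next step, you can feed each golden’s input through your RAG pipeline and score the results with DeepEval’s metrics. Here, my_rag_pipeline is a hypothetical placeholder for your own retrieval-and-generation code, and AnswerRelevancyMetric is just one of the metrics you might choose:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# my_rag_pipeline is a hypothetical stand-in for your own RAG app;
# it should return the generated answer and the retrieved chunks.
test_cases = []
for golden in synthesizer.synthetic_goldens:
    answer, retrieved_chunks = my_rag_pipeline(golden.input)
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            expected_output=golden.expected_output,
            retrieval_context=retrieved_chunks,
        )
    )

evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])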

