---
sidebar_position: 0
sidebar_label: Testing LLM Chains
slug: /configuration/testing-llm-chains
description: Learn how to test complex LLM chains and RAG systems with unit tests and end-to-end validation to ensure reliable outputs and catch failures across multi-step prompts
---

# Testing LLM chains

Prompt chaining is a common pattern used to perform more complex reasoning with LLMs. It's used by libraries like [LangChain](https://langchain.readthedocs.io/), and OpenAI has released built-in support via [OpenAI functions](https://openai.com/blog/function-calling-and-other-api-updates).

A "chain" is defined by a list of LLM prompts that are executed sequentially (and sometimes conditionally). The output of each LLM call is parsed, manipulated, or executed, and the result is fed into the next prompt.

This page explains how to test an LLM chain. At a high level, you have these options:

- Break the chain into separate calls and test those. This is useful if your testing strategy is closer to unit tests than end-to-end tests.
- Test the full end-to-end chain, with a single input and a single output. This is useful if you only care about the end result and are not interested in how the chain got there.

## Unit testing LLM chains

As mentioned above, the easiest way to test is one prompt at a time. This is straightforward with a basic promptfoo [configuration](/docs/configuration/guide).

Create a `promptfooconfig.yaml` for the first step of your chain. After configuring test cases for that step, create a new set of test cases for step 2, and so on.

## End-to-end testing for LLM chains

### Using a script provider

To test your chained LLMs, provide a script that takes a prompt as input and outputs the result of the chain. This approach is language-agnostic.

In this example, we'll test LangChain's LLM Math plugin by creating a script that takes a prompt and produces an output:

```python
# langchain_example.py
import sys
import os

from langchain_openai import OpenAI
from langchain.chains.llm_math.base import LLMMathChain

llm = OpenAI(
    temperature=0,
    api_key=os.getenv('OPENAI_API_KEY')
)
llm_math = LLMMathChain.from_llm(llm=llm)

prompt = sys.argv[1]
print(llm_math.run(prompt))
```

This script is set up so that we can run it like this:

```sh
python langchain_example.py "What is 2+2?"
```

Now, let's configure promptfoo to run this LangChain script with a bunch of test cases:

```yaml
prompts: file://prompt.txt
providers:
  - openai:chat:gpt-5.4
  - exec:python langchain_example.py
tests:
  - vars:
      question: What is the cube root of 389017?
  - vars:
      question: If you have 101101 in binary, what number does it represent in base 10?
  - vars:
      question: What is the natural logarithm (ln) of 89234?
  - vars:
      question: If a geometric series has a first term of 3125 and a common ratio of 0.008, what is the sum of the first 20 terms?
  - vars:
      question: A number in base 7 is 3526. What is this number in base 10?
  - vars:
      question: If a complex number is represented as 3 + 4i, what is its magnitude?
  - vars:
      question: What is the fourth root of 1296?
```

For an in-depth look at configuration, see the [guide](/docs/configuration/guide).

Note the following:

- **prompts**: `prompt.txt` is just a file that contains `{{question}}`, since we're passing the question directly through to the provider.
- **providers**: We list GPT-5.4 in order to compare its outputs with LangChain's LLMMathChain. We also use the `exec` directive to make promptfoo run the Python script in its eval.
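The test cases above only compare outputs side by side, but you can also have promptfoo grade them automatically by attaching assertions. Here's a minimal sketch, assuming the expected values worked out by hand are correct (73³ = 389,017 and 6⁴ = 1,296); `contains` is a loose substring match, so for stricter numeric checks you could swap in a `javascript` assertion:

```yaml
tests:
  - vars:
      question: What is the cube root of 389017?
    assert:
      # loose check: passes if "73" appears anywhere in the output
      - type: contains
        value: '73'
  - vars:
      question: What is the fourth root of 1296?
    assert:
      - type: contains
        value: '6'
```

Then run `npx promptfoo@latest eval` to see pass/fail results for both providers.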
In this example, the end result is a side-by-side comparison of GPT-5.4 vs. LangChain math performance:

![langchain eval](/img/docs/langchain-eval.png)

View the [full example on GitHub](https://github.com/promptfoo/promptfoo/tree/main/examples/integration-langchain).

### Using a custom provider

For finer-grained control, use a [custom provider](/docs/providers/custom-api). A custom provider is a short JavaScript file that defines a `callApi` function. This function can invoke your chain.

Even if your chain is not implemented in JavaScript, you can write a custom provider that shells out to Python. In the example below, we set up a custom provider that runs a Python script with the prompt as its argument. The output of the Python script is the final result of the chain.

```js title="chainProvider.js"
const { spawn } = require('child_process');

class ChainProvider {
  id() {
    return 'my-python-chain';
  }

  async callApi(prompt, context) {
    return new Promise((resolve, reject) => {
      // Run the Python chain with the constructed prompt as its argument
      const pythonProcess = spawn('python', ['./path_to_your_python_chain.py', prompt]);

      let output = '';
      pythonProcess.stdout.on('data', (data) => {
        output += data.toString();
      });

      pythonProcess.stderr.on('data', (data) => {
        reject(new Error(data.toString()));
      });

      pythonProcess.on('close', (code) => {
        if (code !== 0) {
          reject(new Error(`python script exited with code ${code}`));
        } else {
          resolve({
            output,
          });
        }
      });
    });
  }
}

module.exports = ChainProvider;
```

Note that you can always write the chain logic directly in JavaScript if you're comfortable with the language.

Now, we can set up a promptfoo config pointing to `chainProvider.js`:

```yaml
prompts:
  - file://prompt1.txt
  - file://prompt2.txt
# highlight-start
providers:
  - './chainProvider.js'
# highlight-end
tests:
  - vars:
      language: French
      input: Hello world
  - vars:
      language: German
      input: How's it going?
```

promptfoo will pass the full constructed prompts to `chainProvider.js` and the Python script, with variables substituted. In this case, the script will be called _# prompts_ \* _# test cases_ = 2 \* 2 = 4 times.

Using this approach, you can test your LLM chain end-to-end, view results in the [web view](/docs/usage/web-ui), set up [continuous testing](/docs/integrations/github-action), and so on.

## Retrieval-augmented generation (RAG)

For more detail on testing RAG pipelines, see [RAG evaluations](/docs/guides/evaluate-rag).

## Other tips

To reference the outputs of previous test cases, use the built-in [`_conversation` variable](/docs/configuration/chat#using-the-conversation-variable).
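For example, a prompt file along these lines replays prior turns before asking the next question. This is a minimal sketch using promptfoo's Nunjucks templating, assuming a `question` variable from your own test cases; see the linked page for the exact shape of `_conversation`:

```
{% for completion in _conversation %}
User: {{ completion.input }}
Assistant: {{ completion.output }}
{% endfor %}
User: {{ question }}
```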