Prompt engineering does not scale. You tune the retrieval prompt, the answer prompt breaks. You fix the answer prompt, the citation prompt regresses. In a compound pipeline with five or more modules, manual tuning is a never-ending whack-a-mole game with no systematic exit condition.
DSPy gives you the exit condition: treat the pipeline as a program, define a metric, and let the optimizer find the prompts automatically.
Why Compound AI Systems Break Prompt Engineering
A single-module LLM call is easy to tune. You write a prompt, run it on 20 examples, adjust the wording, and ship. The feedback loop is short enough to manage by hand.
Compound pipelines change the math. A RAG system with a query rewriter, retriever, context ranker, answer generator, and citation verifier has at least five prompts that interact. Improving the query rewriter changes what the retriever sees, which changes what the context ranker scores, which changes what the answer generator receives. You cannot optimize one module without touching the others.
The combinatorial surface is also enormous. If each of five modules has 10 reasonable prompt candidates, the search space is 10^5 configurations. Systematic search through that space by hand is not practical.
DSPy's thesis is that LLM pipeline construction is a compilation and optimization problem, not a manual engineering problem. You define typed Signatures (what goes in, what comes out), assemble them into Modules, write a metric function, and let an optimizer like MIPROv2 search over prompt candidates and few-shot demonstrations end-to-end.
DSPy 3.x Core Primitives
DSPy 3.x reorganized its API around three concepts that map cleanly onto the compile-and-optimize mental model.
Signatures
A Signature is a typed input/output contract for a single LLM call. It specifies what fields go in, what fields come out, and a natural-language description for each field that seeds the prompt template.
```python
import dspy

class QuestionToContext(dspy.Signature):
    """Rewrite the user question as a precise search query."""

    question: str = dspy.InputField(desc="The original user question")
    search_query: str = dspy.OutputField(desc="A precise search query for retrieval")

class ContextToAnswer(dspy.Signature):
    """Answer the question using only the provided context."""

    question: str = dspy.InputField(desc="The original user question")
    context: str = dspy.InputField(desc="Retrieved passages from the knowledge base")
    answer: str = dspy.OutputField(desc="A concise, factual answer citing the context")
    citations: list[str] = dspy.OutputField(desc="List of source passage IDs used")
```

The field descriptions are seed prompt text. The optimizer treats them as starting points and searches for better instruction text during compilation.
Modules
A Module is a composable unit that wraps one or more Signatures and defines how they chain together. DSPy ships three built-in wrappers:
- dspy.Predict: direct LLM call, no chain-of-thought
- dspy.ChainOfThought: adds a reasoning field before the output
- dspy.ReAct: tool-augmented agent loop
Modules compose naturally: a Module can contain other Modules, and the optimizer traces through the full graph.
```python
class RAGPipeline(dspy.Module):
    def __init__(self, retrieve_fn):
        super().__init__()
        self.retrieve = retrieve_fn
        self.rewrite = dspy.ChainOfThought(QuestionToContext)
        self.answer = dspy.ChainOfThought(ContextToAnswer)

    def forward(self, question: str) -> dspy.Prediction:
        rewritten = self.rewrite(question=question)
        passages = self.retrieve(rewritten.search_query, k=5)
        context = "\n\n".join(passages)
        return self.answer(question=question, context=context)
```

Optimizers: BootstrapFewShot, COPRO, and MIPROv2
DSPy ships three main optimizers with different tradeoffs:
| Optimizer | Optimizes | Training data needed | Compute cost | When to use |
|---|---|---|---|---|
| BootstrapFewShot | Demonstrations (few-shot examples) | 20+ examples | Low | Quick wins, limited data |
| COPRO | Instructions (prompt text) | 50+ examples | Medium | When instruction quality is the bottleneck |
| MIPROv2 | Instructions + demonstrations jointly | 100+ examples | High | Best overall quality, sufficient data |
BootstrapFewShot generates few-shot demonstrations by running the pipeline on training examples and keeping correct traces. It is fast, requires minimal data, and often gives a 5-15% metric improvement with no hyperparameter tuning.
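As a lower-cost first pass before reaching for MIPROv2, a BootstrapFewShot compile run takes only a few lines. This is a sketch that assumes a pipeline module and metric function like the ones defined in the walkthrough later in this guide; the demo counts are illustrative defaults, not tuned values:

```python
import dspy

# Sketch: quick few-shot bootstrapping before a full MIPROv2 run.
# Assumes `rag` (a dspy.Module), `answer_correctness_metric`, and
# `trainset` are defined as in the walkthrough below.
optimizer = dspy.BootstrapFewShot(
    metric=answer_correctness_metric,
    max_bootstrapped_demos=4,  # demos harvested from correct pipeline traces
    max_labeled_demos=4,       # demos taken directly from the labeled trainset
)
compiled_rag = optimizer.compile(rag, trainset=trainset)
compiled_rag.save("bootstrap_rag_v0.json")
```

If this already clears your quality bar, you can stop here and skip the heavier optimizers.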
COPRO uses a meta-LLM to propose and evaluate alternative instruction text for each module. It is slower than BootstrapFewShot but produces cleaner prompts that generalize better to out-of-distribution inputs.
MIPROv2 (Multi-prompt Instruction PRoposal Optimizer v2) combines both approaches. It runs a Bayesian search over a joint space of instruction candidates and demonstration sets, using the metric function to score each configuration on a held-out dev set. For a well-specified pipeline with 100+ training examples, MIPROv2 consistently outperforms the other two optimizers.
Why Self-Hosted LLMs Are Non-Negotiable for Serious Optimization
Three concrete reasons self-hosting matters for DSPy optimization.
Cost. MIPROv2 with num_trials=50 on a 5-module pipeline burns 2,000-8,000 inference calls per compilation run. At commercial API rates for a 70B-class model, a single optimization run costs $50-$300 depending on average context length. At those prices, you iterate once and call it done. On a self-hosted H100 GPU rental on Spheron, the same run costs $8-$40 in GPU time. You can afford to iterate. For a concrete cost comparison, see the GPU cost-per-token benchmarks covering on-demand vs self-hosted inference economics.
Rate limits. Commercial APIs throttle high-throughput optimization loops. A 5-module pipeline evaluating 100 training examples across 50 trials generates bursts of hundreds of concurrent requests. Most commercial API tiers queue these requests, turning a 90-minute optimization run into 6+ hours of wall-clock time. Self-hosted endpoints have no per-account rate limits: the only throttle is GPU memory and compute.
Reproducibility. Self-hosted models pin the model version, and temperature=0 gives near-deterministic outputs. When comparing prompt candidate A against prompt candidate B, you need the inference engine to be a constant. Commercial APIs offer no such guarantee: sampling randomness, silent model version updates, and infrastructure-level changes all inject noise into the comparison signal. A stable inference stack is what makes the optimizer's Bayesian search reliable.
Architecture: DSPy Compiler + Target Model + Judge Model on Spheron
The production DSPy setup has three components:
DSPy compiler script. Runs on any Python environment, including your laptop or a small CPU VM. It orchestrates the optimization loop: proposes instruction candidates, assembles configurations, evaluates them via the metric function, and updates the Bayesian search state. It makes inference calls to the target model endpoint but does not itself require a GPU.
Target model. A self-hosted LLM running on an H100 GPU rental on Spheron. This is the model being optimized: the one whose prompts and few-shot demos MIPROv2 is tuning. For most production pipelines, Llama 4 Scout or Qwen 3.6 Plus is the right starting point.
Judge model (optional but recommended). A separate LLM, typically on its own A100 instance, used as the metric function for open-ended generation tasks. The judge must be a separate deployment from the target model to avoid self-preference bias (a model consistently rating its own outputs higher than a neutral judge would). For factual tasks with ground-truth labels, a plain Python metric function is sufficient and you can skip the judge model entirely.
The compiler script talks to both the target model and the judge model via OpenAI-compatible HTTP endpoints. Both endpoints can run on the same Spheron private network, keeping inference round-trip latency low.
Deploying Your DSPy Backend Models on Spheron
Llama 4 Scout as the Target Model
Llama 4 Scout is a 17B-active-parameter MoE model (109B total). All expert weights must reside in VRAM for routing, so the full model in BF16 requires ~218GB, well above a single GPU. With INT4 quantization (~55GB), it fits on a single H100 PCIe (80GB). For full deployment steps, see the Llama 4 GPU deployment guide. For the DSPy compiler, configure the endpoint with:
```python
import dspy

# Point DSPy at your self-hosted vLLM endpoint
lm = dspy.LM(
    model="openai/meta-llama/Llama-4-Scout-17B-16E-Instruct",
    api_base="http://<your-spheron-instance-ip>:8000/v1",
    api_key="none",
    temperature=0.0,  # deterministic for optimizer reliability
    max_tokens=1024,
)
dspy.configure(lm=lm)
```

Llama 4 Scout's strong instruction-following performance makes it a good optimization target: MIPROv2's proposed instruction candidates are diverse enough to stress-test the model's ability to follow precise formatting constraints, and Scout handles this well.
Qwen 3.6 Plus for Instruction-Following Tasks
Qwen 3.6 Plus is particularly strong at following complex, multi-constraint instructions, which is valuable when COPRO or MIPROv2 generates elaborate prompt candidates with multiple output format requirements. See the Qwen 3.6 Plus deployment guide for vLLM setup. The DSPy configuration is identical in structure:
```python
lm = dspy.LM(
    model="openai/Qwen/Qwen3.6Plus",
    api_base="http://<your-spheron-instance-ip>:8000/v1",
    api_key="none",
    temperature=0.0,
    max_tokens=1024,
)
dspy.configure(lm=lm)
```

GPT-OSS 120B for Highest Accuracy Pipelines
GPT-OSS 120B is the highest-capacity open-weight option in this comparison. Its MoE architecture keeps active parameter count around 5.1B per forward pass, which means it fits on a single H100 PCIe at MXFP4. For tasks where answer accuracy is the dominant metric and you have the GPU budget for it, GPT-OSS 120B as the optimization target typically yields the highest post-optimization scores. See the GPT-OSS deployment guide for setup. For serving at full capacity or with larger context windows, an H200 GPU rental gives more headroom.
Running MIPROv2 on a RAG Pipeline: Full Walkthrough
Dataset and Metric Setup
MIPROv2 needs a labeled training set and a metric function that returns a float between 0 and 1. For a RAG pipeline, each training example needs a question and an expected answer.
```python
import dspy
from datasets import load_dataset

# Load HotpotQA as a quick starting dataset
hotpot = load_dataset("hotpot_qa", "fullwiki", split="train[:200]")

trainset = [
    dspy.Example(
        question=row["question"],
        expected_answer=row["answer"],
    ).with_inputs("question")
    for row in hotpot
]
devset = trainset[:50]
trainset = trainset[50:]

# LLM-as-judge metric (see /blog/llm-as-judge-evaluation-pipeline-gpu-cloud/ for full setup)
judge_lm = dspy.LM(
    model="openai/meta-llama/Llama-4-Scout-17B-16E-Instruct",
    api_base="http://<judge-instance-ip>:8000/v1",
    api_key="none",
    temperature=0.0,
)

def answer_correctness_metric(example, prediction, trace=None):
    """Score answer correctness with an LLM judge. Returns float 0-1."""
    with dspy.context(lm=judge_lm):
        verdict = dspy.Predict("question, gold_answer, predicted_answer -> correct: bool")(
            question=example.question,
            gold_answer=example.expected_answer,
            predicted_answer=prediction.answer,
        )
    return float(verdict.correct)
```

For building the full judge pipeline, the guide on LLM-as-judge evaluation on GPU cloud covers rubric design, bias mitigation, and vLLM deployment for the judge endpoint.
Defining the RAG Module
```python
import dspy
import faiss
import numpy as np

class QueryRewriter(dspy.Signature):
    """Rewrite the question as a concise search query."""

    question: str = dspy.InputField()
    search_query: str = dspy.OutputField()

class AnswerWithContext(dspy.Signature):
    """Answer using only the retrieved context. Be concise and factual."""

    question: str = dspy.InputField()
    context: str = dspy.InputField(desc="Retrieved passages, one per line")
    answer: str = dspy.OutputField(desc="One to three sentence answer")
    citations: list[str] = dspy.OutputField(desc="List of source passage IDs used")

class RAGPipeline(dspy.Module):
    def __init__(self, index, passages, embed_fn):
        super().__init__()
        self.index = index
        self.passages = passages
        self.embed = embed_fn
        self.rewrite = dspy.ChainOfThought(QueryRewriter)
        self.answer = dspy.ChainOfThought(AnswerWithContext)

    def forward(self, question: str) -> dspy.Prediction:
        rewritten = self.rewrite(question=question)
        query_vec = self.embed(rewritten.search_query)
        _, ids = self.index.search(np.array([query_vec]), k=5)
        context = "\n".join(self.passages[i] for i in ids[0] if i >= 0)
        return self.answer(question=question, context=context)

rag = RAGPipeline(index=faiss_index, passages=passage_list, embed_fn=embed)
```

Running the Optimizer
```python
optimizer = dspy.MIPROv2(
    metric=answer_correctness_metric,
    auto="medium",  # controls num_candidates and num_trials automatically
    num_threads=8,  # parallel eval threads (tune to your GPU throughput)
    verbose=True,
)

compiled_rag = optimizer.compile(
    rag,
    trainset=trainset,
    num_trials=50,  # Bayesian search iterations
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    eval_kwargs={"num_threads": 8, "display_progress": True},
)

compiled_rag.save("optimized_rag_v1.json")
```

With Llama 4 Scout as the target model on an H100 PCIe (INT4), a 2-module pipeline with 100 training examples and num_trials=50 takes roughly 45-90 minutes wall-clock, depending on average response length. A 5-module pipeline scales to 3-6 hours.
Interpreting Optimization Results
After compilation, inspect the optimized prompts and demonstrations:
```python
# View the optimized instruction for the query rewriter
print(compiled_rag.rewrite.extended_signature.instructions)

# View bootstrapped few-shot examples
for demo in compiled_rag.rewrite.demos:
    print(f"Q: {demo.question}")
    print(f"Query: {demo.search_query}\n")

# Evaluate on dev set
from dspy.evaluate import Evaluate

evaluate = Evaluate(devset=devset, metric=answer_correctness_metric, num_threads=8)
baseline_score = evaluate(rag)
optimized_score = evaluate(compiled_rag)
print(f"Baseline: {baseline_score:.1%}")
print(f"Optimized: {optimized_score:.1%}")
```

In practice, MIPROv2 consistently delivers 10-20 percentage point metric improvements on RAG pipelines with 100+ training examples. A typical result: RAG answer correctness from 61.2% to 74.8% on HotpotQA after MIPROv2 with Llama 4 Scout as the target and a separately hosted Llama 4 Scout instance as the judge, with the optimized prompts emphasizing evidence citation and direct factual answers over elaboration.
Cost Breakdown: Self-Hosted vs Commercial API
Running MIPROv2 on a 2-module RAG pipeline with 100 training examples and num_trials=50:
| Approach | Inference calls | Avg tokens/call | Total tokens | Approx cost | Wall-clock time |
|---|---|---|---|---|---|
| GPT-4o API | ~5,000 | 2,000 | 10M | $50-$100 | 3-8 hrs (rate limits) |
| Claude 3.5 Sonnet API | ~5,000 | 2,000 | 10M | $40-$90 | 3-8 hrs (rate limits) |
| Self-hosted Llama 4 Scout, H100 PCIe ($2.01/hr) | ~5,000 | 2,000 | 10M | ~$2-$5 | 45-90 min |
| Self-hosted Qwen 3.6 Plus, H100 PCIe ($2.01/hr) | ~5,000 | 2,000 | 10M | ~$5-$10 | 60-120 min |
Self-hosting cuts cost by roughly an order of magnitude and wall-clock time by 3-5x by eliminating API rate-limit queuing.
Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Runtime Constraints and Metric-Driven Refinement
DSPy 3.x removed the deprecated dspy.Assert and dspy.Suggest APIs and replaced them with two module wrappers: dspy.Refine for hard constraints and dspy.BestOfN for soft quality scoring.
dspy.Refine wraps a module and retries the forward call until the reward function is satisfied or the attempt limit is reached. dspy.BestOfN runs the module N times and returns the output with the highest reward score. Both take a reward_fn that maps inputs and a prediction to a float between 0 and 1.
```python
import dspy
import numpy as np

def query_conciseness_reward(args, prediction, trace=None):
    """Returns 1.0 if the search query is no longer than the input question."""
    question = args.get("question", "")
    return float(len(prediction.search_query.split()) <= len(question.split()))

def answer_conciseness_reward(args, prediction, trace=None):
    """Returns 1.0 if the answer is under 100 words."""
    return float(len(prediction.answer.split()) < 100)

class RAGWithConstraints(dspy.Module):
    def __init__(self, index, passages, embed_fn):
        super().__init__()
        self.index = index
        self.passages = passages
        self.embed = embed_fn
        self.rewrite = dspy.Refine(
            module=dspy.ChainOfThought(QueryRewriter),
            N=3,
            reward_fn=query_conciseness_reward,
        )
        self.answer = dspy.BestOfN(
            module=dspy.ChainOfThought(AnswerWithContext),
            N=3,
            reward_fn=answer_conciseness_reward,
        )

    def forward(self, question: str) -> dspy.Prediction:
        rewritten = self.rewrite(question=question)
        query_vec = self.embed(rewritten.search_query)
        _, ids = self.index.search(np.array([query_vec]), k=5)
        context = "\n".join(self.passages[i] for i in ids[0] if i >= 0)
        return self.answer(question=question, context=context)
```

Use dspy.Refine when a constraint is structural and must be satisfied before the result is usable. Use dspy.BestOfN when any output is acceptable but you want the highest-scoring one from multiple samples.
For building the judge metric that feeds reward scoring, the guide on LLM-as-judge evaluation pipelines on GPU cloud covers the full evaluation infrastructure.
DSPy vs LangChain vs LlamaIndex: When Each Wins
| Capability | DSPy | LangChain | LlamaIndex |
|---|---|---|---|
| Manual prompt control | Low (abstracted away) | High | Medium |
| Automatic prompt optimization | Yes (core feature) | No | Limited |
| Built-in RAG abstractions | Basic | Extensive | Extensive |
| Multi-agent orchestration | Via dspy.ReAct | Via LangGraph | Via agent workflows |
| Learning curve | Medium-high | Medium | Medium |
| Best for | Systematic optimization of compound pipelines | Flexible chaining with large ecosystem | Data ingestion, indexing, and retrieval |
DSPy is not a replacement for LangChain or LlamaIndex. Use DSPy when you have a compound pipeline with a measurable metric and you need systematic optimization. Use LangChain when you need flexibility, a large third-party integration ecosystem, or established agent orchestration patterns. Use LlamaIndex when data ingestion and retrieval quality are the main concern.
Many production teams use all three: LlamaIndex for the data pipeline, LangChain for orchestration, and DSPy to optimize the prompts within each module.
Production Deployment Patterns
Compiled Program Artifacts and Versioning
DSPy programs serialize to JSON. Treat these files as model artifacts and version them alongside the base model checkpoint.
```python
# Save with metadata
compiled_rag.save("optimized_rag_v1.json")

# Recommended naming: model, date, dev-set metric score
# e.g., optimized_rag_llama4scout_20260423_748.json
# (748 = 74.8% dev metric)
```

Load in production:
```python
rag = RAGPipeline(index=index, passages=passages, embed_fn=embed)
rag.load("optimized_rag_llama4scout_20260423_748.json")
```

Store versioned artifacts in S3 or a model registry. When the base model is updated or fine-tuned, the compiled artifact needs reoptimization against the new checkpoint, since the frozen prompts were tuned to the previous model's behavior.
Serving a Compiled DSPy Program
A compiled DSPy program is a stateless callable. At serve time, it makes inference calls to the target LLM endpoint and nothing else. No optimizer dependencies are needed at inference time.
Wrap it in FastAPI:
```python
import asyncio

import dspy
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QuestionRequest(BaseModel):
    question: str

# Load once at startup
lm = dspy.LM(
    model="openai/meta-llama/Llama-4-Scout-17B-16E-Instruct",
    api_base="http://llm-endpoint:8000/v1",
    api_key="none",
)
dspy.configure(lm=lm)

rag = RAGPipeline(index=index, passages=passages, embed_fn=embed)
rag.load("optimized_rag_llama4scout_20260423_748.json")

@app.post("/answer")
async def answer(request: QuestionRequest):
    # Wrap in dspy.context so each thread gets its own LM reference,
    # preventing bleed from concurrent dspy.context() overrides in other threads.
    def run_rag():
        with dspy.context(lm=lm):
            return rag(question=request.question)

    result = await asyncio.to_thread(run_rag)
    return {"answer": result.answer, "citations": result.citations}
```

For multi-model pipelines or queue-depth autoscaling, wrap the compiled program in a Ray Serve deployment instead. The compiled DSPy program maps cleanly onto a Ray Serve Deployment class: it is stateless, CPU-bound at serve time, and its only external dependency is the vLLM endpoint.
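A minimal Ray Serve wrapper might look like the following sketch. The replica count, resource options, and the llm-endpoint hostname are illustrative assumptions, and RAGPipeline plus its index/passages/embed dependencies are assumed to be importable as in the earlier examples:

```python
import dspy
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class CompiledRAGService:
    def __init__(self, artifact_path: str):
        # One LM client per replica; the GPU stays behind the vLLM endpoint.
        self.lm = dspy.LM(
            model="openai/meta-llama/Llama-4-Scout-17B-16E-Instruct",
            api_base="http://llm-endpoint:8000/v1",
            api_key="none",
        )
        self.rag = RAGPipeline(index=index, passages=passages, embed_fn=embed)
        self.rag.load(artifact_path)

    async def __call__(self, request):
        body = await request.json()
        with dspy.context(lm=self.lm):
            result = self.rag(question=body["question"])
        return {"answer": result.answer, "citations": result.citations}

# Bind the artifact path and serve; Ray scales replicas behind one HTTP route.
rag_app = CompiledRAGService.bind("optimized_rag_llama4scout_20260423_748.json")
# serve.run(rag_app, route_prefix="/answer")
```

Scaling out is then a matter of raising num_replicas; each replica holds its own copy of the compiled program and fans requests into the shared vLLM endpoint.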
Drift Monitoring and Re-Optimization
Production LLM responses drift as models are updated or fine-tuned. Monitor your production metric in a sliding window.
- Route 1-2% of production queries through the same LLM-as-judge metric used during optimization.
- Track the 7-day rolling average of the judge score.
- When the rolling metric drops 5%+ from the post-optimization baseline, trigger a re-optimization job on Spheron spot H100s.
- Load the current checkpoint's compiled artifact as the starting program and run MIPROv2 with the new training data collected since the last optimization.
Keep optimization runs on spot instances to minimize cost. The optimization script checkpoints to disk after each trial, so spot preemption only costs the current trial, not the full run.
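The rolling-window check itself can be sketched in a few lines of plain Python. The window size and 5% threshold below are illustrative, and the class is a hypothetical helper rather than a DSPy API:

```python
from collections import deque

class DriftMonitor:
    """Track sampled judge scores in a sliding window; flag drops below baseline."""

    def __init__(self, baseline: float, window: int = 1000, max_drop: float = 0.05):
        self.baseline = baseline           # post-optimization dev metric, e.g. 0.748
        self.scores = deque(maxlen=window) # sliding window of recent judge scores
        self.max_drop = max_drop           # relative drop that triggers re-optimization

    def record(self, score: float) -> bool:
        """Record one sampled judge score; return True if re-optimization should fire."""
        self.scores.append(score)
        rolling = sum(self.scores) / len(self.scores)
        return rolling < self.baseline * (1 - self.max_drop)

monitor = DriftMonitor(baseline=0.748, window=10)
healthy = [monitor.record(0.75) for _ in range(10)]  # rolling avg 0.75: no trigger
drifted = [monitor.record(0.60) for _ in range(10)]  # avg sinks toward 0.60: trigger fires
```

In production the window would be time-based (the 7-day rolling average above) rather than count-based, but the trigger logic is the same.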
End-to-End Cost: Optimizing a 5-Module Pipeline on Spheron H100s
Concrete example: 5-module pipeline, 100 training examples, MIPROv2 with num_trials=50.
Inference call estimate:
- Proposal phase: 5 modules × 20 candidate instructions × 10 evaluation examples each = 1,000 calls
- Evaluation phase: 50 trials × 100 training examples × 5 modules = 25,000 calls
- Bootstrap demos: 5 modules × 50 bootstrapped traces = 250 calls
- Total: ~26,000 inference calls
Token estimate:
- Average 2,000 tokens per call (includes RAG context)
- Total: 52M tokens
Cost on Spheron H100 PCIe ($2.01/GPU-hr on-demand):
- Llama 4 Scout throughput: ~2,000 tokens/sec on H100 PCIe (INT4)
- GPU time: 52M / (2,000 × 3,600) = ~7.2 GPU-hours
- On-demand cost: 7.2 × $2.01 = $14.47
Equivalent GPT-4o API cost (at ~$10/1M tokens blended rate):
- 52M tokens × $10/1M = ~$520
The 36x cost difference is why self-hosted inference makes DSPy's MIPROv2 practical for production optimization loops, not just one-time experiments.
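The arithmetic above generalizes to any throughput and hourly rate. A small back-of-envelope helper (hypothetical, for planning only):

```python
def gpu_optimization_cost(total_tokens: float, tokens_per_sec: float,
                          hourly_rate: float) -> tuple[float, float]:
    """Return (gpu_hours, dollar_cost) for an optimization run at a given throughput."""
    gpu_hours = total_tokens / (tokens_per_sec * 3600)
    return gpu_hours, gpu_hours * hourly_rate

# 52M tokens at 2,000 tok/s on an H100 PCIe at $2.01/hr
hours, cost = gpu_optimization_cost(52_000_000, 2_000, 2.01)
# -> ~7.2 GPU-hours, ~$14.5

# Equivalent API spend at a $10/1M-token blended rate
api_cost = 52_000_000 / 1_000_000 * 10  # -> $520.0
```

Plugging in your own pipeline's call counts and average token length gives a quick go/no-go estimate before launching a compile run.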
Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
DSPy turns optimization from a weekend manual exercise into a systematic compilation pass. The bottleneck is inference throughput during the MIPROv2 search, which is exactly where Spheron spot GPUs pay off.
Rent H100 → | Rent H200 → | Rent A100 → | View all pricing →
