Tutorial

DSPy on GPU Cloud: Self-Optimizing LLM Pipelines in Production (2026)

Written by Mitrasish, Co-founder · Apr 25, 2026

Prompt engineering does not scale. You tune the retrieval prompt, the answer prompt breaks. You fix the answer prompt, the citation prompt regresses. In a compound pipeline with five or more modules, manual tuning is a never-ending whack-a-mole game with no systematic exit condition.

DSPy gives you the exit condition: treat the pipeline as a program, define a metric, and let the optimizer find the prompts automatically.

Why Compound AI Systems Break Prompt Engineering

A single-module LLM call is easy to tune. You write a prompt, run it on 20 examples, adjust the wording, and ship. The feedback loop is short enough to manage by hand.

Compound pipelines change the math. A RAG system with a query rewriter, retriever, context ranker, answer generator, and citation verifier has at least five prompts that interact. Improving the query rewriter changes what the retriever sees, which changes what the context ranker scores, which changes what the answer generator receives. You cannot optimize one module without touching the others.

The combinatorial surface is also enormous. If each of five modules has 10 reasonable prompt candidates, the search space is 10^5 configurations. Systematic search through that space by hand is not practical.

DSPy's thesis is that LLM pipeline construction is a compilation and optimization problem, not a manual engineering problem. You define typed Signatures (what goes in, what comes out), assemble them into Modules, write a metric function, and let an optimizer like MIPROv2 search over prompt candidates and few-shot demonstrations end-to-end.

DSPy 3.x Core Primitives

DSPy 3.x reorganized its API around three concepts that map cleanly onto the compile-and-optimize mental model.

Signatures

A Signature is a typed input/output contract for a single LLM call. It specifies what fields go in, what fields come out, and a natural-language description for each field that seeds the prompt template.

python
import dspy

class QuestionToContext(dspy.Signature):
    """Rewrite the user question as a precise search query."""
    question: str = dspy.InputField(desc="The original user question")
    search_query: str = dspy.OutputField(desc="A precise search query for retrieval")

class ContextToAnswer(dspy.Signature):
    """Answer the question using only the provided context."""
    question: str = dspy.InputField(desc="The original user question")
    context: str = dspy.InputField(desc="Retrieved passages from the knowledge base")
    answer: str = dspy.OutputField(desc="A concise, factual answer citing the context")
    citations: list[str] = dspy.OutputField(desc="List of source passage IDs used")

The field descriptions are seed prompt text. The optimizer treats them as starting points and searches for better instruction text during compilation.

Modules

A Module is a composable unit that wraps one or more Signatures and defines how they chain together. DSPy ships several built-in module wrappers; the three you will reach for most often are:

  • dspy.Predict: direct LLM call, no chain-of-thought
  • dspy.ChainOfThought: adds a reasoning field before the output
  • dspy.ReAct: tool-augmented agent loop

Modules compose naturally: a Module can contain other Modules, and the optimizer traces through the full graph.

python
class RAGPipeline(dspy.Module):
    def __init__(self, retrieve_fn):
        super().__init__()
        self.retrieve = retrieve_fn
        self.rewrite = dspy.ChainOfThought(QuestionToContext)
        self.answer = dspy.ChainOfThought(ContextToAnswer)

    def forward(self, question: str) -> dspy.Prediction:
        rewritten = self.rewrite(question=question)
        passages = self.retrieve(rewritten.search_query, k=5)
        context = "\n\n".join(passages)
        return self.answer(question=question, context=context)
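
As a quick smoke test before any optimization, you can call the module directly and inspect the exact prompt DSPy generated with dspy.inspect_history. This sketch assumes an LM has already been registered via dspy.configure (shown later in this guide) and uses a hypothetical stub for retrieve_fn:

python
import dspy

# Hypothetical retrieval stub for illustration; swap in your vector-store lookup.
def retrieve_fn(query: str, k: int = 5) -> list[str]:
    return [f"[passage {i}] placeholder text about: {query}" for i in range(k)]

rag = RAGPipeline(retrieve_fn=retrieve_fn)
prediction = rag(question="Who introduced the transformer architecture?")
print(prediction.answer)
print(prediction.citations)

# Print the most recent prompt/completion pair DSPy sent to the LM
dspy.inspect_history(n=1)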

Optimizers: BootstrapFewShot, COPRO, and MIPROv2

DSPy ships three main optimizers with different tradeoffs:

| Optimizer | Optimizes | Training data needed | Compute cost | When to use |
|---|---|---|---|---|
| BootstrapFewShot | Demonstrations (few-shot examples) | 20+ examples | Low | Quick wins, limited data |
| COPRO | Instructions (prompt text) | 50+ examples | Medium | When instruction quality is the bottleneck |
| MIPROv2 | Instructions + demonstrations jointly | 100+ examples | High | Best overall quality, sufficient data |

BootstrapFewShot generates few-shot demonstrations by running the pipeline on training examples and keeping correct traces. It is fast, requires minimal data, and often gives a 5-15% metric improvement with no hyperparameter tuning.
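
A minimal BootstrapFewShot sketch, assuming the RAG module, trainset, and answer_correctness_metric defined later in this walkthrough; the keyword arguments shown are the common ones, not an exhaustive list:

python
import dspy

# BootstrapFewShot only tunes demonstrations, so a run is cheap enough to repeat freely.
bootstrap = dspy.BootstrapFewShot(
    metric=answer_correctness_metric,  # same metric reused for MIPROv2 below
    max_bootstrapped_demos=4,          # demos harvested from correct pipeline traces
    max_labeled_demos=4,               # demos copied directly from labeled trainset rows
)

bootstrapped_rag = bootstrap.compile(rag, trainset=trainset)
bootstrapped_rag.save("bootstrapped_rag_v0.json")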

COPRO uses a meta-LLM to propose and evaluate alternative instruction text for each module. It is slower than BootstrapFewShot but produces cleaner prompts that generalize better to out-of-distribution inputs.

MIPROv2 (Multi-prompt Instruction PRoposal Optimizer v2) combines both approaches. It runs a Bayesian search over a joint space of instruction candidates and demonstration sets, using the metric function to score each configuration on a held-out dev set. For a well-specified pipeline with 100+ training examples, MIPROv2 consistently outperforms the other two optimizers.

Why Self-Hosted LLMs Are Non-Negotiable for Serious Optimization

Three concrete reasons self-hosting matters for DSPy optimization.

Cost. MIPROv2 with num_trials=50 on a 5-module pipeline burns 2,000-8,000 inference calls per compilation run. At commercial API rates for a 70B-class model, a single optimization run costs $50-$300 depending on average context length. At those prices, you iterate once and call it done. On a self-hosted H100 GPU rental on Spheron, the same run costs $8-$40 in GPU time. You can afford to iterate. For a concrete cost comparison, see the GPU cost-per-token benchmarks covering on-demand vs self-hosted inference economics.

Rate limits. Commercial APIs throttle high-throughput optimization loops. A 5-module pipeline evaluating 100 training examples across 50 trials generates bursts of hundreds of concurrent requests. Most commercial API tiers queue these requests, turning a 90-minute optimization run into 6+ hours of wall-clock time. Self-hosted endpoints have no per-account rate limits: the only throttle is GPU memory and compute.

Reproducibility. Self-hosted models served with temperature=0 give near-deterministic outputs on an inference stack you control. When comparing prompt candidate A against prompt candidate B, you need the inference engine to be a constant. Commercial APIs offer no such guarantee: sampling randomness, silent model version updates, and infrastructure-level changes all inject noise into the comparison signal. Stable, repeatable inference is what makes the optimizer's Bayesian search reliable.

Architecture: DSPy Compiler + Target Model + Judge Model on Spheron

The production DSPy setup has three components:

DSPy compiler script. Runs on any Python environment, including your laptop or a small CPU VM. It orchestrates the optimization loop: proposes instruction candidates, assembles configurations, evaluates them via the metric function, and updates the Bayesian search state. It makes inference calls to the target model endpoint but does not itself require a GPU.

Target model. A self-hosted LLM running on an H100 GPU rental on Spheron. This is the model being optimized: the one whose prompts and few-shot demos MIPROv2 is tuning. For most production pipelines, Llama 4 Scout or Qwen 3.6 Plus is the right starting point.

Judge model (optional but recommended). A separate LLM running on an A100 instance, used as the metric function for open-ended generation tasks. The judge model should be separate from the target model to avoid self-preference bias (a model consistently rating its own outputs higher than a neutral judge would). For factual tasks with ground-truth labels, a plain Python metric function is sufficient and you can skip the judge model entirely.

The compiler script talks to both the target model and the judge model via OpenAI-compatible HTTP endpoints. Both endpoints can run on the same Spheron private network, keeping inference round-trip latency low.

Deploying Your DSPy Backend Models on Spheron

Llama 4 Scout as the Target Model

Llama 4 Scout is a 17B-active-parameter MoE model (109B total). All expert weights must reside in VRAM for routing, so the full model in BF16 requires ~218GB, well above a single GPU. With INT4 quantization (~55GB), it fits on a single H100 PCIe (80GB). For full deployment steps, see the Llama 4 GPU deployment guide. For the DSPy compiler, configure the endpoint with:

python
import dspy

# Point DSPy at your self-hosted vLLM endpoint
lm = dspy.LM(
    model="openai/meta-llama/Llama-4-Scout-17B-16E-Instruct",
    api_base="http://<your-spheron-instance-ip>:8000/v1",
    api_key="none",
    temperature=0.0,  # deterministic for optimizer reliability
    max_tokens=1024,
)
dspy.configure(lm=lm)

Llama 4 Scout's strong instruction-following performance makes it a good optimization target: MIPROv2's proposed instruction candidates are diverse enough to stress-test the model's ability to follow precise formatting constraints, and Scout handles this well.

Qwen 3.6 Plus for Instruction-Following Tasks

Qwen 3.6 Plus is particularly strong at following complex, multi-constraint instructions, which is valuable when COPRO or MIPROv2 generates elaborate prompt candidates with multiple output format requirements. See the Qwen 3.6 Plus deployment guide for vLLM setup. The DSPy configuration is identical in structure:

python
lm = dspy.LM(
    model="openai/Qwen/Qwen3.6Plus",
    api_base="http://<your-spheron-instance-ip>:8000/v1",
    api_key="none",
    temperature=0.0,
    max_tokens=1024,
)
dspy.configure(lm=lm)

GPT-OSS 120B for Highest Accuracy Pipelines

GPT-OSS 120B is the highest-capacity open-weight option in this comparison. Its MoE architecture activates only about 5.1B parameters per forward pass, and with MXFP4 weights the full model fits on a single H100 PCIe (80GB). For tasks where answer accuracy is the dominant metric and you have the GPU budget for it, GPT-OSS 120B as the optimization target typically yields the highest post-optimization scores. See the GPT-OSS deployment guide for setup. For serving at full capacity or with larger context windows, an H200 GPU rental gives more headroom.
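
The DSPy endpoint configuration follows the same pattern as the other two models. The model identifier below assumes vLLM is serving the openai/gpt-oss-120b checkpoint from Hugging Face (the doubled openai/ is the LiteLLM provider prefix plus the repo id):

python
lm = dspy.LM(
    model="openai/openai/gpt-oss-120b",  # provider prefix + HF repo id served by vLLM
    api_base="http://<your-spheron-instance-ip>:8000/v1",
    api_key="none",
    temperature=0.0,
    max_tokens=1024,
)
dspy.configure(lm=lm)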

Running MIPROv2 on a RAG Pipeline: Full Walkthrough

Dataset and Metric Setup

MIPROv2 needs a labeled training set and a metric function that returns a float between 0 and 1. For a RAG pipeline, each training example needs a question and an expected answer.

python
import dspy
from datasets import load_dataset

# Load HotpotQA as a quick starting dataset
hotpot = load_dataset("hotpot_qa", "fullwiki", split="train[:200]")

trainset = [
    dspy.Example(
        question=row["question"],
        expected_answer=row["answer"]
    ).with_inputs("question")
    for row in hotpot
]
devset = trainset[:50]
trainset = trainset[50:]

# LLM-as-judge metric (see /blog/llm-as-judge-evaluation-pipeline-gpu-cloud/ for full setup)
judge_lm = dspy.LM(
    model="openai/meta-llama/Llama-4-Scout-17B-16E-Instruct",
    api_base="http://<judge-instance-ip>:8000/v1",
    api_key="none",
    temperature=0.0,
)

def answer_correctness_metric(example, prediction, trace=None):
    """Score answer correctness with an LLM judge. Returns float 0-1."""
    with dspy.context(lm=judge_lm):
        verdict = dspy.Predict("question, gold_answer, predicted_answer -> correct: bool")(
            question=example.question,
            gold_answer=example.expected_answer,
            predicted_answer=prediction.answer,
        )
    return float(verdict.correct)

For building the full judge pipeline, the guide on LLM-as-judge evaluation on GPU cloud covers rubric design, bias mitigation, and vLLM deployment for the judge endpoint.

Defining the RAG Module

python
import dspy
import faiss
import numpy as np

class QueryRewriter(dspy.Signature):
    """Rewrite the question as a concise search query."""
    question: str = dspy.InputField()
    search_query: str = dspy.OutputField()

class AnswerWithContext(dspy.Signature):
    """Answer using only the retrieved context. Be concise and factual."""
    question: str = dspy.InputField()
    context: str = dspy.InputField(desc="Retrieved passages, one per line")
    answer: str = dspy.OutputField(desc="One to three sentence answer")
    citations: list[str] = dspy.OutputField(desc="List of source passage IDs used")

class RAGPipeline(dspy.Module):
    def __init__(self, index, passages, embed_fn):
        super().__init__()
        self.index = index
        self.passages = passages
        self.embed = embed_fn
        self.rewrite = dspy.ChainOfThought(QueryRewriter)
        self.answer = dspy.ChainOfThought(AnswerWithContext)

    def forward(self, question: str) -> dspy.Prediction:
        rewritten = self.rewrite(question=question)
        query_vec = self.embed(rewritten.search_query)
        _, ids = self.index.search(np.array([query_vec]), k=5)
        context = "\n".join(self.passages[i] for i in ids[0] if i >= 0)
        return self.answer(question=question, context=context)

rag = RAGPipeline(index=faiss_index, passages=passage_list, embed_fn=embed)
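
The retrieval pieces passed into the constructor (faiss_index, passage_list, and embed) are assumed to exist already. Here is a minimal sketch of one way to build them with sentence-transformers and a flat FAISS index; the corpus and encoder choice are illustrative:

python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical corpus; in practice these passages come from your document chunker.
passage_list = [
    "Passage about topic A ...",
    "Passage about topic B ...",
]

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed(text: str) -> np.ndarray:
    """Return a single normalized float32 embedding for one string."""
    return encoder.encode(text, normalize_embeddings=True).astype("float32")

# Inner-product search over normalized vectors is cosine similarity.
dim = embed("dimension probe").shape[0]
faiss_index = faiss.IndexFlatIP(dim)
faiss_index.add(np.stack([embed(p) for p in passage_list]))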

Running the Optimizer

python
optimizer = dspy.MIPROv2(
    metric=answer_correctness_metric,
    auto=None,              # disable the auto preset so num_candidates/num_trials can be set explicitly
    num_candidates=10,      # instruction/demo candidates proposed per module
    num_threads=8,          # parallel eval threads (tune to your GPU throughput)
    verbose=True,
)

compiled_rag = optimizer.compile(
    rag,
    trainset=trainset,
    num_trials=50,          # Bayesian search iterations
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
)

compiled_rag.save("optimized_rag_v1.json")

With Llama 4 Scout as the target model on an H100 PCIe (INT4), a 2-module pipeline with 100 training examples and num_trials=50 takes roughly 45-90 minutes wall-clock, depending on average response length. A 5-module pipeline scales to 3-6 hours.

Interpreting Optimization Results

After compilation, inspect the optimized prompts and demonstrations:

python
# View the optimized instructions and bootstrapped demos for every predictor in the pipeline
for name, predictor in compiled_rag.named_predictors():
    print(f"=== {name} ===")
    print(predictor.signature.instructions)
    for demo in predictor.demos:
        print(demo)

# Evaluate on dev set
from dspy.evaluate import Evaluate
evaluate = Evaluate(devset=devset, metric=answer_correctness_metric, num_threads=8)
baseline_score = evaluate(rag)
optimized_score = evaluate(compiled_rag)
print(f"Baseline: {baseline_score:.1%}")
print(f"Optimized: {optimized_score:.1%}")

In practice, MIPROv2 typically delivers 10-20 percentage point metric improvements on RAG pipelines with 100+ training examples. A representative result: RAG answer correctness rising from 61.2% to 74.8% on HotpotQA after MIPROv2 with Llama 4 Scout as the target and a separate Llama 4 Scout instance as the judge, with the optimized prompts emphasizing evidence citation and direct factual answers over elaboration.

Cost Breakdown: Self-Hosted vs Commercial API

Running MIPROv2 on a 2-module RAG pipeline with 100 training examples and num_trials=50:

| Approach | Inference calls | Avg tokens/call | Total tokens | Approx cost | Wall-clock time |
|---|---|---|---|---|---|
| GPT-4o API | ~5,000 | 2,000 | 10M | $50-$100 | 3-8 hrs (rate limits) |
| Claude 3.5 Sonnet API | ~5,000 | 2,000 | 10M | $40-$90 | 3-8 hrs (rate limits) |
| Self-hosted Llama 4 Scout, H100 PCIe ($2.01/hr) | ~5,000 | 2,000 | 10M | ~$2-$5 | 45-90 min |
| Self-hosted Qwen 3.6 Plus, H100 PCIe ($2.01/hr) | ~5,000 | 2,000 | 10M | ~$5-$10 | 60-120 min |

Self-hosting cuts the cost of a single optimization run by roughly an order of magnitude and reduces wall-clock time by 3-5x by eliminating API rate-limit queuing.

Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Runtime Constraints and Metric-Driven Refinement

DSPy 3.x removes the dspy.Assert and dspy.Suggest APIs and replaces them with two module wrappers: dspy.Refine for hard constraints and dspy.BestOfN for soft quality scoring.

dspy.Refine wraps a module and retries the forward call until the reward function is satisfied or the attempt limit is reached. dspy.BestOfN runs the module N times and returns the output with the highest reward score. Both take a reward_fn that maps the module's inputs and its prediction to a float between 0 and 1, plus a threshold that defines when a reward is good enough to stop early.

python
import dspy
import numpy as np

def query_conciseness_reward(args, prediction, trace=None):
    """Returns 1.0 if the search query is no longer than the input question."""
    question = args.get("question", "")
    return float(len(prediction.search_query.split()) <= len(question.split()))

def answer_conciseness_reward(args, prediction, trace=None):
    """Returns 1.0 if the answer is under 100 words."""
    return float(len(prediction.answer.split()) < 100)

class RAGWithConstraints(dspy.Module):
    def __init__(self, index, passages, embed_fn):
        super().__init__()
        self.index = index
        self.passages = passages
        self.embed = embed_fn
        self.rewrite = dspy.Refine(
            module=dspy.ChainOfThought(QueryRewriter),
            N=3,
            reward_fn=query_conciseness_reward,
            threshold=1.0,  # retry until the reward reaches 1.0 or N attempts are used
        )
        self.answer = dspy.BestOfN(
            module=dspy.ChainOfThought(AnswerWithContext),
            N=3,
            reward_fn=answer_conciseness_reward,
            threshold=1.0,  # stop early if a sample meets the threshold, else return the best of N
        )

    def forward(self, question: str) -> dspy.Prediction:
        rewritten = self.rewrite(question=question)
        query_vec = self.embed(rewritten.search_query)
        _, ids = self.index.search(np.array([query_vec]), k=5)
        context = "\n".join(self.passages[i] for i in ids[0] if i >= 0)
        return self.answer(question=question, context=context)

Use dspy.Refine when a constraint is structural and must be satisfied before the result is usable. Use dspy.BestOfN when any output is acceptable but you want the highest-scoring one from multiple samples.

For building the judge metric that feeds reward scoring, the guide on LLM-as-judge evaluation pipelines on GPU cloud covers the full evaluation infrastructure.

DSPy vs LangChain vs LlamaIndex: When Each Wins

| Capability | DSPy | LangChain | LlamaIndex |
|---|---|---|---|
| Manual prompt control | Low (abstracted away) | High | Medium |
| Automatic prompt optimization | Yes (core feature) | No | Limited |
| Built-in RAG abstractions | Basic | Extensive | Extensive |
| Multi-agent orchestration | Via dspy.ReAct | Via LangGraph | Via agent workflows |
| Learning curve | Medium-high | Medium | Medium |
| Best for | Systematic optimization of compound pipelines | Flexible chaining with large ecosystem | Data ingestion, indexing, and retrieval |

DSPy is not a replacement for LangChain or LlamaIndex. Use DSPy when you have a compound pipeline with a measurable metric and you need systematic optimization. Use LangChain when you need flexibility, a large third-party integration ecosystem, or established agent orchestration patterns. Use LlamaIndex when data ingestion and retrieval quality are the main concern.

Many production teams use all three: LlamaIndex for the data pipeline, LangChain for orchestration, and DSPy to optimize the prompts within each module.

Production Deployment Patterns

Compiled Program Artifacts and Versioning

DSPy programs serialize to JSON. Treat these files as model artifacts and version them alongside the base model checkpoint.

python
# Save with metadata
compiled_rag.save("optimized_rag_v1.json")

# Recommended naming: model, date, dev-set metric score
# e.g., optimized_rag_llama4scout_20260423_748.json
# (748 = 74.8% dev metric)

Load in production:

python
rag = RAGPipeline(index=index, passages=passages, embed_fn=embed)
rag.load("optimized_rag_llama4scout_20260423_748.json")

Store versioned artifacts in S3 or a model registry. When the base model is updated or fine-tuned, the compiled artifact needs reoptimization against the new checkpoint since the frozen prompts were tuned to the previous model's behavior.
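
A minimal sketch of one way to publish the versioned artifact to S3 with boto3. The bucket name and helper are illustrative, not part of DSPy; the key format follows the naming convention above:

python
from datetime import datetime, timezone

import boto3

def publish_artifact(local_path: str, model_tag: str, dev_score: float,
                     bucket: str = "my-dspy-artifacts") -> str:
    """Upload a compiled DSPy program under a versioned S3 key and return the key."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    score_tag = str(round(dev_score * 1000))  # 0.748 -> "748"
    key = f"dspy/optimized_rag_{model_tag}_{stamp}_{score_tag}.json"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key

# publish_artifact("optimized_rag_v1.json", "llama4scout", 0.748)
# -> "dspy/optimized_rag_llama4scout_<YYYYMMDD>_748.json"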

Serving a Compiled DSPy Program

A compiled DSPy program is a stateless callable. At serve time, it makes inference calls to the target LLM endpoint and nothing else. No optimizer dependencies are needed at inference time.

Wrap it in FastAPI:

python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
import dspy

app = FastAPI()

class QuestionRequest(BaseModel):
    question: str

# Load once at startup
lm = dspy.LM(model="openai/meta-llama/Llama-4-Scout-17B-16E-Instruct",
             api_base="http://llm-endpoint:8000/v1", api_key="none")
dspy.configure(lm=lm)
rag = RAGPipeline(index=index, passages=passages, embed_fn=embed)
rag.load("optimized_rag_llama4scout_20260423_748.json")

@app.post("/answer")
async def answer(request: QuestionRequest):
    # Wrap in dspy.context so each thread gets its own LM reference,
    # preventing bleed from concurrent dspy.context() overrides in other threads.
    def run_rag():
        with dspy.context(lm=lm):
            return rag(question=request.question)
    result = await asyncio.to_thread(run_rag)
    return {"answer": result.answer, "citations": result.citations}

For multi-model pipelines or queue-depth autoscaling, wrap the compiled program in a Ray Serve deployment instead. The compiled DSPy program maps cleanly onto a Ray Serve Deployment class: it is stateless, CPU-bound at serve time, and its only external dependency is the vLLM endpoint.
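
A minimal Ray Serve sketch under those assumptions; the replica count, route, and endpoint URL are illustrative, and index, passages, embed, and the artifact file are the same objects used in the FastAPI example above:

python
import dspy
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)  # scale replicas to match your vLLM endpoint's throughput
class OptimizedRAG:
    def __init__(self):
        lm = dspy.LM(
            model="openai/meta-llama/Llama-4-Scout-17B-16E-Instruct",
            api_base="http://llm-endpoint:8000/v1",
            api_key="none",
        )
        dspy.configure(lm=lm)  # each replica is its own process, so this stays isolated
        self.rag = RAGPipeline(index=index, passages=passages, embed_fn=embed)
        self.rag.load("optimized_rag_llama4scout_20260423_748.json")

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        result = self.rag(question=body["question"])
        return {"answer": result.answer, "citations": result.citations}

serve.run(OptimizedRAG.bind(), route_prefix="/answer")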

Drift Monitoring and Re-Optimization

Production LLM responses drift as models are updated or fine-tuned. Monitor your production metric in a sliding window.

  1. Route 1-2% of production queries through the same LLM-as-judge metric used during optimization.
  2. Track the 7-day rolling average of the judge score.
  3. When the rolling metric drops 5%+ from the post-optimization baseline, trigger a re-optimization job on Spheron spot H100s.
  4. Load the current checkpoint's compiled artifact as the starting program and run MIPROv2 with the new training data collected since the last optimization.

Keep optimization runs on spot instances to minimize cost. The optimization script checkpoints to disk after each trial, so spot preemption only costs the current trial, not the full run.
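
A minimal sketch of the drift check itself, assuming a hypothetical fetch_recent_judge_scores() that reads per-day judge scores from your logging store and a hypothetical trigger_reoptimization_job() that launches the spot run:

python
from statistics import mean

BASELINE_SCORE = 0.748   # dev metric recorded right after the last optimization run
DRIFT_TOLERANCE = 0.05   # re-optimize on a 5%+ relative drop in the rolling judge score

def check_for_drift(fetch_recent_judge_scores, trigger_reoptimization_job) -> bool:
    """Compare the 7-day rolling judge score against the post-optimization baseline."""
    daily_scores = fetch_recent_judge_scores(days=7)  # e.g. [0.73, 0.71, ...]
    if len(daily_scores) < 7:
        return False  # not enough data in the window yet
    rolling = mean(daily_scores)
    if rolling < BASELINE_SCORE * (1 - DRIFT_TOLERANCE):
        trigger_reoptimization_job(
            starting_artifact="optimized_rag_llama4scout_20260423_748.json",
        )
        return True
    return False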

End-to-End Cost: Optimizing a 5-Module Pipeline on Spheron H100s

Concrete example: 5-module pipeline, 100 training examples, MIPROv2 with num_trials=50.

Inference call estimate:

  • Proposal phase: 5 modules × 20 candidate instructions × 10 evaluation examples each = 1,000 calls
  • Evaluation phase: 50 trials × 100 training examples × 5 modules = 25,000 calls
  • Bootstrap demos: 5 modules × 50 bootstrapped traces = 250 calls
  • Total: ~26,000 inference calls

Token estimate:

  • Average 2,000 tokens per call (includes RAG context)
  • Total: 52M tokens

Cost on Spheron H100 PCIe ($2.01/GPU-hr on-demand):

  • Llama 4 Scout throughput: ~2,000 tokens/sec on H100 PCIe (INT4)
  • GPU time: 52M / (2,000 × 3,600) = ~7.2 GPU-hours
  • On-demand cost: 7.2 × $2.01 = $14.47

Equivalent GPT-4o API cost (at ~$10/1M tokens blended rate):

  • 52M tokens × $10/1M = ~$520

The 36x cost difference is why self-hosted inference makes DSPy's MIPROv2 practical for production optimization loops, not just one-time experiments.
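
The same arithmetic as a small helper, with the throughput and price figures above as default assumptions:

python
def optimization_cost_usd(total_tokens: float = 52e6,
                          tokens_per_sec: float = 2_000.0,  # Llama 4 Scout, H100 PCIe, INT4
                          gpu_usd_per_hr: float = 2.01,     # Spheron H100 PCIe on-demand
                          api_usd_per_mtok: float = 10.0) -> dict:
    """Compare self-hosted GPU-hour cost against a blended per-token API rate for one run."""
    gpu_hours = total_tokens / (tokens_per_sec * 3600)
    return {
        "gpu_hours": round(gpu_hours, 1),
        "self_hosted_usd": round(gpu_hours * gpu_usd_per_hr, 2),
        "api_usd": round(total_tokens / 1e6 * api_usd_per_mtok, 2),
    }

print(optimization_cost_usd())
# {'gpu_hours': 7.2, 'self_hosted_usd': 14.52, 'api_usd': 520.0}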

Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing → for live rates.


DSPy turns optimization from a weekend manual exercise into a systematic compilation pass. The bottleneck is inference throughput during the MIPROv2 search, which is exactly where Spheron spot GPUs pay off.

Rent H100 → | Rent H200 → | Rent A100 → | View all pricing →

Get started on Spheron →
