Engineering

Agent Memory Infrastructure on GPU Cloud: Deploy Mem0, Zep, and Persistent Vector Memory for Production AI Agents (2026)

Written by Mitrasish, Co-founder · Apr 24, 2026
Tags: AI Agent Memory, Mem0 Deployment, Zep Memory Server, Persistent LLM Memory GPU, Letta, MemGPT, LangGraph Memory, Self-Hosted Agent Memory, Vector Store, GPU Cloud, Embedding Models, L40S

Stateless agents forget everything the moment a session ends. At a single-user scale, that is annoying. At a thousand users, it makes personalization impossible and forces every prompt to carry full context, ballooning token costs. The GPU infrastructure for AI agents guide covers the compute stack broadly, and the scale AI agent fleets guide covers orchestration at high concurrency. This post focuses on the memory tier specifically: what Mem0, Zep, and Letta actually need from your GPU, and how to deploy them on Spheron.

Why Agent Memory Became a First-Class Infrastructure Problem in 2026

A year ago, most agent demos were single-session. The agent got a task, ran to completion, and was done. No history needed.

Production agents in 2026 look different. A personal assistant agent that handles email triage needs to know what you have asked it to ignore before. A coding agent needs to remember that your project uses Python 3.12 and a specific linting config. A customer service agent should not ask for an order number that was provided two sessions ago.

These requirements push memory out of the prompt and into a separate infrastructure layer. You cannot fit a user's full history in context. You need a system that stores memories compactly, retrieves the relevant ones at query time, and keeps the retrieval latency low enough to not break the agent's response budget.

Three operations drive the GPU cost in a memory system:

  1. Embedding writes: every new memory fact gets converted to a vector and written to the store. This runs on GPU because the throughput gap vs. CPU is 50-100x at the token level.
  2. Reranked retrieval reads: at query time, candidate memories are fetched by vector similarity, then re-scored by a reranker model to surface the most relevant ones. The reranker is another GPU-resident model.
  3. LLM-based summarization and extraction: systems like Zep run an LLM over raw conversation transcripts to extract structured facts and entities. This is the highest VRAM cost in the stack.

Three Types of Memory That Production Agents Need

Episodic memory stores specific past events. "User ran into an OOM error with Llama 70B on the A100 last session." "User asked to draft a blog post on inference optimization on March 14." These are stored as vector embeddings and retrieved by semantic similarity to the current query. Most memory systems support this.

Semantic memory stores persistent facts and preferences. "User works at a fintech company." "User prefers TypeScript." "User's deployment target is Kubernetes 1.29 on GKE." An LLM extracts these from conversation transcripts and stores them as structured key-value pairs or graph nodes. Zep's knowledge graph layer handles this well. Mem0 can do it with optional LLM-based extraction.

Procedural memory stores learned workflows and successful patterns. "When asked to debug Python, start by checking import errors." Most frameworks handle this through fine-tuned system prompts or explicit few-shot examples rather than runtime retrieval. Letta's core memory block is one way to persist this kind of operational context across sessions.

The GPU requirements differ per type. Episodic memory needs fast, high-throughput embedding. Semantic memory extraction needs a capable summarization LLM. Procedural memory is mostly static and doesn't add GPU cost at inference time.

The 2026 Agent Memory Landscape

System | Memory approach | GPU components required | Vector store | Graph store
Mem0 | Extracts facts from conversations | Embedding model (required), LLM extractor (optional) | Qdrant, Chroma, Pinecone | No
Zep CE | Temporal knowledge graph + vector | Embedding model (required), summarization LLM (required) | pgvector | Neo4j-compatible
Letta (MemGPT) | Core memory + archival retrieval | Embedding model (required), LLM reasoning (required) | Vector store | No
LangGraph Memory | Checkpointing + store-backed | Embedding model (optional) | PostgreSQL, Redis | No
MemGPT variants | In-context paging | Embedding model (required), LLM (required) | Any | No

When to use each: Mem0 is the lowest-friction starting point. If you want vector-based memory retrieval with optional fact extraction and your team does not need time-aware queries, start here. Zep CE is the right choice when your agent needs to answer questions about what happened "last Tuesday" or "three sessions ago" because its knowledge graph is time-indexed. Letta is better suited for agents where the reasoning model itself needs to manage what is in its active context window. LangGraph Memory works well if you are already on LangGraph and want lightweight checkpointing without a separate memory service.

GPU-Backed Components in a Memory Stack

Embedding Model

Every memory write goes through an embedding model. At production scale, this is the highest-frequency GPU operation in your memory stack. The model needs to run fast enough that individual write calls don't queue.

Model | Params | VRAM (FP16) | Notes
BGE-M3 | 0.57B | ~1 GB | Multilingual, hybrid search support
Qwen3-Embedding-0.6B | 0.6B | ~1.5 GB | Strong small-model baseline as of Apr 2026
Qwen3-Embedding-4B | 4B | ~8 GB | High quality, multilingual
E5-Mistral-7B-Instruct | 7B | ~14 GB | Best quality, high VRAM cost

BGE-M3 is the practical default: 1GB VRAM, handles multilingual input, and supports hybrid dense+sparse retrieval out of the box. For teams running Qwen-family models, the Qwen3-Embedding-4B gives a meaningful quality bump when the memory vector space needs to handle ambiguous or technical queries. For the full TEI deployment setup, see the self-host embedding and rerankers guide.

Reranker

The reranker runs on the memory read path. After vector search returns 50 candidate memories, the reranker scores each against the query and returns the top 5. Without a reranker, approximate nearest-neighbor search produces false positives that contaminate the agent's context with irrelevant facts.

Model | VRAM (FP16) | Latency per batch (50 pairs) | Notes
BGE-reranker-v2-m3 | ~1.5 GB | ~20ms on L40S | Best quality-to-VRAM ratio
BGE-reranker-v2-minicpm-layerwise | ~2.5 GB | ~35ms | Layer-wise for speed
bge-reranker-large | ~1.3 GB | ~15ms | Faster, slightly lower quality

Run the reranker as a second TEI instance on port 8081. The VRAM cost is low enough that it fits alongside the embedding model and summarizer on any 24GB+ GPU without meaningful contention.
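TEI's `/rerank` route takes `{"query": ..., "texts": [...]}` and returns a list of `{"index", "score"}` objects; that part is TEI's documented API, while the port 8081 URL and top-5 cutoff below follow the setup described in this post. A sketch of the read-path rerank step:

```python
import json
import urllib.request

RERANK_URL = "http://localhost:8081/rerank"  # second TEI instance from the text above

def rerank_payload(query, candidates):
    """Build the body for TEI's /rerank route."""
    return json.dumps({"query": query, "texts": candidates}).encode("utf-8")

def top_k(scored, candidates, k=5):
    """TEI returns [{'index': i, 'score': s}, ...]; keep the k best memories."""
    best = sorted(scored, key=lambda r: r["score"], reverse=True)[:k]
    return [candidates[r["index"]] for r in best]

def rerank(query, candidates, k=5, url=RERANK_URL):
    """Score all candidate memories against the query; return the top k texts."""
    req = urllib.request.Request(
        url,
        data=rerank_payload(query, candidates),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return top_k(json.loads(resp.read()), candidates, k)
```

In the full read path, `candidates` is the top-50 list from the vector store, and the returned top-5 strings get injected into the agent context.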

Summarization LLM

Zep CE requires one. Mem0 uses one for higher-quality fact extraction. The model processes raw conversation transcripts and outputs structured facts: entities, relationships, timestamps. Quality matters here more than throughput since the extraction runs asynchronously, not in the hot path.

Model | VRAM (FP8) | Memory ops/sec (approx) | Notes
Qwen3-8B | ~8 GB | ~50 | Solid baseline for Zep extraction
Llama-3.1-8B-Instruct | ~8 GB | ~50 | Reliable, well-tested
Llama-4-Scout (17B active / 109B total MoE) | ~109 GB FP8 / ~55 GB INT4 | ~30 | Needs H100/H200 or multi-GPU setup; not viable on L40S or RTX PRO 6000 at FP8
Qwen3-32B | ~32 GB | ~15 | High-quality extraction, needs bigger GPU

For most teams, Qwen3-8B at FP8 is the right call. It handles conversation summarization cleanly, uses 8GB VRAM at FP8, and leaves 38GB free on an L40S for the embedding server, reranker, and future headroom. Step up to a 32B model only if your agent handles complex multi-entity domains where the 8B extraction quality is producing noisy memories. Note that Llama-4-Scout, despite the "17B active" label, is a 109B-total MoE model: all 109B weights load into VRAM, which puts it out of range for L40S or RTX PRO 6000 at FP8.

Architecture: Memory Extractor + Vector Store + Graph Store on Spheron

All GPU components run on one Spheron GPU node. The vector store and graph store run as Docker containers on the same node using CPU and NVMe.

Write path:

Agent conversation turn
  -> Memory extraction service (summarization LLM on GPU, async)
    -> New facts extracted as structured objects
      -> Embedding service (TEI on GPU) converts each fact to a vector
        -> Vector store write (Qdrant/Chroma, CPU + NVMe)
        -> Graph store write (pgvector or Neo4j, CPU, Zep only)

Read path:

User query (current turn)
  -> Embed query via TEI (GPU, ~2ms)
    -> Vector search: top-50 candidates (Qdrant/Chroma, CPU, ~3ms)
      -> Reranker: score 50 pairs, return top-5 (GPU, ~20ms)
        -> Top-5 memories injected into agent context

Total retrieval latency on the same node: 25-30ms. Over a network to managed vector search: 100-400ms. Colocation cuts 90% of the read latency.

The extraction write path runs async because users don't wait for it. The agent fires a background task after each conversation turn. Extraction latency of 0.5-2 seconds per turn has no impact on user-facing response time.
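The fire-and-forget pattern is simple to sketch. This is a toy stand-in using a background thread and a queue: `extract` is a placeholder for the real extraction call (summarization LLM, embedding, store write), and `extracted_facts` stands in for the vector store.

```python
import queue
import threading

extraction_queue = queue.Queue()
extracted_facts = []  # stand-in for the vector/graph store

def extract(turn):
    """Placeholder: in production this calls the summarization LLM and writes vectors."""
    extracted_facts.append(turn)

def worker():
    while True:
        turn = extraction_queue.get()
        extract(turn)
        extraction_queue.task_done()

# One background worker; daemon=True so it never blocks process shutdown.
threading.Thread(target=worker, daemon=True).start()

def on_conversation_turn(user_id, text):
    """Hot path: enqueue the turn and return immediately; the user never waits."""
    extraction_queue.put({"user_id": user_id, "text": text})

on_conversation_turn("u-001", "We're on Kubernetes 1.29, running on GKE")
extraction_queue.join()  # only for the demo; the real agent never blocks on this
print(extracted_facts[0]["user_id"])  # u-001
```

In an async framework the same shape falls out of `asyncio.create_task` or a task queue like Celery; the invariant is that the user-facing response never waits on extraction.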

Deploying Mem0 with a Self-Hosted Embedding Model on Spheron

Prerequisites

  • L40S instance on Spheron (SSH access confirmed)
  • Docker installed (curl -fsSL https://get.docker.com | sh)
  • Qdrant running locally

Start Qdrant:

bash
docker run -d -p 6333:6333 -p 6334:6334 \
  -v qdrant_storage:/qdrant/storage \
  qdrant/qdrant

Or use Docker Compose to bring up both TEI and Qdrant together:

yaml
version: "3.8"
services:
  tei-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.9
    ports:
      - "8080:80"
    command: --model-id BAAI/bge-m3 --max-batch-tokens 65536
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_storage:/qdrant/storage

volumes:
  qdrant_storage:

Start with: docker compose up -d

Verify both are healthy:

bash
curl localhost:8080/health   # TEI embedding server
curl localhost:6333/healthz  # Qdrant

Configure Mem0

python
from mem0 import Memory

config = {
    "embedder": {
        "provider": "huggingface",
        "config": {
            "huggingface_base_url": "http://localhost:8080"
        }
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "host": "localhost",
            "port": 6333,
            "collection_name": "agent_memory"
        }
    }
}

m = Memory.from_config(config)

Test memory round-trip

python
# Write a memory
m.add("User prefers Rust for systems code and Python for ML scripts", user_id="u-001")

# Retrieve relevant memories
results = m.search("What language does this user prefer for ML?", user_id="u-001")
for r in results:
    print(r["memory"], r["score"])

Expected output: the Rust/Python preference fact should surface with a score above 0.85.

The Mem0 process itself is CPU-bound. It calls your local TEI endpoint for embeddings and Qdrant for storage. No GPU process runs inside Mem0. Only TEI uses the GPU.

Mem0 cloud vs. self-hosted: Mem0 offers a managed cloud tier. Self-hosting gives you data control, eliminates per-API-call embedding costs, and lets you swap the embedding model without changing application code. At 50M+ embedding tokens per month, self-hosting on an L40S is meaningfully cheaper.

Deploying Zep CE with a Summarization LLM (Qwen 3 or Llama 4)

Zep Community Edition requires an LLM for its knowledge graph extraction step. It calls the LLM over a standard OpenAI-compatible API, so any vLLM instance works.

Deploy the summarization LLM

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B \
  --dtype fp8 \
  --gpu-memory-utilization 0.45

The 0.45 utilization cap reserves VRAM for the TEI embedding and reranker containers running on the same GPU. Verify the model is loaded:

bash
curl localhost:8000/v1/models

Configure Zep CE

Clone the Zep CE repository and edit the .env file:

bash
git clone https://github.com/getzep/zep
cd zep
cp .env.example .env

Set these variables in .env:

bash
ZEP_LLM_PROVIDER=openai_compat
ZEP_LLM_BASE_URL=http://host.docker.internal:8000/v1
ZEP_LLM_MODEL=Qwen/Qwen3-8B
ZEP_EMBEDDINGS_SERVICE_URL=http://host.docker.internal:8080

Zep CE's docker-compose exposes port 8000 by default, which conflicts with the vLLM container already bound to that port. Before starting Zep, open docker-compose.yml and remap the Zep service's host port to 8003:

yaml
# in docker-compose.yml, under the zep service ports:
ports:
  - "8003:8000"
extra_hosts:
  - "host.docker.internal:host-gateway"

The extra_hosts entry is required on Linux. Without it, localhost inside the Zep container resolves to the container's own loopback, not the host machine where vLLM and TEI are running. host.docker.internal with the host-gateway entry maps correctly to the host.

Then start Zep CE:

bash
docker compose up -d

Test Zep memory extraction

python
from zep_python import ZepClient, Message

client = ZepClient(base_url="http://localhost:8003")

session_id = "session-001"
messages = [
    Message(role="user", content="I'm building an agent that deploys ML models to Kubernetes"),
    Message(role="assistant", content="Got it. What's your current Kubernetes version?"),
    Message(role="user", content="We're on 1.29, running on GKE in us-central1")
]

client.memory.add_memory(session_id=session_id, messages=messages)

# Give Zep a moment to run extraction (async)
import time
time.sleep(3)

memory = client.memory.get_memory(session_id=session_id)
print(memory.facts)

Zep will extract facts like "user is deploying ML models to Kubernetes 1.29 on GKE in us-central1" and store them in its knowledge graph. Future queries for context about this user will surface these structured facts.

The Zep CE extraction loop runs on your vLLM instance. Each session's message batch triggers one LLM call. At 500 sessions per day with 10 messages per session, you get 500 extraction calls: roughly 250 seconds of GPU time at 0.5 seconds per call on Qwen3-8B FP8.
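A single extraction call is just an OpenAI-compatible chat completion against the vLLM container started above. The sketch below is illustrative, not Zep's internal prompt: the endpoint and model name match the deployment in this post, but the system prompt is an assumption about what a fact-extraction request might look like.

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def extraction_request(transcript):
    """Build an OpenAI-compatible chat request asking the 8B model for structured facts."""
    return json.dumps({
        "model": "Qwen/Qwen3-8B",
        "messages": [
            {"role": "system",
             "content": "Extract durable user facts from the transcript as a JSON list."},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0.0,  # deterministic extraction, no creative variance
    }).encode("utf-8")

def extract_facts(transcript, url=VLLM_URL):
    """Send one session's transcript for extraction; returns the model's raw text."""
    req = urllib.request.Request(
        url,
        data=extraction_request(transcript),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Because Zep already drives this loop internally, you would only write something like this yourself when building a custom extraction pipeline on the same vLLM instance.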

GPU Sizing and Cost Math

GPU | VRAM | On-Demand | Spot | What memory stack it runs
L40S PCIe | 48 GB | $0.72/hr | N/A | Embedding + reranker + 8B summarizer (Mem0 or Zep)
RTX PRO 6000 | 96 GB | $1.70/hr | $0.59/hr | Full stack + 32B summarizer + medium inference model
A100 80GB PCIe | 80 GB | $1.04/hr | $1.14/hr* | Full stack + 30B summarizer or 70B inference co-located

Pricing fluctuates based on GPU availability. The prices above are based on 24 Apr 2026 and may have changed. Check current GPU pricing for live rates.

*A100 80GB PCIe spot price is currently higher than on-demand due to capacity constraints. Spot pricing can invert like this when spot availability is tight. Check live pricing before assuming spot is cheaper.

Memory operations cost math for 100 active users:

  • User activity: 100 users, 50 messages each per day = 5,000 messages
  • Memory writes: 1 embedding call per message = 5,000 embedding calls/day
  • Memory reads: 5 searches/day per user = 500 retrieval calls/day
  • Zep extraction: every 10 messages triggers one LLM call = 500 LLM calls/day

BGE-M3 on an L40S handles roughly 10,000 embeddings/second at batch size 64 for short text inputs (under 128 tokens). On longer memory facts, throughput drops proportionally. All 5,000 daily writes still take well under a minute total at typical memory content lengths. The reranker adds ~20ms per read query: 500 queries * 20ms = 10 seconds of GPU time per day.

The bottleneck is summarization. 500 Zep extraction calls at 0.5 seconds each = 250 seconds of GPU time per day.

Total: 250 seconds / 3600 = 0.069 GPU-hours/day. At L40S on-demand pricing of $0.72/hr, that is under $0.05/day for 100 users' worth of memory operations. The memory stack is not where your GPU bill comes from. Your inference model is.
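The arithmetic above compresses into a few lines, which makes it easy to re-run with your own user counts. The per-call latencies and the hourly price are this post's estimates, not measured constants:

```python
USERS = 100
MSGS_PER_USER = 50
SEARCHES_PER_USER = 5
MSGS_PER_EXTRACTION = 10

EXTRACTION_SEC = 0.5   # per LLM call on Qwen3-8B FP8 (estimate from this post)
RERANK_SEC = 0.020     # per read query on L40S (estimate from this post)
L40S_HOURLY = 0.72     # on-demand $/hr at time of writing

messages = USERS * MSGS_PER_USER                # 5,000 embedding writes/day
reads = USERS * SEARCHES_PER_USER               # 500 retrieval calls/day
extractions = messages // MSGS_PER_EXTRACTION   # 500 LLM calls/day

# Summarization dominates; the reranker adds a rounding error's worth of seconds.
gpu_seconds = extractions * EXTRACTION_SEC + reads * RERANK_SEC
daily_cost = gpu_seconds / 3600 * L40S_HOURLY

print(f"{gpu_seconds:.0f} GPU-seconds/day -> ${daily_cost:.3f}/day")
# 260 GPU-seconds/day -> $0.052/day
```

Scaling `USERS` to 1,000 still lands around $0.52/day, which is why the memory tier stays a rounding error next to inference.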

When memory operations and inference run on the same node, the embedding and reranker containers use a fraction of the VRAM while your primary LLM fills the rest. Rent an L40S on Spheron and run both the memory stack and an 8B inference model on the same card.

Integrating Agent Memory with MCP and Multi-Agent Orchestration

Memory is a cross-cutting concern in agent architectures. Three patterns for wiring it in:

Pattern 1: Memory as MCP tools. Expose Mem0 or Zep as MCP tools (memory_search, memory_add, memory_update). The MCP server process runs on CPU. It calls your Mem0 or Zep HTTP API, which in turn calls the GPU-resident embedding server. From the agent's perspective, memory is just another tool call. The MCP server GPU deployment guide covers the wrapping pattern in detail.

Pattern 2: LangGraph integration. Mem0 has a LangGraph-compatible adapter that plugs into the BaseStore interface. Zep has a LangChain memory class. Both let you wire persistent memory into a LangGraph graph node without restructuring your agent logic. The memory node runs after each user turn and before the next planning step.

Pattern 3: Shared memory pool for multi-agent systems. Multiple agents write to and read from the same Mem0 or Zep instance, keyed by user_id or session_id. Agent A can write a memory about a user preference; Agent B can read it on the next turn without any direct inter-agent communication. See multi-agent AI system GPU infrastructure for the broader orchestration architecture.
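To make the keying concrete, here is a deliberately minimal in-process stand-in for a shared memory pool: a dict keyed by `user_id` in place of a real Mem0 or Zep instance, and naive substring matching in place of vector search plus reranking. It exists only to show that agents share state through the key, not through direct communication.

```python
from collections import defaultdict

class SharedMemoryPool:
    """Toy stand-in for a shared Mem0/Zep instance, keyed by user_id."""

    def __init__(self):
        self._store = defaultdict(list)

    def add(self, user_id, fact):
        """Any agent can write a memory for a user."""
        self._store[user_id].append(fact)

    def search(self, user_id, query):
        """Any other agent can read it back; real systems use vector similarity."""
        return [f for f in self._store[user_id] if query.lower() in f.lower()]

pool = SharedMemoryPool()

# Agent A records a preference; Agent B retrieves it on a later turn.
pool.add("u-001", "User prefers TypeScript for frontend work")
print(pool.search("u-001", "typescript"))
```

Swapping the dict for HTTP calls to your Mem0 or Zep deployment preserves the exact same interface shape.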

When memory retrieval and RAG retrieval coexist in the same agent, they can share a single embedding server. The TEI instance handles both the document embedding pipeline and the memory write pipeline, since both call the same /embed endpoint. See the agentic RAG GPU infrastructure guide for how to co-locate the retrieval stack with a memory layer: the embedding server serves both workloads, and the vector stores are just two separate collections.


Agent memory is the GPU workload that most teams underestimate until they try to scale it. An L40S on Spheron runs the full Mem0 or Zep stack - embedding server, reranker, and 8B summarizer - for under $1/hr on-demand. For teams co-locating memory infrastructure with their primary inference model, the RTX PRO 6000 offers 96GB GDDR7 to fit both workloads on one node.

Rent L40S on Spheron → | Rent RTX PRO 6000 → | View all GPU pricing →
