Engineering

Agent Memory Infrastructure on GPU Cloud: Deploy Mem0, Zep, and Persistent Vector Memory for Production AI Agents (2026)

AI Agent MemoryMem0 DeploymentZep Memory ServerPersistent LLM Memory GPULettaMemGPTLangGraph MemorySelf-Hosted Agent MemoryVector StoreGPU CloudEmbedding ModelsL40S
Agent Memory Infrastructure on GPU Cloud: Deploy Mem0, Zep, and Persistent Vector Memory for Production AI Agents (2026)

Stateless agents forget everything the moment a session ends. At a single-user scale, that is annoying. At a thousand users, it makes personalization impossible and forces every prompt to carry full context, ballooning token costs. The GPU infrastructure for AI agents guide covers the compute stack broadly, and the scale AI agent fleets guide covers orchestration at high concurrency. This post focuses on the memory tier specifically: what Mem0, Zep, and Letta actually need from your GPU, and how to deploy them on Spheron.

Why Agent Memory Became a First-Class Infrastructure Problem in 2026

A year ago, most agent demos were single-session. The agent got a task, ran to completion, and was done. No history needed.

Production agents in 2026 look different. A personal assistant agent that handles email triage needs to know what you have asked it to ignore before. A coding agent needs to remember that your project uses Python 3.12 and a specific linting config. A customer service agent should not ask for an order number that was provided two sessions ago.

These requirements push memory out of the prompt and into a separate infrastructure layer. You cannot fit a user's full history in context. You need a system that stores memories compactly, retrieves the relevant ones at query time, and keeps the retrieval latency low enough to not break the agent's response budget.

Three operations drive the GPU cost in a memory system:

  1. Embedding writes: every new memory fact gets converted to a vector and written to the store. This runs on GPU because the throughput gap vs. CPU is 50-100x at the token level.
  2. Reranked retrieval reads: at query time, candidate memories are fetched by vector similarity, then re-scored by a reranker model to surface the most relevant ones. The reranker is another GPU-resident model.
  3. LLM-based summarization and extraction: systems like Zep run an LLM over raw conversation transcripts to extract structured facts and entities. This is the highest VRAM cost in the stack.

Three Types of Memory That Production Agents Need

Episodic memory stores specific past events. "User ran into an OOM error with Llama 70B on the A100 last session." "User asked to draft a blog post on inference optimization on March 14." These are stored as vector embeddings and retrieved by semantic similarity to the current query. Most memory systems support this.

Semantic memory stores persistent facts and preferences. "User works at a fintech company." "User prefers TypeScript." "User's deployment target is Kubernetes 1.29 on GKE." An LLM extracts these from conversation transcripts and stores them as structured key-value pairs or graph nodes. Zep's knowledge graph layer handles this well. Mem0 can do it with optional LLM-based extraction.

Procedural memory stores learned workflows and successful patterns. "When asked to debug Python, start by checking import errors." Most frameworks handle this through fine-tuned system prompts or explicit few-shot examples rather than runtime retrieval. Letta's core memory block is one way to persist this kind of operational context across sessions.

The GPU requirements differ per type. Episodic memory needs fast, high-throughput embedding. Semantic memory extraction needs a capable summarization LLM. Procedural memory is mostly static and doesn't add GPU cost at inference time.

The 2026 Agent Memory Landscape

SystemMemory approachGPU components requiredVector storeGraph store
Mem0Extracts facts from conversationsEmbedding model (required), LLM extractor (optional)Qdrant, Chroma, PineconeNo
Zep CETemporal knowledge graph + vectorEmbedding model (required), Summarization LLM (required)pgvectorNeo4j-compatible
Letta (MemGPT)Core memory + archival retrievalEmbedding model (required), LLM reasoning (required)Vector storeNo
LangGraph MemoryCheckpointing + store-backedEmbedding model (optional)PostgreSQL, RedisNo
MemGPT variantsIn-context pagingEmbedding model (required), LLM (required)AnyNo

When to use each: Mem0 is the lowest-friction starting point. If you want vector-based memory retrieval with optional fact extraction and your team does not need time-aware queries, start here. Zep CE is the right choice when your agent needs to answer questions about what happened "last Tuesday" or "three sessions ago" because its knowledge graph is time-indexed. Letta is better suited for agents where the reasoning model itself needs to manage what is in its active context window. LangGraph Memory works well if you are already on LangGraph and want lightweight checkpointing without a separate memory service.

GPU-Backed Components in a Memory Stack

Embedding Model

Every memory write goes through an embedding model. At production scale, this is the highest-frequency GPU operation in your memory stack. The model needs to run fast enough that individual write calls don't queue.

ModelParamsVRAM (FP16)Notes
BGE-M30.57B~1 GBMultilingual, hybrid search support
Qwen3-Embedding-0.6B0.6B~1.5 GBStrong small-model baseline as of Apr 2026
Qwen3-Embedding-4B4B~8 GBHigh quality, multilingual
E5-Mistral-7B-Instruct7B~14 GBBest quality, high VRAM cost

BGE-M3 is the practical default: 1GB VRAM, handles multilingual input, and supports hybrid dense+sparse retrieval out of the box. For teams running Qwen-family models, the Qwen3-Embedding-4B gives a meaningful quality bump when the memory vector space needs to handle ambiguous or technical queries. For the full TEI deployment setup, see the self-host embedding and rerankers guide.

Reranker

The reranker runs on the memory read path. After vector search returns 50 candidate memories, the reranker scores each against the query and returns the top 5. Without a reranker, approximate nearest-neighbor search produces false positives that contaminate the agent's context with irrelevant facts.

ModelVRAM (FP16)Latency per batch (50 pairs)Notes
BGE-reranker-v2-m3~1.5 GB~20ms on L40SBest quality-to-VRAM ratio
BGE-reranker-v2-minicpm-layerwise~2.5 GB~35msLayer-wise for speed
bge-reranker-large~1.3 GB~15msFaster, slightly lower quality

Run the reranker as a second TEI instance on port 8081. The VRAM cost is low enough that it fits alongside the embedding model and summarizer on any 24GB+ GPU without meaningful contention.

Summarization LLM

Zep CE requires one. Mem0 uses one for higher-quality fact extraction. The model processes raw conversation transcripts and outputs structured facts: entities, relationships, timestamps. Quality matters here more than throughput since the extraction runs asynchronously, not in the hot path.

ModelVRAM (FP8)Memory ops/sec (approx)Notes
Qwen3-8B~8 GB~50Solid baseline for Zep extraction
Llama-3.1-8B-Instruct~8 GB~50Reliable, well-tested
Llama-4-Scout (17B active / 109B total MoE)~109 GB FP8 / ~55 GB INT4~30Needs H100/H200 or multi-GPU setup; not viable on L40S or RTX PRO 6000 at FP8
Qwen3-32B~32 GB~15High-quality extraction, needs bigger GPU

For most teams, Qwen3-8B at FP8 is the right call. It handles conversation summarization cleanly, uses 8GB VRAM at FP8, and leaves 38GB free on an L40S for the embedding server, reranker, and future headroom. Step up to a 32B model only if your agent handles complex multi-entity domains where the 8B extraction quality is producing noisy memories. Note that Llama-4-Scout, despite the "17B active" label, is a 109B-total MoE model: all 109B weights load into VRAM, which puts it out of range for L40S or RTX PRO 6000 at FP8.

Architecture: Memory Extractor + Vector Store + Graph Store on Spheron

All GPU components run on one Spheron GPU node. The vector store and graph store run as Docker containers on the same node using CPU and NVMe.

Write path:

Agent conversation turn
  -> Memory extraction service (summarization LLM on GPU, async)
    -> New facts extracted as structured objects
      -> Embedding service (TEI on GPU) converts each fact to a vector
        -> Vector store write (Qdrant/Chroma, CPU + NVMe)
        -> Graph store write (pgvector or Neo4j, CPU, Zep only)

Read path:

User query (current turn)
  -> Embed query via TEI (GPU, ~2ms)
    -> Vector search: top-50 candidates (Qdrant/Chroma, CPU, ~3ms)
      -> Reranker: score 50 pairs, return top-5 (GPU, ~20ms)
        -> Top-5 memories injected into agent context

Total retrieval latency on the same node: 25-30ms. Over a network to managed vector search: 100-400ms. Colocation cuts 90% of the read latency.

The extraction write path runs async because users don't wait for it. The agent fires a background task after each conversation turn. Extraction latency of 0.5-2 seconds per turn has no impact on user-facing response time. Agents with large persistent memory stores benefit especially from sleep-time compute, which pre-fills KV caches during idle periods so queries skip the expensive prefill step.

Deploying Mem0 with a Self-Hosted Embedding Model on Spheron

Prerequisites

  • L40S instance on Spheron (SSH access confirmed)
  • Docker installed (curl -fsSL https://get.docker.com | sh)
  • Qdrant running locally

Start Qdrant:

bash
docker run -d -p 6333:6333 -p 6334:6334 \
  -v qdrant_storage:/qdrant/storage \
  qdrant/qdrant

Or use Docker Compose to bring up both TEI and Qdrant together:

yaml
version: "3.8"
services:
  tei-embed:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.9
    ports:
      - "8080:80"
    command: --model-id BAAI/bge-m3 --max-batch-tokens 65536
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_storage:/qdrant/storage

volumes:
  qdrant_storage:

Start with: docker compose up -d

Verify both are healthy:

bash
curl localhost:8080/health   # TEI embedding server
curl localhost:6333/healthz  # Qdrant

Configure Mem0

python
from mem0 import Memory

config = {
    "embedder": {
        "provider": "huggingface",
        "config": {
            "huggingface_base_url": "http://localhost:8080"
        }
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "host": "localhost",
            "port": 6333,
            "collection_name": "agent_memory"
        }
    }
}

m = Memory.from_config(config)

Test memory round-trip

python
# Write a memory
m.add("User prefers Rust for systems code and Python for ML scripts", user_id="u-001")

# Retrieve relevant memories
results = m.search("What language does this user prefer for ML?", user_id="u-001")
for r in results:
    print(r["memory"], r["score"])

Expected output: the Rust/Python preference fact should surface with a score above 0.85.

The Mem0 process itself is CPU-bound. It calls your local TEI endpoint for embeddings and Qdrant for storage. No GPU process runs inside Mem0. Only TEI uses the GPU.

Mem0 cloud vs. self-hosted: Mem0 offers a managed cloud tier. Self-hosting gives you data control, eliminates per-API-call embedding costs, and lets you swap the embedding model without changing application code. At 50M+ embedding tokens per month, self-hosting on an L40S is meaningfully cheaper.

Deploying Zep CE with a Summarization LLM (Qwen 3 or Llama 4)

Zep Community Edition requires an LLM for its knowledge graph extraction step. It calls the LLM over a standard OpenAI-compatible API, so any vLLM instance works.

Deploy the summarization LLM

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B \
  --dtype fp8 \
  --gpu-memory-utilization 0.45

The 0.45 utilization cap reserves VRAM for the TEI embedding and reranker containers running on the same GPU. Verify the model is loaded:

bash
curl localhost:8000/v1/models

Configure Zep CE

Clone the Zep CE repository and edit the .env file:

bash
git clone https://github.com/getzep/zep
cd zep
cp .env.example .env

Set these variables in .env:

bash
ZEP_LLM_PROVIDER=openai_compat
ZEP_LLM_BASE_URL=http://host.docker.internal:8000/v1
ZEP_LLM_MODEL=Qwen/Qwen3-8B
ZEP_EMBEDDINGS_SERVICE_URL=http://host.docker.internal:8080

Zep CE's docker-compose exposes port 8000 by default, which conflicts with the vLLM container already bound to that port. Before starting Zep, open docker-compose.yml and remap the Zep service's host port to 8003:

yaml
# in docker-compose.yml, under the zep service ports:
ports:
  - "8003:8000"
extra_hosts:
  - "host.docker.internal:host-gateway"

The extra_hosts entry is required on Linux. Without it, localhost inside the Zep container resolves to the container's own loopback, not the host machine where vLLM and TEI are running. host.docker.internal with the host-gateway entry maps correctly to the host.

Then start Zep CE:

bash
docker compose up -d

Test Zep memory extraction

python
from zep_python import ZepClient, Message

client = ZepClient(base_url="http://localhost:8003")

session_id = "session-001"
messages = [
    Message(role="user", content="I'm building an agent that deploys ML models to Kubernetes"),
    Message(role="assistant", content="Got it. What's your current Kubernetes version?"),
    Message(role="user", content="We're on 1.29, running on GKE in us-central1")
]

client.memory.add_memory(session_id=session_id, messages=messages)

# Give Zep a moment to run extraction (async)
import time
time.sleep(3)

memory = client.memory.get_memory(session_id=session_id)
print(memory.facts)

Zep will extract facts like "user is deploying ML models to Kubernetes 1.29 on GKE in us-central1" and store them in its knowledge graph. Future queries for context about this user will surface these structured facts.

The Zep CE extraction loop runs on your vLLM instance. Each session's message batch triggers one LLM call. At 500 sessions per day with 10 messages per session, you get 500 extraction calls: roughly 250 seconds of GPU time at 0.5 seconds per call on Qwen3-8B FP8.

GPU Sizing and Cost Math

GPUVRAMOn-DemandSpotWhat memory stack it runs
L40S PCIe48 GB$0.72/hrN/AEmbedding + reranker + 8B summarizer (Mem0 or Zep)
RTX PRO 600096 GB$1.70/hr$0.59/hrFull stack + 32B summarizer + medium inference model
A100 80GB PCIe80 GB$1.04/hr$1.14/hr*Full stack + 30B summarizer or 70B inference co-located

Pricing fluctuates based on GPU availability. The prices above are based on 24 Apr 2026 and may have changed. Check current GPU pricing for live rates.

\A100 80GB PCIe spot price is currently higher than on-demand due to capacity constraints. Spot pricing can invert like this when spot availability is tight. Check live pricing before assuming spot is cheaper.*

Memory operations cost math for 100 active users:

  • User activity: 100 users, 50 messages each per day = 5,000 messages
  • Memory writes: 1 embedding call per message = 5,000 embedding calls/day
  • Memory reads: 5 searches/day per user = 500 retrieval calls/day
  • Zep extraction: every 10 messages triggers one LLM call = 500 LLM calls/day

BGE-M3 on an L40S handles roughly 10,000 embeddings/second at batch size 64 for short text inputs (under 128 tokens). On longer memory facts, throughput drops proportionally. All 5,000 daily writes still take well under a minute total at typical memory content lengths. The reranker adds ~20ms per read query: 500 queries * 20ms = 10 seconds of GPU time per day.

The bottleneck is summarization. 500 Zep extraction calls at 0.5 seconds each = 250 seconds of GPU time per day.

Total: 250 seconds / 3600 = 0.069 GPU-hours/day. At L40S on-demand pricing of $0.72/hr, that is under $0.05/day for 100 users' worth of memory operations. The memory stack is not where your GPU bill comes from. Your inference model is.

When memory operations and inference run on the same node, the embedding and reranker containers use a fraction of the VRAM while your primary LLM fills the rest. Rent an L40S on Spheron and run both the memory stack and an 8B inference model on the same card.

Integrating Agent Memory with MCP and Multi-Agent Orchestration

Memory is a cross-cutting concern in agent architectures. Three patterns for wiring it in:

Pattern 1: Memory as MCP tools. Expose Mem0 or Zep as MCP tools (memory_search, memory_add, memory_update). The MCP server process runs on CPU. It calls your Mem0 or Zep HTTP API, which in turn calls the GPU-resident embedding server. From the agent's perspective, memory is just another tool call. The MCP server GPU deployment guide covers the wrapping pattern in detail.

Pattern 2: LangGraph integration. Mem0 has a LangGraph-compatible adapter that plugs into the BaseStore interface. Zep has a LangChain memory class. Both let you wire persistent memory into a LangGraph graph node without restructuring your agent logic. The memory node runs after each user turn and before the next planning step. LangGraph's built-in checkpointing handles conversation turns, but for long-term cross-session memory, Mem0 and Zep sit alongside it as a separate retrieval layer. The LangGraph vs LangChain comparison guide covers how checkpoint state and vector memory interact.

Pattern 3: Shared memory pool for multi-agent systems. Multiple agents write to and read from the same Mem0 or Zep instance, keyed by user_id or session_id. Agent A can write a memory about a user preference; Agent B can read it on the next turn without any direct inter-agent communication. See multi-agent AI system GPU infrastructure for the broader orchestration architecture.

Teams using CrewAI can follow the same stack - the CrewAI production deployment guide covers how to wire Mem0 into a crew with persistent cross-session memory.

When memory retrieval and RAG retrieval coexist in the same agent, they can share a single embedding server. The TEI instance handles both the document embedding pipeline and the memory write pipeline, since both call the same /embed endpoint. See the agentic RAG GPU infrastructure guide for how to co-locate the retrieval stack with a memory layer: the embedding server serves both workloads, and the vector stores are just two separate collections.


Agent memory is the GPU workload that most teams underestimate until they try to scale it. An L40S on Spheron runs the full Mem0 or Zep stack - embedding server, reranker, and 8B summarizer - for under $1/hr on-demand. For teams co-locating memory infrastructure with their primary inference model, the RTX PRO 6000 offers 96GB GDDR7 to fit both workloads on one node.

L40S GPU on Spheron → | Rent RTX PRO 6000 → | View all GPU pricing →

STEPS / 07

Quick Setup Guide

  1. Choose your memory backend and assess GPU requirements

    Decide between Mem0 (lightweight, vector-only), Zep (graph + vector, needs LLM), or Letta (in-context + archival). Mem0 requires only an embedding model on GPU. Zep requires both an embedding model and a summarization LLM. List each component and its VRAM requirement. Total should fit within 85% of your target GPU's VRAM capacity.

  2. Provision a GPU instance on Spheron

    Log into app.spheron.ai, select L40S (48GB) for a standard memory stack or RTX PRO 6000 (96GB) for stacks with 30B+ summarizers, or rent a GPU from the wider [Spheron GPU rental](/gpu-rental/) catalog. Choose on-demand for persistent production services. SSH in and confirm the GPU with nvidia-smi.

  3. Deploy the embedding model and reranker with TEI

    Run the TEI embedding container: docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id BAAI/bge-m3 --max-batch-tokens 65536. Start a second TEI instance for the reranker on port 8081: docker run --gpus all -p 8081:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id BAAI/bge-reranker-v2-m3. Confirm both are healthy with curl localhost:8080/health and curl localhost:8081/health.

  4. Deploy the summarization LLM with vLLM (required for Zep, optional for Mem0)

    Start vLLM with an 8B model: docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest --model Qwen/Qwen3-8B --dtype fp8 --gpu-memory-utilization 0.45. The 0.45 utilization cap leaves VRAM headroom for the TEI containers. Verify with curl localhost:8000/v1/models.

  5. Deploy Mem0 with self-hosted embedding

    Install Mem0: pip install mem0ai. Configure it to use your local embedding server: set the embedder provider to 'huggingface' and set huggingface_base_url to http://localhost:8080 (the TEI native endpoint, without the /v1 OpenAI-compat prefix). Initialize with your vector store backend (Qdrant or Chroma running as Docker containers). For production Qdrant configuration including HNSW tuning and sharding, see the [vector database GPU cloud deployment guide](/blog/self-host-vector-database-gpu-cloud-qdrant-milvus-weaviate/). Test with a memory write: mem.add('User prefers Python 3.12 for all scripts', user_id='user-001').

  6. Deploy Zep CE with a custom LLM config

    Pull Zep Community Edition (github.com/getzep/zep) and edit docker-compose.yml to remap the Zep service host port from 8000 to 8003 (ports: - '8003:8000') and add extra_hosts: ['host.docker.internal:host-gateway'] to the Zep service so it can reach host-side containers. Set ZEP_LLM_PROVIDER=openai_compat, ZEP_LLM_BASE_URL=http://host.docker.internal:8000/v1, ZEP_LLM_MODEL=Qwen/Qwen3-8B, and ZEP_EMBEDDINGS_SERVICE_URL=http://host.docker.internal:8080 in the .env file. Then run docker compose up -d. Connect to Zep using base_url=http://localhost:8003.

  7. Connect agent memory to your agent framework or MCP server

    For LangGraph: use the built-in MemorySaver or wire Mem0/Zep as external stores via a custom checkpoint. For MCP: wrap your Mem0 or Zep client as MCP tools (memory_search, memory_add, memory_update) using the Python MCP SDK. The MCP server runs on CPU - only the backend embedding and LLM containers need GPU. See the MCP server deployment guide for the wrapping pattern.

FAQ / 06

Frequently Asked Questions

Agent memory systems run two GPU-intensive operations: embedding new memories into vector space (so they can be retrieved later) and summarizing long conversation threads into structured facts using an LLM. On CPU, a single embedding call takes 50-200ms. On GPU, it drops to 1-3ms. At 100 memory writes per user per day across thousands of users, that gap compounds fast. The summarization step is the more expensive one: a Llama 3.1 8B on GPU compresses a 20-turn conversation into a structured memory object in under one second. On CPU, the same task takes 30-90 seconds.

Mem0 extracts structured facts from conversations and stores them in a vector database, then retrieves them by semantic similarity on future turns. It is the most lightweight option. Zep builds a temporal knowledge graph from conversations using an LLM to extract entities and relationships, then combines graph traversal with vector search for retrieval. It handles time-aware queries better than pure vector stores. Letta (formerly MemGPT) manages memory inside the agent's own context window: a smaller in-context core memory (always present) and a larger archival memory (retrieved on demand). It requires a capable LLM as the core reasoning model.

A minimal stack covering embedding and an 8B summarization model fits in 12-16GB VRAM: BGE-M3 (1 GB) plus a BGE reranker (1.5 GB) plus Llama-3.1-8B-Instruct FP8 (8 GB) leaves 38GB free on an L40S for batch processing. A mid-tier stack with a 32B summarizer needs 35-40GB. For teams running both memory operations and primary LLM inference on the same node, an L40S (48GB) covers 8B-class summarizers and an RTX PRO 6000 (96GB) covers 32B-class summarizers alongside a 30B inference model.

Yes, on any GPU with 24GB+ VRAM, you can colocate the TEI embedding server (1-2GB VRAM), a reranker (1-2GB VRAM), and either the Mem0 server or Zep CE server (which are CPU-bound themselves). The GPU is used only by the embedding and reranker containers. If you also need a summarization LLM (required for Zep's knowledge graph extraction), add an 8B model on the same L40S. Only run separate instances if your memory write volume saturates the embedding server.

L40S (48GB) is the practical default for most teams. It fits the full memory stack (embedding + reranker + 8B summarizer) with room to spare, and the on-demand price makes it cost-effective for persistent workloads. RTX PRO 6000 (96GB GDDR7) is the right choice when you need a 30B+ summarizer alongside the memory stack or when you are co-locating memory infrastructure with primary LLM inference. A100 80GB PCIe is a strong alternative at similar VRAM with slightly higher memory bandwidth.

Managed embedding APIs (OpenAI text-embedding-3-large) cost $0.13 per 1M tokens. An L40S on Spheron at on-demand pricing handles roughly 216M embedding tokens per hour at sustained throughput using BGE-M3, which works out to under $0.01 per 1M tokens at 50% utilization. The crossover point is around 50-100M embedding tokens per month. Managed LLM APIs for summarization cost $1-5 per 1M tokens. Self-hosting an 8B model on L40S brings that below $0.02 per 1M tokens at reasonable utilization.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.