Self-Host Perplexity-Style AI Search on GPU Cloud: Deploy Perplexica, Morphic, and SearXNG (2026)

Perplexity Pro is $20/month for personal use. The moment you want API access for your product, the cost is a different conversation. Developer API pricing starts at $5 per 1M tokens for their base models, and a product doing 100k queries a month with 4k tokens of context per query easily runs $500-$2,000 a month before you hit rate limits. Teams doing internal tooling, enterprise search, or privacy-sensitive research have been self-hosting alternatives for exactly this reason.

This post covers deploying Perplexica on a Spheron H100 SXM5 instance with a vLLM backend running Llama 3.3 70B in FP8 quantization. We'll go through the GPU sizing math, the full Docker Compose setup, multimodal search with Qwen2.5-VL, semantic caching, and a cost comparison at 100k queries per month against Perplexity Pro API pricing.

What an AI Search Engine Actually Does

Every AI search system runs the same four-stage pipeline. Understanding it tells you exactly where the VRAM goes and where the latency comes from.

  1. Query expansion - The user's query gets rewritten or expanded by a small LLM or heuristic to improve recall. "how to run llama 70b" might expand to "llama 3 70b deployment docker vllm gpu requirements".
  2. Web retrieval - SearXNG sends the expanded query to multiple upstream search engines (Google, Bing, Brave, DuckDuckGo) in parallel, aggregates the results, and de-duplicates by URL. This stage is fast (50-200ms) but hits external APIs.
  3. Reranking - A cross-encoder model (typically BGE-Reranker-v2-m3) scores each retrieved document against the original query and returns a ranked list. This is where you separate relevant from tangentially-related results.
  4. LLM synthesis - The top-k passages (usually 5-10) get packed into the LLM's context window alongside the original query. The model writes a grounded answer with inline citations.
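
The retrieval-side stages above can be sketched in a few lines. This is a minimal illustration, not Perplexica's code: the `Hit` type, the function names, and the example URLs are hypothetical, and `overlap_score` is a toy word-overlap score standing in for a real cross-encoder reranker.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    text: str


def dedupe_by_url(hits):
    """Stage 2 cleanup: keep the first hit for each URL, preserving order."""
    seen = set()
    unique = []
    for hit in hits:
        if hit.url not in seen:
            seen.add(hit.url)
            unique.append(hit)
    return unique


def overlap_score(query, text):
    """Toy relevance score: fraction of query words found in the text.
    A stand-in for a cross-encoder reranker, not a replacement for one."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank(query, hits, top_k):
    """Stage 3: sort by score, highest first, and keep the top_k."""
    ranked = sorted(hits, key=lambda h: overlap_score(query, h.text), reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages):
    """Stage 4 input: number the passages so the model can cite [1], [2], ..."""
    sources = "\n".join(f"[{i}] {p.url}\n{p.text}" for i, p in enumerate(passages, 1))
    return (
        "Answer using only the sources below and cite them inline.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )


hits = [
    Hit("https://a.example/vllm", "run llama 70b with vllm on one gpu"),
    Hit("https://b.example/cake", "how to bake a chocolate cake"),
    Hit("https://a.example/vllm", "run llama 70b with vllm on one gpu"),
]
query = "how to run llama 70b"
top = rerank(query, dedupe_by_url(hits), top_k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a real deployment the prompt would be sent to the LLM endpoint; here it is only printed so the packing step is visible.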

For teams building on top of these patterns, see retrieval-augmented generation infrastructure for a deeper breakdown of the retrieval side of the stack.

Here is how each component maps to VRAM and latency:

Component                        VRAM cost    Latency contribution
──────────────────────────────────────────────────────────────────
Query expansion (optional)       0-4 GB       50-200ms
SearXNG retrieval                0 (CPU)      50-300ms
BGE-M3 reranker                  ~0.8 GB      100-300ms
all-MiniLM-L6-v2 embedding       ~0.1 GB      10-50ms
Llama 3.3 70B synthesis (FP8)    ~38 GB       500ms-2s TTFT

Open-Source AI Search Stacks Compared

Project        Language      Search backend    LLM backend              Best for
──────────────────────────────────────────────────────────────────────────────────────────────────
Perplexica     TypeScript    SearXNG           Any OpenAI-compatible    Self-hosted, Docker-first
Morphic        Next.js       Tavily / Exa      OpenAI, Groq             Clean UI, managed search
MindSearch     Python        Bing API          Any                      Complex multi-step queries
OpenPerplex    Python        Google SerpAPI    Any                      API-first, headless

Perplexica is the focus here for three reasons. First, it ships a working docker-compose.yml that includes SearXNG as a sub-service, so you don't need a separate search backend setup. Second, it has a configurable backend URL, meaning you can point it at any OpenAI-compatible server including vLLM, Ollama, or LM Studio. Third, it doesn't depend on paid search APIs when running SearXNG locally, which matters for cost control and privacy. The community is also active and keeps the UI reasonably close to Perplexity's.

GPU Sizing: VRAM and Concurrency Math

This is the part most guides skip over. Here is the full VRAM budget for the production stack:

Component                    VRAM
─────────────────────────────────
Llama 3.3 70B (FP8)         ~38 GB
BGE-M3 reranker              ~0.8 GB
all-MiniLM-L6-v2             ~0.1 GB
KV cache (16 sessions        ~9 GB
  × 8k ctx × bf16)
Buffer                        ~4 GB
─────────────────────────────────
Total                        ~52 GB

An H100 SXM5 (80 GB) covers this with roughly 28 GB of headroom for burst KV cache during traffic spikes. The L40S (48 GB) only fits the stack with INT4 quantization on the LLM, or by moving the reranker and embeddings to CPU, which adds 100-300ms of latency per query. For multi-GPU deployments running the LLM in FP16 (twice the FP8 weight footprint), tensor parallelism across 2x H100 works well with vLLM's --tensor-parallel-size 2 flag.
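
The budget arithmetic can be checked with a short script. This sketch only adds up the estimates from the table above; the figures are the article's inputs, not measurements, and the variable names are illustrative.

```python
GPU_VRAM_GB = 80.0  # H100 SXM5

# Figures copied from the budget table above; estimates, not measurements.
budget_gb = {
    "Llama 3.3 70B (FP8)": 38.0,
    "BGE-M3 reranker": 0.8,
    "all-MiniLM-L6-v2": 0.1,
    "KV cache (16 sessions x 8k ctx)": 9.0,
    "Buffer": 4.0,
}

total_gb = sum(budget_gb.values())
headroom_gb = GPU_VRAM_GB - total_gb

for name, gb in budget_gb.items():
    print(f"{name:<34}{gb:>6.1f} GB")
print(f"{'Total':<34}{total_gb:>6.1f} GB")
print(f"{'Headroom on 80 GB':<34}{headroom_gb:>6.1f} GB")
```

Swapping in the numbers vLLM reports in its startup logs turns this into a quick check before adding a second model to the same card.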

For the full memory calculator across different quantization levels and context lengths, see the GPU memory calculator.

The deploy embeddings and rerankers on GPU guide covers running a dedicated TEI server alongside the main LLM. Serving those components from a separate process that shares the GPU can trim their VRAM cost further.

QPS math: At 2s average synthesis latency (TTFT + decode for a 400-token answer), one H100 SXM5 with continuous batching handles ~30 concurrent requests, which translates to roughly 15 QPS sustained with 100ms average queue time. For 1k QPS sustained, you'd need approximately 70 H100s or 35 B200s. Most real deployments sit in the 1-50 QPS range and operate fine on 1-2 GPUs.
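
The QPS estimate follows Little's law (throughput equals requests in flight divided by time per request). A minimal sketch, assuming the 30-request concurrency and 2s latency stated above and the roughly 2x B200 throughput quoted later in this post:

```python
import math


def sustained_qps(concurrent_requests, avg_latency_s):
    """Little's law: throughput = requests in flight / time each one takes."""
    return concurrent_requests / avg_latency_s


def gpus_needed(target_qps, qps_per_gpu):
    """Round up, since a fraction of a GPU cannot be provisioned."""
    return math.ceil(target_qps / qps_per_gpu)


h100_qps = sustained_qps(concurrent_requests=30, avg_latency_s=2.0)
h100_count = gpus_needed(1000, h100_qps)
b200_count = gpus_needed(1000, 2 * h100_qps)  # assumes ~2x H100 throughput

print(f"Per-H100 throughput: {h100_qps:.0f} QPS")
print(f"H100s for 1k QPS:    {h100_count}")
print(f"B200s for 1k QPS:    {b200_count}")
```

The exact results are 67 H100s and 34 B200s; the figures of 70 and 35 in the text are the same numbers rounded up for a little margin.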

With H100 SXM5 rental at $4.34/hr on-demand, you can run the complete stack on a single instance and shut it down between usage periods.

Step-by-Step Perplexica Deployment on Spheron

Step 1: Provision the GPU instance

Log in to app.spheron.ai, select H100 SXM5 (80 GB), and pick Ubuntu 22.04 with CUDA 12.4. Once the instance is up, SSH in and verify:

bash
nvidia-smi
# Should show H100 SXM5 with ~80 GB VRAM

If Docker is not pre-installed:

bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker

Step 2: Launch the vLLM backend

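Start vLLM with FP8 quantization. The --add-host flag maps host.docker.internal to the host gateway, which Docker on Linux does not do by default (Docker Desktop on macOS does). The container that actually relies on this mapping is the Perplexica backend, which gets the same entry through extra_hosts in Step 4: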

bash
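# Llama 3.3 is a gated repo on Hugging Face: accept the license, then
# export HF_TOKEN=<your token> before running (HF_TOKEN is an assumed variable name).
# The cache mount keeps downloaded weights across container restarts.
docker run --gpus all --ipc=host \
  --add-host=host.docker.internal:host-gateway \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.78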

The --gpu-memory-utilization 0.78 setting caps vLLM at about 62 GB of the 80 GB card, covering both the model weights and vLLM's own KV cache, and leaves ~18 GB for the reranker, embeddings, and any second model. On first launch, vLLM prints the actual VRAM allocation in its logs. The FP8 Llama 3.3 70B checkpoint typically loads at 38-40 GB depending on the distribution; adjust --gpu-memory-utilization if the model is slightly larger.

Verify the server is up:

bash
curl http://localhost:8000/v1/models
# Returns {"object":"list","data":[{"id":"meta-llama/Llama-3.3-70B-Instruct",...}]}
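
The same check can be scripted, for example as a readiness probe before starting Perplexica. This is a sketch using only the standard library; the function names are hypothetical, and the sample body mirrors the response shape shown above.

```python
import json
import urllib.request


def parse_model_ids(body):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000/v1", timeout_s=5.0):
    """Query the running server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))


# Offline example using the response shape shown above.
sample = '{"object":"list","data":[{"id":"meta-llama/Llama-3.3-70B-Instruct"}]}'
print(parse_model_ids(sample))
```

Calling `fetch_model_ids()` on the instance should return a list containing the model name passed to --model once vLLM has finished loading.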

For production multi-GPU configurations, load balancing, and FP8 calibration details, see the vLLM production deployment guide.

Step 3: Clone Perplexica and configure the vLLM endpoint

bash
git clone https://github.com/ItzCrazyKns/Perplexica.git
cd Perplexica
cp sample.config.toml config.toml

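Edit config.toml to point at your vLLM server. Key names have changed between Perplexica releases, so check them against the sample.config.toml you just copied: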

toml
[MODELS]
CHAT_PROVIDER = "custom_openai"
CHAT_MODEL = "meta-llama/Llama-3.3-70B-Instruct"
EMBEDDING_PROVIDER = "local"  # uses Ollama sidecar, or replace with custom endpoint

[MODELS.CUSTOM_OPENAI]
API_KEY = "none"  # vLLM doesn't need a real key
BASE_URL = "http://host.docker.internal:8000/v1"

Step 4: Start the full stack with Docker Compose

Here is the core docker-compose.yml. The Ollama sidecar handles embeddings; you can replace it with a dedicated TEI container if you want GPU-accelerated embeddings:

yaml
services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "8080:8080"
    volumes:
      - ./searxng:/etc/searxng
    environment:
      - SEARXNG_SETTINGS_PATH=/etc/searxng/settings.yml
    restart: unless-stopped

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  perplexica-backend:
    build:
      context: .
      dockerfile: backend.dockerfile
    ports:
      - "3001:3001"
    depends_on:
      - searxng
      - ollama
    environment:
      - CONFIG_PATH=/app/config.toml
    volumes:
      - ./config.toml:/app/config.toml
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped

  perplexica-frontend:
    build:
      context: .
      dockerfile: app.dockerfile
    ports:
      - "3000:3000"
    depends_on:
      - perplexica-backend
    restart: unless-stopped

volumes:
  ollama_data:

If you're running vLLM separately as in Step 2, you can drop the ollama service from the compose file or keep it only for embedding generation.

Start it:

bash
docker compose up -d
docker compose logs -f perplexica-backend  # watch for startup errors

Step 5: Verify the latency waterfall

Open Perplexica at http://localhost:3000 and run a test query. Watch the network tab in devtools or the backend logs for the full waterfall:

  • SearXNG retrieval: 50-200ms
  • Embedding + reranking: 100-300ms
  • vLLM synthesis (time to first token): 500ms-2s

Check that the LLM is on GPU, not CPU:

bash
watch -n1 nvidia-smi
# GPU memory used should sit near 62 GB (0.78 x 80 GB): vLLM preallocates its KV cache at startup
# GPU-Util should spike above 50% during the synthesis phase

If GPU-Util stays at 0% during synthesis, the vLLM container is not being used. Check that BASE_URL in config.toml points to the correct host.

SearXNG rate limits note: Default SearXNG hits external engines without API keys. At low query volumes this is fine. At higher throughput (50+ QPS), upstream engines start returning CAPTCHAs. Configure SearXNG with API keys for Google Programmable Search, Bing Search API, or Brave Search to avoid this in production.

Step 6: Add semantic caching

Deploy semantic caching in front of the vLLM endpoint to skip inference for repeated or similar queries. See semantic cache for LLM inference for the full GPTCache and Redis vector cache setup.

The short version: GPTCache wraps the vLLM API and intercepts calls. It embeds the incoming query, checks cosine similarity against a Redis vector index of recent queries, and returns a cached response if similarity exceeds the threshold (0.95 works well for search queries where exact meaning matters).
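
The lookup logic can be illustrated without any external services. The sketch below is not GPTCache's API: `SemanticCache` is a hypothetical in-memory stand-in for the Redis vector index, and the three-dimensional vectors are toy embeddings chosen only to show a hit and a miss at the 0.95 threshold.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory stand-in for a Redis vector index (illustrative only)."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_answer)

    def get(self, embedding):
        """Return the best cached answer at or above the threshold, else None."""
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))


cache = SemanticCache(threshold=0.95)
cache.put([0.9, 0.1, 0.0], "cached answer about GPUs")

near = cache.get([0.88, 0.12, 0.01])  # nearly the same direction: hit
far = cache.get([0.0, 0.2, 0.9])      # unrelated direction: miss
print(near, far)
```

A production cache would replace the linear scan with an approximate nearest-neighbor query and add a TTL so stale answers expire.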

Adding Multimodal Search with Qwen2.5-VL

The H100 SXM5 has roughly 28 GB of VRAM headroom after the main Llama 3.3 70B stack loads. That's enough to co-locate Qwen2.5-VL-7B with INT8 quantization (~7 GB) on the same GPU, giving you image understanding for queries that include screenshots, charts, or photos.

Start the VL model on port 8001 with INT8 quantization. Running FP16 without quantization is not viable here: FP16 weights alone occupy ~14 GB, leaving essentially no room for KV cache at 0.18 utilization, which causes vLLM to abort at startup with an insufficient-KV-cache error.

bash
docker run --gpus all --ipc=host \
  --add-host=host.docker.internal:host-gateway \
  -p 8001:8001 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --port 8001 \
  --quantization bitsandbytes \
  --gpu-memory-utilization 0.18

The --gpu-memory-utilization 0.18 allocates about 14.4 GB on the H100. With INT8 quantization, Qwen2.5-VL-7B weighs ~7 GB, leaving ~7 GB for KV cache. Combined utilization with the main model: 0.78 + 0.18 = 0.96, which fits with a small OS/driver buffer.
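
The split between the two vLLM instances is simple arithmetic on the utilization fractions. A short sketch, assuming an 80 GB card and the ~7 GB quantized weight size stated above:

```python
GPU_VRAM_GB = 80.0  # H100


def allocation_gb(utilization):
    """VRAM a vLLM instance may claim for weights plus its KV cache."""
    return GPU_VRAM_GB * utilization


llm_gb = allocation_gb(0.78)
vl_gb = allocation_gb(0.18)
vl_kv_cache_gb = vl_gb - 7.0                 # ~7 GB of quantized weights, per the text
unclaimed_gb = GPU_VRAM_GB - llm_gb - vl_gb  # left for reranker, embeddings, driver

print(f"Llama 3.3 70B instance: {llm_gb:.1f} GB")
print(f"Qwen2.5-VL instance:    {vl_gb:.1f} GB ({vl_kv_cache_gb:.1f} GB for KV cache)")
print(f"Unclaimed:              {unclaimed_gb:.1f} GB")
```

The unclaimed remainder is small, so the reranker and embedding model need to fit inside it or move to CPU.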

Concurrency caveat: At high query volume, running both models on one GPU causes KV cache contention. If you're consistently above 20 concurrent users, route visual queries to a second GPU. The colpali-multimodal-document-rag-gpu-cloud post covers multimodal retrieval patterns for deeper document-level image search.

In Perplexica, image search mode sends image-bearing queries to a configurable endpoint. Set IMAGE_MODEL_BASE_URL in your config to http://host.docker.internal:8001/v1 and IMAGE_MODEL = "Qwen/Qwen2.5-VL-7B-Instruct" to route those queries to the VL model.

Semantic Cache: Cut GPU Cost by 40-60%

Three caching tiers, each progressively more complex:

Tier                            Latency                  Cache hit rate    Complexity
─────────────────────────────────────────────────────────────────────────────────────
Exact-match (Redis hash)        <1ms                     5-15%             Low
Semantic (vector similarity)    5-20ms                   30-50%            Medium
SearXNG result cache            <5ms (retrieval only)    20-40%            Low

Exact-match cache: Hash the raw query string and check Redis. This catches identical repeats, which are common in search, but not rephrasings of the same question; those fall through to the semantic tier. Serves in under 1ms.
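
A minimal sketch of the exact-match tier follows. The dict stands in for Redis GET/SET, the function names are hypothetical, and the lowercase-and-collapse-whitespace normalization is an added assumption (the text hashes the raw string); it only widens the definition of "identical" slightly.

```python
import hashlib


def cache_key(query):
    """Lowercase and collapse whitespace before hashing, so trivial
    formatting differences still map to the same key (an assumption,
    not part of the original description)."""
    normalized = " ".join(query.lower().split())
    return "q:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


store = {}  # stand-in for Redis


def lookup(query):
    return store.get(cache_key(query))


def remember(query, answer):
    store[cache_key(query)] = answer


remember("What GPU for Llama 70B?", "answer-1")
print(lookup("what gpu  for llama 70b?"))    # same key after normalization
print(lookup("which GPU runs llama 3 70b"))  # different wording, so a miss
```

The second lookup misses because the wording differs, which is the gap the semantic tier is meant to cover.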

Semantic cache: Embed the incoming query, compare cosine similarity against the last N query embeddings in a Redis vector index. Threshold 0.93-0.97 depending on how much variation you tolerate in "similar" answers. At 0.95, a query like "what GPU for llama 70b" hits the cache for "which GPU runs llama 3 70b" and "best GPU for 70 billion parameter LLM".

SearXNG result cache: SearXNG has built-in Redis caching. Set cache.expire = 600 in searxng/settings.yml to cache retrieval results for 10 minutes. This skips the external API calls for repeated or similar queries, not just the LLM synthesis step.

At typical search distributions, 40-60% of queries within a 24-hour window are similar enough to hit a well-tuned semantic cache. Only the remaining 40-60% reach the GPU, so a single H100 serves roughly 1.7-2.5x as many queries per month for the same GPU hours.

Cost Benchmark: Self-Hosted vs Perplexity Pro API

At 100,000 queries per month:

Configuration                    Monthly GPU cost                  Cost per query    Notes
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Spheron H100 PCIe (on-demand)    ~$1,447/mo ($2.01/hr × 720 hr)    $0.0145           Most affordable Hopper 80 GB option
Spheron H100 SXM5 (on-demand)    ~$3,125/mo ($4.34/hr × 720 hr)    $0.031            Higher NVLink bandwidth for multi-GPU scale-out
Spheron B200 SXM6 (spot only)    ~$1,483/mo ($2.06/hr × 720 hr)    $0.0148           Spot-only; ~2x H100 FP8 throughput
Perplexity API (Pro tier)        ~$500-$2,000/mo                   $0.005-$0.02      Varies by model tier; rate limits apply
AWS p3.2xlarge (V100, 16 GB)     ~$2,203/mo                        N/A               V100 16 GB cannot fit Llama 3.3 70B FP8 (38 GB required)

With a 60% semantic cache hit rate, only 40% of queries reach the GPU, so the $1,447/month H100 PCIe serves roughly 250,000 queries per month instead of 100,000. The cost per query at that point is about $0.006, at the low end of the managed API range in the table above.
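
The per-query figures can be reproduced with the arithmetic below. It is a sketch under the article's assumptions: 720 billed hours per month, 100,000 queries as the uncached baseline, and a cache that removes a fixed fraction of queries from the GPU.

```python
HOURS_PER_MONTH = 720


def monthly_cost(hourly_rate_usd):
    return hourly_rate_usd * HOURS_PER_MONTH


def capacity_multiplier(cache_hit_rate):
    """If a fraction h of queries never reaches the GPU, the same GPU
    serves 1 / (1 - h) times as many total queries."""
    return 1.0 / (1.0 - cache_hit_rate)


h100_pcie = monthly_cost(2.01)
baseline_queries = 100_000

print(f"Monthly cost:          ${h100_pcie:,.0f}")
print(f"Cost/query, no cache:  ${h100_pcie / baseline_queries:.4f}")

for hit_rate in (0.4, 0.6):
    effective = baseline_queries * capacity_multiplier(hit_rate)
    print(
        f"Hit rate {hit_rate:.0%}: {effective:,.0f} queries, "
        f"${h100_pcie / effective:.4f} per query"
    )
```

The multiplier is 1 / (1 - hit rate), which is why a 60% hit rate gives 2.5x capacity while a 40% hit rate gives closer to 1.7x.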

The break-even point is around 50,000-70,000 queries per month: below that, Perplexity's $20/month Pro plan for personal use or the managed API is cheaper when accounting for engineering time. Above it, self-hosting wins on both cost and the ability to set your own rate limits, use custom models, and keep query data private.

For B200 SXM6 instances (Blackwell architecture), FP8 throughput is roughly 2x compared to H100, meaning a single B200 could handle the same query volume with half the synthesis latency. B200 SXM6 is currently available as spot-only on Spheron at $2.06/hr ($1,483/mo), making it cost-competitive with H100 PCIe while delivering significantly higher throughput for workloads at 500+ QPS.

Pricing fluctuates based on GPU availability. The prices above are based on 07 May 2026 and may have changed. Check current GPU pricing → for live rates.

Production Hardening

Four areas that matter for a deployment serving real users.

Rate limiting

Put nginx in front of Perplexica. A simple per-IP rate limit prevents abuse:

nginx
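# limit_req_zone belongs in the http {} context
limit_req_zone $binary_remote_addr zone=search:10m rate=5r/s;

server {
    location / {
        limit_req zone=search burst=20 nodelay;
        limit_req_status 429;  # nginx answers 503 for rejected requests unless overridden
        proxy_pass http://localhost:3000;
    }
}

This allows a sustained 5 requests per second per IP, absorbs bursts of up to 20 extra requests, and rejects anything beyond that with a 429.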
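
To see how rate and burst interact, here is a simplified model of a leaky-bucket limiter with a burst allowance and no delay. It is an approximation for building intuition, not nginx's implementation, and the class name is hypothetical.

```python
class BurstLimiter:
    """Simplified leaky-bucket model: excess drains at `rate` per second,
    and a request is rejected once the excess would exceed `burst`."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.burst = burst
        self.excess = 0.0
        self.last = None

    def allow(self, now_s):
        if self.last is None:
            # The first request from a client is always accepted.
            self.last = now_s
            return True
        drained = max(0.0, self.excess - self.rate * (now_s - self.last))
        candidate = drained + 1.0
        if candidate > self.burst:
            return False  # rejected requests do not change the state
        self.excess = candidate
        self.last = now_s
        return True


limiter = BurstLimiter(rate_per_s=5, burst=20)
flood = sum(limiter.allow(0.0) for _ in range(30))        # 30 requests at once
one_sec_later = sum(limiter.allow(1.0) for _ in range(10))
print(flood, one_sec_later)
```

In this model an instantaneous flood of 30 requests gets 21 through (the first plus 20 of burst), and one second later only 5 more are accepted, matching the 5r/s drain rate.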

Authentication

For multi-user or team access, add Authelia as a Docker Compose service in front of Perplexica:

yaml
services:
  authelia:
    image: authelia/authelia:latest
    volumes:
      - ./authelia:/config
    ports:
      - "9091:9091"
    restart: unless-stopped

For simpler single-team setups, nginx basic auth with a hashed password file is sufficient.

Search quality evaluation

Run a weekly eval loop against 100 golden queries where you know the expected answer. RAGAS measures faithfulness (does the answer match the retrieved context?) and context recall (did retrieval surface the right documents?). A drop in either metric usually means SearXNG is returning low-quality results for that query cluster or the reranker threshold needs adjustment.

Observability

Wire vLLM's Prometheus metrics to Grafana. Add to your docker-compose.yml:

yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3030:3000"
    depends_on:
      - prometheus

vLLM exposes /metrics on the same port as the API by default. The key metrics to watch: vllm:num_requests_running, vllm:gpu_cache_usage_perc, and vllm:time_to_first_token_seconds. For a complete stack including query tracing and hallucination detection, see the LLM observability stack guide covering Langfuse, Arize, and Phoenix.
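
For a quick check without Grafana, the gauge metrics can be read straight from the /metrics text. The parser below is a sketch that assumes one sample per line with no trailing timestamp; the sample text and its values are made up for illustration, and the histogram behind vllm:time_to_first_token_seconds is left to Prometheus.

```python
def parse_metrics(text, wanted):
    """Return {metric_name: value} for the wanted gauge-style metrics.
    Labels in {...} are ignored; if a metric appears more than once
    (for example once per model), the last sample wins."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            values[name] = float(value)
    return values


# Illustrative sample; the values are invented.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.3-70B-Instruct"} 12.0
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.3-70B-Instruct"} 0.37
"""

metrics = parse_metrics(sample, {"vllm:num_requests_running", "vllm:gpu_cache_usage_perc"})
print(metrics)
```

Feeding it the body of http://localhost:8000/metrics gives a point-in-time view of running requests and KV cache pressure.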

When to Self-Host vs Use the Perplexity API

Factor                   Self-host                     Managed API
──────────────────────────────────────────────────────────────────────────────
Query volume             Above 50k/month               Below 50k/month
Privacy                  Queries stay on your infra    Queries go to Perplexity
Custom models            Any model you want            What Perplexity offers
Team size                Has DevOps capacity           No infra team
Latency SLA              Tunable (can be lower)        Fixed by the provider
Uptime responsibility    You own it                    Perplexity owns it
Setup time               2-4 hours                     10 minutes

The managed API wins clearly at low volume and when you have no infrastructure team. It is also the right call during a prototype phase before you know your query volume. Self-hosting becomes the better option when you have consistent query volume above the break-even point, need to use custom fine-tuned models, or have compliance requirements that prohibit sending user queries to third-party APIs.


Self-hosting a Perplexica-style search engine gives you full query privacy, custom model control, and a lower cost per query at scale. Spheron H100 SXM5 instances fit the complete Llama 3.3 70B + reranker + embedding stack in a single 80 GB GPU, with H100 PCIe on-demand starting at $2.01/hr.

Quick Setup Guide

  1. Size GPU memory for the full AI search stack

    List every component: embedding model (~1 GB), BGE-M3 reranker (~1 GB), Llama 3.3 70B in FP8 (~38 GB), and KV cache buffer (8-12 GB for 16 concurrent queries at 8k context). Total: ~50 GB. An H100 SXM5 (80 GB) fits comfortably; an H100 PCIe (80 GB) also works but with lower NVLink bandwidth for multi-GPU scale-out.

  2. Provision a GPU instance on Spheron

    Log in to app.spheron.ai, select H100 SXM5 (80 GB), choose Ubuntu 22.04 + CUDA 12.4, and deploy. SSH in and verify CUDA with: nvidia-smi. Install Docker and Docker Compose if not pre-installed: curl -fsSL https://get.docker.com | sh && sudo usermod -aG docker $USER.

  3. Install vLLM and start the Llama 3.3 70B backend

    Pull and launch vLLM with FP8 quantization: docker run --gpus all --ipc=host --add-host=host.docker.internal:host-gateway -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.3-70B-Instruct --quantization fp8 --max-model-len 8192 --gpu-memory-utilization 0.78. The --add-host flag is required on Linux to make host.docker.internal resolve correctly inside containers. This exposes an OpenAI-compatible API on port 8000. Verify with: curl http://localhost:8000/v1/models.

  4. Clone Perplexica and configure the vLLM endpoint

    Clone the Perplexica repo, copy the sample config: git clone https://github.com/ItzCrazyKns/Perplexica.git && cd Perplexica && cp sample.config.toml config.toml. Edit config.toml: under [MODELS], set CHAT_PROVIDER = "custom_openai"; under [MODELS.CUSTOM_OPENAI], set BASE_URL = "http://host.docker.internal:8000/v1" and API_KEY = "none". Run docker compose up -d to start the SearXNG meta-search backend, the Perplexica app, and the Ollama sidecar (which you can disable if using vLLM for all inference).

  5. Run a test query and verify end-to-end latency

    Open Perplexica at http://localhost:3000, run a test query, and observe the waterfall: SearXNG retrieval (50-200ms), embedding + reranking (100-300ms), vLLM synthesis (500ms-2s for first token). Check GPU utilization with watch -n1 nvidia-smi to confirm the LLM is loading to GPU, not CPU.

  6. Add semantic caching to cut repeat-query GPU cost

    Deploy GPTCache or Redis + a sentence-transformer similarity index in front of the vLLM endpoint. Configure a cosine similarity threshold of 0.95 for cache hits. Repeated queries (same topic, rephrasings) serve from cache in <5ms and skip LLM inference entirely. At typical query distributions, 40-60% of queries hit the cache within 24 hours, directly cutting GPU hours per query.

Frequently Asked Questions

How much GPU VRAM does a self-hosted AI search stack need?

A production stack with Llama 3.3 70B (FP8 quantized), a BGE-M3 reranker, and a text embedding model needs roughly 48-52 GB of GPU VRAM. An H100 SXM5 (80 GB) comfortably fits the full stack with room for KV cache. For higher concurrency or FP16 weights, two H100s let you run the LLM across both cards while keeping embeddings on a third card or CPU.

How does the cost of self-hosting compare with Perplexity?

At 100,000 queries per month, a dedicated H100 PCIe instance on Spheron costs roughly $1,447/month at $2.01/hr and handles the full stack (LLM + reranker + embeddings). The H100 SXM5 at $4.34/hr runs $3,125/mo and delivers higher bandwidth for concurrent queries. Perplexity Pro is $20/month for personal use, but the equivalent API access for 100k developer queries runs $500-$2,000/month depending on model tier and document count. The break-even is around 50,000-70,000 queries/month, after which self-hosting wins on cost and gives you full privacy, no rate limits, and custom model control. Prices fluctuate based on GPU availability.

How do Perplexica, Morphic, MindSearch, and OpenPerplex differ?

Perplexica is a Docker-first Perplexity clone that wires SearXNG to any OpenAI-compatible LLM backend. Morphic is a Next.js web app with a cleaner UI and built-in Tavily/Exa search integration. MindSearch is a research prototype from Shanghai AI Lab's InternLM team that uses multi-agent graph reasoning for complex queries. OpenPerplex is a Python-native API-first option. Perplexica is the easiest to self-host on bare GPU because it already ships a Docker Compose file with a configurable backend URL.

Can I use Ollama instead of vLLM?

Yes. Perplexica accepts any OpenAI-compatible endpoint, so Ollama works fine for single-user or low-concurrency setups. vLLM is a better choice for production because it supports continuous batching and PagedAttention, which lets it serve dozens of simultaneous search queries without queue stalls. At 100+ QPS, Ollama saturates a single GPU much faster than vLLM.

Is a self-hosted AI search engine private?

Yes, with an important nuance: the LLM and reranker run entirely on your own GPU instance, so queries and results never leave your infrastructure. However, SearXNG still sends search queries to external engines (Google, Bing, DuckDuckGo) unless you configure it with only local or private sources. The LLM synthesis step is fully private; the web retrieval step is only private if you control the upstream search backend.
