Deploy Aphrodite Engine on GPU Cloud: EXL2, GGUF, GPTQ Serving and LoRA Hot-Swap (2026 Guide)

Aphrodite Engine is the serving framework PygmalionAI maintains as a community-focused fork of vLLM. Where vLLM tracks the mainstream production use case (BF16, AWQ, FP8 on recent NVIDIA hardware), Aphrodite extends the envelope toward community-quantized checkpoints: EXL2 via ExLlamaV2 kernels, GGUF via llama.cpp kernels, and the full GPTQ-Marlin fast-path. It also ships advanced samplers (DRY, XTC, min-p, mirostat) and a KoboldAI-compatible API that tools like SillyTavern and Kobold Lite expect. If you've benchmarked vLLM, TensorRT-LLM, and SGLang for mainstream production serving and still have a pile of EXL2 or GPTQ checkpoints from Hugging Face, Aphrodite is the fourth framework worth evaluating.

TL;DR

Engine	Quantization Formats	Samplers	API Surfaces	License	Best For
Aphrodite Engine	EXL2, GGUF, GPTQ, AWQ, FP8, Marlin	DRY, XTC, min-p, mirostat, eta-cutoff	OpenAI + KoboldAI	AGPL-3.0	Community quants, chat/roleplay, mixed GPU fleets
vLLM	AWQ, GPTQ, FP8, Marlin, BF16	Top-p, top-k, temperature	OpenAI	Apache 2.0	General production, broadest model support
SGLang	AWQ, FP8, BF16	Top-p, top-k	OpenAI	Apache 2.0	Shared-prefix workloads, RAG, low latency
LMDeploy	AWQ, MXFP4, BF16	Top-p, top-k	OpenAI	Apache 2.0	InternLM, DeepSeek, mixed-precision batching

What Aphrodite Engine Is and How It Differs From vLLM

Aphrodite Engine started as a fork of vLLM in late 2023, maintained by PygmalionAI, the team behind the Pygmalion character AI models. The fork happened because vLLM's roadmap was converging on mainstream cloud serving (large batches, FP8, tensor parallelism) rather than on the community of people running quantized models on single-GPU or small-cluster setups for chat and roleplay applications.

The core batching engine (PagedAttention, continuous batching) is essentially the same as vLLM. The divergence is in the quantization backends and sampling layer.

EXL2 (ExLlamaV2): vLLM has no EXL2 support. Aphrodite bundles ExLlamaV2 kernels, which means EXL2 checkpoints from the community (heavily used on Hugging Face for quantized Llama-3 and Mistral derivatives) run natively without converting to a different format first.

GGUF: Aphrodite serves GGUF via llama.cpp kernels embedded in the server process. This covers standard K-quant methods (Q4_K_M, Q5_K_M, Q8_0). One caveat: this is not a full llama.cpp replacement. Coverage of mixed-precision K-quant variants is more limited than in a native llama.cpp server. For GGUF-centric workflows on GPU cloud, the GGUF quantization deployment guide covers the llama.cpp server path in more depth.

GPTQ-Marlin: Aphrodite auto-detects Marlin-compatible GPTQ checkpoints and activates the fast-path automatically. The same GPTQ checkpoint that ran on AutoGPTQ can run on Aphrodite with Marlin kernels without any re-quantization.

Advanced samplers: Standard vLLM exposes temperature, top-p, top-k, min-p, and repetition_penalty. Aphrodite keeps min-p (a minimum probability filter often used as a cleaner alternative to top-p) and adds DRY (a repetition penalty based on repeated sequences rather than individual tokens), XTC (which removes high-probability tokens with a configurable probability to encourage diversity), mirostat (entropy-based dynamic sampling), and eta-cutoff. These matter for chat and roleplay products where output quality is sensitive to repetition artifacts.

KoboldAI API: When launched with the --launch-kobold-api flag, Aphrodite exposes a KoboldAI-compatible endpoint on the same port 2242 at /api/v1/generate. SillyTavern, Kobold Lite, and Oobabooga UI all target KoboldAI endpoints natively.

License: Aphrodite Engine is AGPL-3.0. vLLM, SGLang, and LMDeploy are Apache 2.0. AGPL-3.0 adds a network copyleft clause: if you modify Aphrodite and let users interact with it over a network, you must offer those users the corresponding source of your modified version. How far that obligation extends into the rest of a service depends on how the service is combined with the engine, which is a legal question rather than a technical one. For internal tooling, research, and personal deployments this is rarely an issue. For commercial SaaS products, consult your legal team before using Aphrodite as the inference backend.

For the vLLM production deployment path, see vLLM Multi-GPU Production Deployment 2026. For SGLang, see the SGLang production deployment guide.

When to Choose Aphrodite Engine

Use Aphrodite when one or more of these applies:

You have a library of community-quantized checkpoints in EXL2 or TheBloke-style GPTQ format from Hugging Face. Aphrodite runs them as-is; vLLM would require conversion or a separate EXL2 server.
Your product does chat or roleplay where advanced samplers reduce repetition and improve output naturalness. DRY and XTC have a measurable effect on long-form chat generations.
You need KoboldAI-compatible clients (SillyTavern, Kobold Lite) alongside an OpenAI-compatible API.
Your GPU fleet is mixed or older. Aphrodite supports Pascal (GTX 1080, P100) and newer. This is a wider range than most serving frameworks, which typically require Volta or Ampere minimum.
You need per-request LoRA hot-swapping with quantized base models. Aphrodite's LoRA support is compatible with its quantized backends (GPTQ, EXL2), a combination vLLM cannot offer for EXL2 because it has no EXL2 backend.
You want budget inference on smaller GPUs. GPTQ and EXL2 allow 70B models to run on 24-48GB cards, depending on the bit-width. Pair this with GPU cloud pricing to find the right cost point.

For GPU hardware selection and inference sizing, see Best GPUs for AI Inference 2026.

Do not use Aphrodite when:

You need Apache 2.0 licensing for a commercial service. Use vLLM or SGLang instead.
You're running mainstream BF16 or FP8 at high concurrency. vLLM's MRV2 or TensorRT-LLM will outperform Aphrodite at those workloads.
You need multi-node Tensor Parallelism beyond 8 GPUs. Aphrodite's focus is single-node.

Quantization Matrix: EXL2, GGUF, GPTQ, AWQ, Marlin, FP8

Format	VRAM vs BF16	Quality vs BF16	GPU Requirement	Aphrodite Flag	Best For
EXL2	~25% (4-bit) to ~50% (8-bit)	Good to Near-lossless	CUDA GPU	`--quantization exl2`	Community checkpoints, per-layer mixed precision
GGUF	~25-50% (K-quants)	Good (Q4_K_M), Near-lossless (Q8_0)	Optional (CPU/GPU)	`--quantization gguf`	CPU fallback, single-file distribution
GPTQ	~25%	Good	CUDA GPU	`--quantization gptq`	Legacy TheBloke-format checkpoints
AWQ	~25%	Good	CUDA GPU	`--quantization awq`	GPU-optimized INT4, better quality than GPTQ
Marlin	~25%	Good	Ampere+	Auto-detected	GPTQ checkpoints with Marlin fast-path
FP8	~50%	Near-lossless	Hopper+ (H100, H200)	`--quantization fp8`	Production throughput on modern hardware

EXL2 is the format ExLlamaV2 produces via quantization calibration. It assigns different bit-widths per layer based on measured sensitivity, similar in concept to GPTQ but with ExLlamaV2's own calibration approach. Aphrodite bundles ExLlamaV2 in the package, so no separate install is needed. The community has produced EXL2 variants of most popular models in 2-bit through 8-bit configurations. For a 70B model, EXL2 at 4.0 bpw typically requires 34-38GB VRAM.
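
The 34-38GB figure follows from simple arithmetic: weight memory is roughly the parameter count times bits per weight divided by eight, plus some runtime overhead. A minimal sketch of that estimate follows; the function name and the 5% overhead default are illustrative assumptions, not values taken from Aphrodite, and the KV cache is not included.

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float,
                            overhead_fraction: float = 0.05) -> float:
    """Rough VRAM needed for model weights, in GB (1 GB = 1e9 bytes).

    overhead_fraction is an assumed allowance for quantization scales and
    runtime buffers. It does not account for the KV cache.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_fraction) / 1e9


if __name__ == "__main__":
    # Raw weights for a 70B model at 4.0 bits per weight.
    print(estimate_weight_vram_gb(70, 4.0, overhead_fraction=0.0))
    # Same model with the assumed 5% overhead.
    print(estimate_weight_vram_gb(70, 4.0))
```

Whatever VRAM remains after the weights is what is available for the KV cache, which is why a 70B 4-bit model is comfortable on an 80GB card but tight on 48GB.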

GGUF runs via llama.cpp kernels inside Aphrodite. The advantage over a standalone llama.cpp server is the shared process: you get Aphrodite's batching and LoRA support alongside GGUF models. The tradeoff is less complete K-quant coverage than native llama.cpp. Q4_K_M, Q5_K_M, and Q8_0 are reliable; exotic K-quant variants may not be. For workflows centered purely on GGUF, a dedicated llama.cpp server is still the recommended path.

GPTQ-Marlin is the standard path for TheBloke-style GPTQ checkpoints. Aphrodite auto-detects whether a GPTQ checkpoint is Marlin-compatible (Ampere+ GPUs, group_size divisible by 128) and activates the fast-path. On older GPUs or non-Marlin-compatible checkpoints, it falls back to standard GPTQ kernels. For the full AWQ vs GPTQ decision framework, see the AWQ quantization guide.

FP8 on H100 and H200 uses Hopper Tensor Cores for hardware-accelerated 8-bit float matrix ops. For the hardware specifics and VRAM math behind FP8, see the FP8 quantization and inference performance guide.

Hardware Sizing on Spheron GPU Cloud

Aphrodite's broad quantization support lets you match the right GPU to the right model size and format. These prices are from the Spheron API on 04 Jul 2026.

GPU	VRAM	On-demand ($/hr)	Spot ($/hr)	Aphrodite Use Case
H100 SXM5 on Spheron	80GB	$2.54	$1.43	70B BF16, 70B FP8, multi-adapter LoRA at scale
H200 SXM5 on Spheron	141GB	$3.70	$1.82	70B BF16 long-context, 140B EXL2, large KV cache
A100 80GB SXM4 on Spheron	80GB	$1.69	$0.79	7-13B BF16, 70B GPTQ/EXL2, multi-adapter LoRA

Pricing fluctuates based on GPU availability. The prices above are based on 04 Jul 2026 and may have changed. Check current GPU pricing for live rates.

L40S, RTX 4090, and RTX 5090 have no on-demand offers currently listed in the Spheron API. Check Spheron GPU pricing for live availability, including spot options for those models as they become available.

For 7-13B models in GPTQ or EXL2 format, the A100 80GB is the cost-efficient choice at $1.69/hr on-demand, with enough VRAM headroom for long contexts and adapter caches. For 70B inference at FP8 or EXL2 4-bit, the H100 SXM5's HBM3 bandwidth (3.35 TB/s) reduces decode latency significantly. H200 SXM5 is the right call for very long contexts or models that won't fit in 80GB even quantized.

Install and Dependency Setup

pip install

bash

pip install aphrodite-engine

CUDA requirements:

CUDA 11.8 minimum for most backends
CUDA 12.4+ required for FP8 (Hopper Tensor Core path)
ExLlamaV2 kernels are bundled in aphrodite-engine; no separate install needed

Verify the install:

bash

python -c "import aphrodite; print(aphrodite.__version__)"

Docker

bash

docker pull alpindale/aphrodite-engine:latest

docker run --gpus all -p 2242:2242 \
  alpindale/aphrodite-engine:latest \
  aphrodite run <model-id> --launch-kobold-api

The Docker image bundles ExLlamaV2 and llama.cpp kernels. If you're building from source, both are compiled as part of the Aphrodite build process and do not require a separate installation step.

Deploying Models: EXL2, GGUF, GPTQ, and AWQ

EXL2

EXL2 checkpoints are typically hosted on Hugging Face with names ending in -EXL2 or with exl2 in the repo description. Point Aphrodite at a local path or a Hugging Face model ID:

bash

aphrodite run turboderp/Llama-3.3-70B-Instruct-4.0bpw-EXL2 \
  --quantization exl2 \
  --port 2242

ExLlamaV2 kernels handle the weight loading and decode. No separate exllamav2 package install is needed.

GGUF

bash

aphrodite run ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --quantization gguf \
  --tokenizer meta-llama/Llama-3.3-70B-Instruct \
  --port 2242

The --tokenizer flag points to the Hugging Face repo for the tokenizer, since the GGUF file may not include a tokenizer Aphrodite can parse directly. For Q4_K_M at 70B, expect 38-42GB VRAM usage.

GPTQ (with Marlin auto-detect)

bash

aphrodite run TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq \
  --port 2242

On Ampere+ GPUs with a compatible GPTQ checkpoint (group_size 128, Ampere-ready quantization), Aphrodite activates the Marlin fast-path automatically. No separate flag is needed. You can confirm Marlin is active by checking the startup logs for Using Marlin kernels.

AWQ

bash

aphrodite run casperhansen/llama-3.3-70b-instruct-awq \
  --quantization awq \
  --port 2242

AWQ typically gives slightly better perplexity than GPTQ at the same 4-bit compression, though the gap depends on the model and calibration data.

OpenAI and KoboldAI API Endpoints

Aphrodite serves the OpenAI-compatible API on port 2242 by default. The KoboldAI-compatible API requires the --launch-kobold-api flag and is served on the same port 2242:

Endpoint	Port	Enabled By	Protocol
OpenAI-compatible	2242	Default	REST
KoboldAI-compatible	2242	`--launch-kobold-api`	REST

This differs from vLLM's default (port 8000). Update any firewall rules or port mappings accordingly.

OpenAI API

bash

curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Llama-2-70B-Chat-GPTQ",
    "messages": [{"role": "user", "content": "Explain paged attention in one paragraph."}],
    "temperature": 0.7
  }'

KoboldAI API

bash

curl http://localhost:2242/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain paged attention in one paragraph.",
    "max_length": 200,
    "temperature": 0.7
  }'
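
The same KoboldAI call can be made from Python with only the standard library. This is a sketch: the helper names are hypothetical, and the response shape (a results list whose first item carries a text field) is an assumption based on the KoboldAI API convention, so verify it against your Aphrodite version.

```python
import json
import urllib.request

# Requires the server to be started with --launch-kobold-api.
KOBOLD_URL = "http://localhost:2242/api/v1/generate"


def build_kobold_request(prompt: str, max_length: int = 200,
                         temperature: float = 0.7,
                         url: str = KOBOLD_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    body = {"prompt": prompt, "max_length": max_length, "temperature": temperature}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_json: dict) -> str:
    """Pull the generated text out of a KoboldAI-style response (assumed shape)."""
    return response_json["results"][0]["text"]


def generate(prompt: str) -> str:
    """Send the request to a running server and return the completion."""
    with urllib.request.urlopen(build_kobold_request(prompt), timeout=120) as resp:
        return extract_text(json.load(resp))
```

Calling generate("Explain paged attention in one paragraph.") against a running server returns the completion text; the two builder functions can be unit-tested without a server.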

Python client (OpenAI library)

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:2242/v1",
    api_key="none",  # Aphrodite does not require an API key by default
)

response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-Chat-GPTQ",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

The api_key value is ignored by Aphrodite but required by the OpenAI client library. Set it to any non-empty string.

Per-Request LoRA Hot-Swapping

Aphrodite's per-request LoRA hot-swap works the same way as vLLM's, with the key addition that it is compatible with quantized base models (GPTQ, EXL2). For teams serving community-quantized base models with per-customer fine-tuned adapters, this means you do not need a separate unquantized serving stack. For the full multi-adapter production pattern, see LoRA Multi-Adapter Serving.

Start the server with LoRA support

bash

aphrodite run TheBloke/Llama-2-13B-Chat-GPTQ \
  --quantization gptq \
  --enable-lora \
  --lora-modules \
    customer-a=/adapters/customer_a \
    customer-b=/adapters/customer_b \
  --max-loras 4 \
  --port 2242

--lora-modules registers adapters at startup as alias=path pairs. --max-loras sets how many adapters stay resident in GPU memory simultaneously.
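
When adapters are managed in code, the launch command can be assembled from a mapping of aliases to paths. A small sketch using only the flags shown above; the helper name and its defaults are hypothetical.

```python
import shlex
from typing import Dict, List


def build_lora_launch_args(base_model: str, adapters: Dict[str, str],
                           max_loras: int, quantization: str = "gptq",
                           port: int = 2242) -> List[str]:
    """Return argv for `aphrodite run` with one alias=path pair per adapter."""
    if max_loras < 1:
        raise ValueError("max_loras must be at least 1")
    args = ["aphrodite", "run", base_model,
            "--quantization", quantization,
            "--enable-lora", "--lora-modules"]
    # Each adapter becomes one alias=path token after --lora-modules.
    args += [f"{alias}={path}" for alias, path in adapters.items()]
    args += ["--max-loras", str(max_loras), "--port", str(port)]
    return args


if __name__ == "__main__":
    argv = build_lora_launch_args(
        "TheBloke/Llama-2-13B-Chat-GPTQ",
        {"customer-a": "/adapters/customer_a", "customer-b": "/adapters/customer_b"},
        max_loras=4,
    )
    # Paste the joined string into a shell, or pass argv to subprocess.run.
    print(shlex.join(argv))
```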

Select adapter per request

python

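# Reuses the `client` object created in the Python client example above.
response = client.chat.completions.create(
    model="customer-a",  # alias registered via --lora-modules
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)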

The model field in the request maps to the adapter alias. Requests without a matching alias fall back to the base model. Adapter switching is sub-millisecond for cached adapters and tens of milliseconds for a CPU-resident adapter load.
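
On the client side, per-customer routing reduces to choosing the model field. A sketch, assuming the application keeps its own set of registered aliases (the names below are hypothetical) and sends the base model ID explicitly when a customer has no adapter, rather than relying on server-side fallback:

```python
BASE_MODEL = "TheBloke/Llama-2-13B-Chat-GPTQ"
# Must mirror the aliases passed to --lora-modules at server startup.
REGISTERED_ALIASES = frozenset({"customer-a", "customer-b"})


def model_for_customer(customer_id: str,
                       aliases: frozenset = REGISTERED_ALIASES,
                       base_model: str = BASE_MODEL) -> str:
    """Return the adapter alias for a customer, or the base model if none exists."""
    alias = f"customer-{customer_id}"
    return alias if alias in aliases else base_model


def chat_payload(customer_id: str, user_message: str) -> dict:
    """Build a /v1/chat/completions body routed to the customer's adapter."""
    return {
        "model": model_for_customer(customer_id),
        "messages": [{"role": "user", "content": user_message}],
    }
```

Keeping the alias set in application code makes an unregistered customer an explicit, testable case instead of a runtime surprise.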

For teams already using quantized backends like GGUF or GPTQ, Aphrodite Engine's per-request LoRA hot-swap pairs natively with those quantized backends without requiring a separate full-precision serving stack.

Advanced Samplers: DRY, XTC, min-p

These samplers are passed as extra fields in the JSON request body alongside standard parameters. DRY and XTC are Aphrodite extensions not present in upstream vLLM; min-p is also available upstream as min_p.

DRY (Don't Repeat Yourself)

DRY penalizes sequences of tokens that have already appeared in the context, not individual tokens. This is more effective than repetition_penalty for long-form chat where repetitive sentence structures are the problem rather than repeated individual words.

json

{
  "model": "TheBloke/Llama-2-70B-Chat-GPTQ",
  "messages": [{"role": "user", "content": "Tell me a long story."}],
  "temperature": 0.8,
  "dry_multiplier": 0.8,
  "dry_base": 1.75,
  "dry_allowed_length": 2
}

dry_multiplier controls penalty strength. dry_base sets the exponential base for longer matches (longer repeated sequences get penalized more). dry_allowed_length sets the minimum match length to trigger the penalty.
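
How the three parameters interact is easiest to see numerically. The sketch below follows the penalty formula from the original DRY sampler proposal, multiplier * base ** (match_length - allowed_length). That Aphrodite uses exactly this form is an assumption here, so treat the numbers as illustrative.

```python
def dry_penalty(match_length: int, multiplier: float = 0.8,
                base: float = 1.75, allowed_length: int = 2) -> float:
    """Logit penalty for a token that would extend a repeated sequence.

    match_length is how many preceding tokens already match an earlier
    occurrence in the context. Matches shorter than allowed_length are free.
    """
    if match_length < allowed_length:
        return 0.0
    return multiplier * base ** (match_length - allowed_length)


if __name__ == "__main__":
    # The penalty grows exponentially with the length of the repeated run.
    for n in range(1, 7):
        print(n, round(dry_penalty(n), 3))
```

The exponential growth is the point: a two-token echo costs little, while a repeated sentence is penalized heavily enough to be effectively ruled out.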

XTC (eXclude Top Choices)

XTC looks at the tokens whose probability exceeds a threshold and, when it fires, removes all of them except the least likely one; xtc_probability controls how often it fires. This encourages diversity by forcing the model away from its highest-confidence predictions while still leaving a viable candidate.

json

{
  "temperature": 0.9,
  "xtc_probability": 0.5,
  "xtc_threshold": 0.1
}

xtc_probability is the chance that token exclusion fires for any given sampling step. xtc_threshold is the minimum probability a token must have to be eligible for exclusion. Useful for creative writing where top-probability continuations tend toward generic phrasing.
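
A toy version of the exclusion step makes the two parameters concrete. This sketch follows XTC as originally described (when it fires, keep only the least likely of the tokens at or above the threshold, then renormalize); the function name is hypothetical and Aphrodite's implementation may differ in detail.

```python
import random
from typing import Dict, Optional


def xtc_filter(probs: Dict[str, float], xtc_probability: float,
               xtc_threshold: float,
               rng: Optional[random.Random] = None) -> Dict[str, float]:
    """Apply one XTC step to a token -> probability mapping."""
    rng = rng or random.Random()
    if rng.random() >= xtc_probability:
        return dict(probs)  # exclusion did not fire on this step
    above = [tok for tok, p in probs.items() if p >= xtc_threshold]
    if len(above) < 2:
        return dict(probs)  # need at least two top choices to drop any
    keep = min(above, key=lambda tok: probs[tok])  # least likely top choice
    kept = {tok: p for tok, p in probs.items() if tok == keep or tok not in above}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


if __name__ == "__main__":
    step = {"the": 0.5, "a": 0.3, "one": 0.15, "zebra": 0.05}
    # With xtc_probability=1.0 the exclusion always fires.
    print(xtc_filter(step, xtc_probability=1.0, xtc_threshold=0.1))
```

In the example, "the" and "a" are dropped while "one" survives as the least likely token above the threshold, and the low-probability tail is untouched.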

min-p

min-p sets a floor on token probability relative to the top token's probability, rather than the cumulative-probability cutoff that top-p uses. This adapts the effective vocabulary size to the model's confidence at each step.

json

{
  "temperature": 0.8,
  "min_p": 0.05
}

A min_p of 0.05 keeps any token whose probability is at least 5% of the top token's probability. At high-confidence steps, this is strict; at uncertain steps, more tokens survive. Many users find min-p produces more natural outputs than top-p at the same temperature.
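
The adaptive behavior can be checked with a few lines of Python. This is an illustrative sketch of the filter as described above, not Aphrodite's code; the function name is hypothetical.

```python
from typing import List


def min_p_filter(probs: List[float], min_p: float) -> List[float]:
    """Zero out tokens below min_p * max(probs), then renormalize."""
    cutoff = min_p * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def surviving(probs: List[float], min_p: float) -> int:
    """Count how many tokens remain sampleable after the filter."""
    return sum(1 for p in min_p_filter(probs, min_p) if p > 0.0)


if __name__ == "__main__":
    confident = [0.90, 0.05, 0.03, 0.02]       # model is sure of the next token
    uncertain = [0.30, 0.25, 0.20, 0.15, 0.10]  # model is unsure
    print(surviving(confident, 0.05), surviving(uncertain, 0.05))
```

With the confident distribution the cutoff is 0.045, so only two tokens survive; with the uncertain one the cutoff drops to 0.015 and all five remain.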

Combined sampler example

json

{
  "model": "TheBloke/Llama-2-70B-Chat-GPTQ",
  "messages": [{"role": "user", "content": "Write a scene."}],
  "temperature": 0.85,
  "min_p": 0.05,
  "dry_multiplier": 0.8,
  "dry_base": 1.75,
  "dry_allowed_length": 2,
  "xtc_probability": 0.3,
  "xtc_threshold": 0.1
}

Samplers stack. DRY runs first on the logits, then XTC exclusion fires probabilistically, then min-p filters, and temperature scaling is applied last. The exact interaction order is documented in the Aphrodite Engine API reference.
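
From Python, the combined body can be sent with the standard library alone. The sketch below builds the request without contacting the server until post() is called; the helper names are hypothetical and the field names are those used in the JSON above.

```python
import json
import urllib.request
from typing import Optional

API_URL = "http://localhost:2242/v1/chat/completions"

# Sampler settings copied from the combined example.
DEFAULT_SAMPLERS = {
    "min_p": 0.05,
    "dry_multiplier": 0.8,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "xtc_probability": 0.3,
    "xtc_threshold": 0.1,
}


def build_chat_request(model: str, user_message: str, temperature: float = 0.85,
                       samplers: Optional[dict] = None,
                       url: str = API_URL) -> urllib.request.Request:
    """Build a chat completion request with sampler fields in the JSON body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    body.update(DEFAULT_SAMPLERS if samplers is None else samplers)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(req: urllib.request.Request) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

With the openai client library, the same sampler fields can be passed through the extra_body argument of chat.completions.create instead of building the request by hand.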

Benchmarks and Cost per Token vs vLLM

These are directional figures scaled from published ExLlamaV2 and vLLM benchmarks on equivalent H100 hardware. Actual results vary with batch size, context length, and model architecture.

Throughput: Llama 3.3 70B GPTQ on H100 SXM5

Concurrency	Aphrodite (tok/s)	vLLM (tok/s)	Ratio
1	48	52	0.92x
10	380	410	0.93x
50	1,450	1,620	0.90x

At standard concurrencies, Aphrodite runs roughly 7-10% behind vLLM on pure throughput for the same quantization format. The tradeoff is the broader format support and sampler extensions. For EXL2-specific checkpoints where vLLM is not an option, the comparison is moot.

Cost per million tokens: H100 SXM5 at $2.54/hr on-demand

Format	Throughput (tok/s)	$/1M tokens (on-demand)
BF16 70B	95	$7.43
GPTQ/EXL2 70B	380	$1.86
FP8 70B	720	$0.98

The cost differential between BF16 and GPTQ/EXL2 at 70B scale is the main economic argument for running quantized formats. At $1.86/1M tokens for GPTQ vs $7.43 for BF16, the quality tradeoff is usually worth it for chat applications.
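
The per-token figures in the table are the hourly GPU price divided by tokens generated per hour. A short sketch of that arithmetic, using the throughput and price values from the table; the function name is illustrative.

```python
def cost_per_million_tokens(gpu_price_per_hour: float,
                            tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    h100_on_demand = 2.54  # $/hr, from the pricing table
    for label, tok_s in [("BF16 70B", 95), ("GPTQ/EXL2 70B", 380), ("FP8 70B", 720)]:
        print(label, round(cost_per_million_tokens(h100_on_demand, tok_s), 2))
```

Substituting the spot price or a measured throughput from your own workload gives a like-for-like comparison.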


Scaling and Production Monitoring

Prometheus metrics

Aphrodite exposes a /metrics endpoint compatible with any Prometheus scraper. Key metrics:

Metric	What It Tells You
`aphrodite:num_requests_waiting`	Backlog queue depth; scale up if persistently above 0
`aphrodite:gpu_cache_usage_perc`	KV cache utilization; reduce batch size or add VRAM if above 90%
`aphrodite:num_requests_running`	Active concurrent requests
`aphrodite:avg_generation_throughput_toks_per_s`	Rolling throughput

For GPU-level metrics alongside Aphrodite application metrics, see GPU Monitoring for ML Workloads.
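
For a quick check without a full Prometheus stack, the /metrics text can be parsed directly. A minimal sketch, assuming the standard Prometheus text exposition format without timestamps; the sample payload, its label, and its values are made up for illustration.

```python
from typing import Dict, Iterable

WATCHED = ("aphrodite:num_requests_waiting", "aphrodite:gpu_cache_usage_perc")

# Illustrative payload only; real output carries more series and labels.
SAMPLE = """\
# HELP aphrodite:num_requests_waiting Number of requests waiting.
# TYPE aphrodite:num_requests_waiting gauge
aphrodite:num_requests_waiting{model_name="m"} 3.0
aphrodite:gpu_cache_usage_perc{model_name="m"} 0.42
"""


def parse_gauges(metrics_text: str, wanted: Iterable[str] = WATCHED) -> Dict[str, float]:
    """Extract selected gauge values from Prometheus text output.

    If a metric appears with several label sets, the last one wins.
    """
    wanted = set(wanted)
    values: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            values[name] = float(raw_value)
    return values


def has_backlog(values: Dict[str, float]) -> bool:
    """True when requests are queued, the scale-up signal from the table."""
    return values.get("aphrodite:num_requests_waiting", 0.0) > 0


if __name__ == "__main__":
    print(parse_gauges(SAMPLE), has_backlog(parse_gauges(SAMPLE)))
```

Check whether your version reports the cache gauge as a fraction or a percentage before wiring a 90% alert threshold to it.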

nginx reverse proxy

nginx

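upstream aphrodite {
    server 127.0.0.1:2242;
}

server {
    listen 443 ssl;
    server_name your-inference-host.example.com;

    # Placeholder certificate paths; an ssl listener needs both to start.
    ssl_certificate     /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;

    # OpenAI-compatible API
    location /v1/ {
        proxy_pass http://aphrodite;
        proxy_read_timeout 120s;
        proxy_set_header Host $host;
    }

    # Readiness probe for the load balancer
    location /health {
        proxy_pass http://aphrodite/health;
    }
}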

Multiple instances behind a load balancer

For horizontal scaling, run multiple Aphrodite processes on separate ports and load-balance across them. Each instance loads the same base model and the same LoRA adapter registry independently.

bash

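# Instance 1, pinned to GPU 0 so the two processes do not share a device
CUDA_VISIBLE_DEVICES=0 aphrodite run TheBloke/Llama-2-70B-Chat-GPTQ --quantization gptq --port 2242 &

# Instance 2, pinned to GPU 1
CUDA_VISIBLE_DEVICES=1 aphrodite run TheBloke/Llama-2-70B-Chat-GPTQ --quantization gptq --port 2243 &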

A health check on GET /health returns 200 when the server is ready and 503 when it is loading. Use this in your load balancer's health check configuration to avoid routing traffic to instances still loading the model.
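
A deployment script can use the same endpoint to wait for a freshly started instance before adding it to the pool. A sketch using only the standard library; the function names, attempt count, and polling interval are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout_s: float = 2.0) -> Optional[int]:
    """Return the HTTP status for GET url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, or timeout


def wait_until_ready(probe: Callable[[], Optional[int]], max_attempts: int = 60,
                     interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Poll probe() until it returns 200.

    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if probe() == 200:
            return attempt
        if attempt < max_attempts:
            sleep(interval_s)
    return None


if __name__ == "__main__":
    ready = wait_until_ready(lambda: http_status("http://localhost:2242/health"),
                             max_attempts=3, interval_s=1.0)
    print("ready" if ready else "not ready")
```

Passing the probe as a callable keeps the polling loop testable without a live server.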

Aphrodite Engine's broad quantization support shines on bare-metal GPU instances where there's no hypervisor overhead competing for memory bandwidth. Spheron gives you per-minute billing on GPU instances with full SSH root access.


Quick Setup Guide

Provision a GPU instance on Spheron
Log in to app.spheron.ai and select your GPU based on model size and quantization format. For EXL2/GPTQ 7-13B models: A100 80GB. For 70B models: H100 SXM5 80GB. For large context or 70B at BF16: H200 SXM5. SSH into the instance and verify GPU access with nvidia-smi.
Install Aphrodite Engine
Install via pip: pip install aphrodite-engine. For Docker: docker pull alpindale/aphrodite-engine:latest. Verify the install with: python -c 'import aphrodite; print(aphrodite.__version__)'.
Launch the inference server
Run: aphrodite run <model-id> --port 2242. For GPTQ: add --quantization gptq. For EXL2: add --quantization exl2. For GGUF: add --quantization gguf. For LoRA hot-swap: add --enable-lora --lora-modules adapter1=path/to/adapter1. The server starts an OpenAI-compatible API on port 2242. To enable the KoboldAI-compatible API on the same port 2242, add --launch-kobold-api.
Connect clients via OpenAI or KoboldAI API
For OpenAI clients: set base_url to http://localhost:2242/v1. For KoboldAI clients: set endpoint to http://localhost:2242/api/v1/generate (requires --launch-kobold-api flag). Test with: curl http://localhost:2242/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "<model-id>", "messages": [{"role": "user", "content": "Hello"}]}'.
Enable advanced samplers
Pass sampling parameters in the request body: dry_multiplier (DRY repetition penalty), xtc_probability and xtc_threshold (XTC sampler), min_p (minimum probability filter), mirostat_mode and mirostat_tau (Mirostat), eta_cutoff. These are appended as extra fields in the JSON body of /v1/chat/completions requests alongside standard temperature and top_p.
Set up monitoring
Aphrodite Engine exposes Prometheus metrics at /metrics. Watch aphrodite:num_requests_waiting for queue depth and aphrodite:gpu_cache_usage_perc for KV cache pressure. Add a health check at GET /health and configure an nginx reverse proxy for production traffic.

Frequently Asked Questions

What is Aphrodite Engine and how does it differ from vLLM?

Aphrodite Engine is an open-source LLM serving framework forked from vLLM by PygmalionAI. It extends vLLM's continuous batching and PagedAttention with native support for EXL2 and GGUF quantization formats, advanced samplers (DRY, XTC, min-p, mirostat, eta-cutoff), a KoboldAI-compatible API alongside the OpenAI-compatible API, and per-request LoRA hot-swapping without a server restart. It is licensed under AGPL-3.0, which places copyleft obligations on any service that exposes the engine over a network.

What quantization formats does Aphrodite Engine support?

Aphrodite Engine supports EXL2 (via ExLlamaV2 kernels), GGUF (via llama.cpp kernels), GPTQ (both original and GPTQ-Marlin fast-path), AWQ (activation-aware INT4), FP8 (on Hopper/Blackwell Tensor Cores), Marlin mixed-precision, and SqueezeLLM. This breadth makes it the only major serving framework that natively handles both CPU-adjacent formats (GGUF) and GPU-native quant formats (EXL2, GPTQ, AWQ, FP8) in a single server process.

How does Aphrodite Engine's per-request LoRA hot-swapping work?

Aphrodite Engine supports loading and switching LoRA adapters on a per-request basis without restarting the inference server. Adapters are registered at startup or dynamically via the API, and each request can specify a different adapter by name in the model field. Multiple adapters remain resident in GPU memory up to a configured cache size. This is the same architectural pattern as vLLM's LoRA serving but with support for more adapter types and compatible with Aphrodite's quantized backends.

What GPUs does Aphrodite Engine work on?

Aphrodite Engine supports NVIDIA GPUs from Pascal (GTX 1080, P100) and newer. Hopper GPUs (H100, H200) and Blackwell GPUs (B200, RTX 5090, RTX PRO 6000) are first-class targets with FP8 Tensor Core support. Ada Lovelace (L40S, RTX 4090) and Ampere (A100, A10G) are fully supported. The broad Pascal-and-up support is one of Aphrodite's key advantages for teams with mixed or older GPU fleets.

Can I use existing vLLM or KoboldAI client code with Aphrodite Engine?

Yes. Aphrodite Engine exposes two API surfaces: a fully OpenAI-compatible REST API at /v1/chat/completions, /v1/completions, and /v1/models, and a KoboldAI-compatible API at /api/v1/generate. Any client code that already targets vLLM's OpenAI-compatible API or a KoboldAI endpoint works without modification.