Getting from a working localhost demo to a scaled, reliable production AI app involves a dozen moving parts. This guide covers all of them — containerization, LLM serving, autoscaling, CI/CD, monitoring, and cost optimization. Step-by-step with real commands.
Most AI apps never make it to production. Not because the model is bad or the idea is wrong — but because AI app deployment is surprisingly hard. Getting from a working localhost demo to a scaled, reliable, cost-efficient production system involves a dozen moving parts that most tutorials skip entirely.
This guide covers all of them. By the end you'll know exactly how to take any AI app — an LLM-powered API, a RAG pipeline, or an AI agent — and get it running in production with autoscaling, secrets management, CI/CD, monitoring, and controlled costs. No Kubernetes expertise required.
Most developers have experienced this: you build an AI app on your laptop, it works beautifully, and then you try to share it with anyone and everything falls apart. The model takes 30 seconds to respond on a cold start. The API key is hardcoded. There's no way to handle two users at once. This is the prototype-to-production gap, and it's wider for AI apps than for any other type of software.
Traditional web apps have mature, well-documented deployment paths. AI apps don't — or at least, they didn't until recently. The tooling has finally caught up, but you need to know which pieces to use and in what order.
Four things make AI apps harder to deploy than conventional applications:
Model weights are large. A 7B parameter model is roughly 14GB in fp16. A 70B model is ~140GB. Your container images aren't megabytes anymore — they're gigabytes, which changes how you think about builds, cold starts, and registry storage.
GPU requirements are non-negotiable for serious inference. You can run small models on CPU, but anything over a few billion parameters at real throughput needs a GPU. That means your deployment platform must support GPU instances, and your orchestration must handle GPU scheduling correctly.
Cold starts are expensive. Loading a 13B model into GPU VRAM takes 30–90 seconds. A user hitting your API while the instance is cold will wait that long before getting a response. Managing cold starts — through minimum replicas, warm pools, or preloading — is a first-class concern in AI deployment that doesn't exist for stateless web apps.
Outputs are non-deterministic. You can't just check "did it return 200?" You need to monitor response quality, detect drift, and handle cases where the model produces something valid but wrong.
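The weight sizes quoted above follow from parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic; the function name and the precision table are illustrative assumptions, and the result covers weights only (the KV cache and activations need additional memory):

```python
# Bytes used to store one parameter at each precision (illustrative table).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_size_gb(params_billions: float, dtype: str = "fp16") -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for size in (7, 13, 70):
        print(f"{size}B fp16: ~{weights_size_gb(size):.0f} GB")
```

Running the same function with a quantized precision shows why quantization is often the first lever people reach for when a model does not fit on one GPU.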
A production AI app isn't just a model behind an API. It typically has:
Inference API — the endpoint that serves model predictions
Background workers — async tasks like embeddings generation, document processing, or agent loops
Secrets vault — encrypted storage for API keys, database credentials, model provider tokens
Custom domain + TLS — a real URL, not an auto-generated one
CI/CD pipeline — automated deploys on every push to main
Observability — logs, metrics, health checks, and alerts
Getting all of these working together is what "deploying an AI app" actually means.
Containerization is the foundation of every reliable AI deployment. Docker gives you reproducibility: if it runs in your container locally, it runs the same way in production.
The key difference from a standard Python Dockerfile is the base image. For GPU inference, you need a CUDA-enabled base image that matches the CUDA version your inference framework expects.
```dockerfile
# Use a CUDA-enabled base for GPU inference (NVIDIA image tags include the patch version)
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
WORKDIR /app
# Install Python (Ubuntu 22.04 ships Python 3.10 as python3, which is what python3-pip targets)
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Install dependencies first (layer cache optimization)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy app code
COPY . .
# Expose inference port
EXPOSE 8000
CMD ["python3", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

For CPU-only inference (small models, embeddings, classification):
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
```

You have three options for where model weights live:
| Strategy | How it works | Best for |
|---|---|---|
| Download on start | App pulls weights from HuggingFace Hub at startup | Development, infrequent deploys |
| Bake into image | COPY weights into the container at build time | Small models, air-gapped environments |
| Mount from volume | Weights stored on persistent volume, mounted at runtime | Large models (13B+), fast cold starts |
For production with large models, download-on-start with a warm cache layer is the most practical. Cache the model directory to a persistent volume so subsequent starts skip the download:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
MODEL_CACHE = "/cache/models"  # mounted persistent volume

# The first start downloads into MODEL_CACHE; later starts load from the volume
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir=MODEL_CACHE)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    cache_dir=MODEL_CACHE,
    torch_dtype="auto",
    device_map="auto",
)
```

CUDA driver compatibility is one of the most common sources of deploy failures. The rule: the CUDA toolkit version in your container must be equal to or lower than the highest CUDA version that the host's NVIDIA driver supports.
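```bash
# Check the host driver; the header shows the highest CUDA version it supports
nvidia-smi
# Verify that a CUDA 12.1 container can see the GPU through that driver
# (the base image has no nvcc, so run nvidia-smi inside it instead)
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```

When in doubt, use CUDA 12.1, which is broadly compatible with current H100/A100/L40S GPU hosts.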
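The compatibility rule can be expressed as a simple version comparison. This is a minimal sketch, assuming you have already read the container's toolkit version and the maximum CUDA version shown in the `nvidia-smi` header; the function names are hypothetical, and the check ignores NVIDIA's optional forward-compatibility packages:

```python
def parse_version(version: str) -> tuple[int, int]:
    """Turn '12.1' or '12.1.1' into a comparable (major, minor) tuple."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1]) if len(parts) > 1 else 0


def container_runs_on_host(container_cuda: str, driver_max_cuda: str) -> bool:
    """True when the container toolkit is no newer than what the driver supports."""
    return parse_version(container_cuda) <= parse_version(driver_max_cuda)


if __name__ == "__main__":
    print(container_runs_on_host("12.1", "12.4"))
    print(container_runs_on_host("12.4", "12.1"))
```

A check like this could run in CI before an image is promoted, so a mismatch shows up as a failed pipeline rather than a failed deploy.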
When production inference needs high throughput and low latency, a raw FastAPI wrapper around a HuggingFace model will get you started but won't scale. Dedicated LLM serving frameworks add continuous batching, KV cache management, and OpenAI-compatible APIs that dramatically improve performance.
vLLM is a widely used engine for high-throughput LLM inference. Its PagedAttention algorithm manages the GPU memory used by the KV cache more efficiently than naive implementations, and together with continuous batching it can serve several times more requests per second on the same hardware (up to roughly 10x over a one-request-at-a-time loop, depending on the workload).
```bash
# Install
pip install vllm
# Serve Mistral 7B with an OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```

vLLM exposes /v1/completions and /v1/chat/completions endpoints that follow the OpenAI API. Code written against the OpenAI SDK works against vLLM once you point the client's base URL at your server and pass the served model name.
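To illustrate the OpenAI-compatible interface, here is a minimal sketch that builds a chat request using only the standard library. It assumes the server from the command above is reachable at `http://localhost:8000`; the helper names are hypothetical, and in practice you would more likely use the OpenAI SDK with its base URL pointed at the same address:

```python
import json
import urllib.request

VLLM_BASE_URL = "http://localhost:8000/v1"  # assumption: local vLLM server
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"


def build_chat_request(message: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST to /chat/completions in the OpenAI chat format."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{VLLM_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(message: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(message), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` requires a running server; `build_chat_request` on its own is useful for inspecting exactly what would be sent.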
Text Generation Inference (TGI) is Hugging Face's production inference server. It's purpose-built for transformer models and handles model sharding across multiple GPUs natively.
```bash
docker run --gpus all \
  -v $HOME/.cache/huggingface:/data \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --max-input-length 2048 \
  --max-total-tokens 4096
```

BentoML wraps your model into a self-contained service with built-in batching, adaptive concurrency, and a deployment-ready artifact called a Bento. It's the right choice when you need to chain multiple models (embedding → reranker → LLM) in a single deployment unit.
If you're calling a hosted model API (OpenAI, Anthropic, Google) rather than running your own weights, you don't need a serving framework. A standard FastAPI or Express app calling the API directly is perfectly correct:
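```python
import os

from fastapi import FastAPI
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
# Async client, so a slow completion does not block the event loop
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

class ChatRequest(BaseModel):
    message: str  # sent in the JSON body rather than as a query parameter

@app.post("/chat")
async def chat(request: ChatRequest):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": request.message}],
    )
    return {"reply": response.choices[0].message.content}
```

Not all cloud platforms handle AI workloads well. Before committing, evaluate: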
GPU availability — can you get H100, A100, or L40S instances on demand?
Autoscaling — does it handle GPU cold starts intelligently, or just terminate and restart instances?
Secrets management — is there a built-in vault, or do you have to integrate HashiCorp Vault or AWS Secrets Manager yourself?
Pricing transparency — are egress fees, GPU idle time, and per-request charges clearly documented?
CI/CD support — can you integrate with GitHub Actions without a custom plugin?
Time to first deploy — how many steps from code to running URL?
| Platform | GPU Support | Autoscaling | Secrets Built-in | Time to Deploy | Best For |
|---|---|---|---|---|---|
| NEXUS AI | ✅ Multi-cloud GPU | ✅ GPU-aware | ✅ Encrypted vault | ~2 min | Full AI apps, any stack |
| AWS SageMaker | ✅ Full AWS GPU fleet | ✅ Complex config | ⚠️ Via IAM/SSM | 20–45 min | Enterprise, existing AWS |
| Modal | ✅ On-demand GPU | ✅ Serverless | ⚠️ Secrets via SDK | ~5 min | Python functions, batch |
| Replicate | ✅ Managed GPUs | ✅ Serverless | ❌ No vault | ~10 min | Public model hosting |
| Render | ❌ CPU only | ⚠️ Basic | ⚠️ Env vars only | ~5 min | Lightweight AI apps |
| Railway | ❌ CPU only | ⚠️ Basic | ⚠️ Env vars only | ~3 min | Prototyping |
Prototype / solo developer — Use Railway or Render for CPU workloads. Use Modal for GPU inference if you're comfortable with Python decorators. Prioritize speed over everything else.
Scaling startup (Series A–B, 10K–100K users/day) — You need GPU autoscaling, encrypted secrets, custom domains, and CI/CD that doesn't require a dedicated DevOps engineer. NEXUS AI is built for this stage.
Enterprise — AWS SageMaker or Google Vertex AI if you're already in those ecosystems and have a platform team. The operational complexity is manageable at scale with dedicated infrastructure resources.
NEXUS AI is designed so a single developer can deploy a complete AI app — GPU inference, secrets, custom domain, CI/CD — without infrastructure expertise. The CLI is the fastest path from container to production.
Linux:

```bash
curl -fsSL https://nexusai.run/install.sh | bash
```

macOS (Intel + Apple Silicon):

```bash
curl -fsSL https://nexusai.run/install-mac.sh | bash
```

Manual install via npm (if you already have Node.js 18+):

```bash
npm install -g nexusapp-cli
```

Verify:

```bash
nexus --version
```

```bash
# Log in (opens browser for authentication)
nexus auth login
# Confirm who you are
nexus auth whoami
# Deploy a container image
nexus deploy create \
  --name ai-api \
  --image your-org/ai-api:latest \
  --port 8000 \
  --provider gcp_cloud_run \
  --env NODE_ENV=production
```

The --wait flag blocks until the deployment goes live:
```bash
nexus deploy create \
  --name ai-api \
  --image your-org/ai-api:latest \
  --port 8000 \
  --provider gcp_cloud_run \
  --wait
```

Never put API keys in environment variables passed on the command line. Use the secrets vault:
```bash
# Create encrypted secrets
nexus secret create OPENAI_API_KEY --deployment ai-api
nexus secret create DATABASE_URL --deployment ai-api
nexus secret create ANTHROPIC_API_KEY --deployment ai-api
# List secrets (values are never shown)
nexus secret list --deployment ai-api
```

```bash
nexus domain add api.yourcompany.com --deployment ai-api
```

The CLI returns the DNS record to add (a CNAME pointing to the NEXUS AI edge). TLS is provisioned automatically.
```bash
# Follow live output
nexus deploy logs ai-api --follow
# Get the last 100 lines
nexus deploy logs ai-api --tail 100
```

Standard web app autoscaling reacts to CPU or request count. For AI workloads, these signals are often wrong. GPU utilization can read 100% whether the server is handling one request or fifty, and serving frameworks such as vLLM keep GPU memory almost fully allocated even when idle. A single large request can take 30 seconds while a queue of 50 requests builds up.
The smarter signal is queue depth — how many requests are waiting. Scale out when the queue grows, scale in when it drains.
Scale-to-zero saves cost when traffic is low or intermittent. The downside is the cold start penalty: when a new instance starts, it takes 30–90 seconds to load a large model into VRAM before it can serve the first request.
Minimum replicas > 0 eliminates cold starts but incurs idle cost. The right tradeoff:
| Scenario | Recommended config |
|---|---|
| Development / low traffic | Min 0, max 3 (accept cold starts) |
| B2C product with latency SLA | Min 1, max 10 (always one warm replica) |
| High-traffic API | Min 2, max 20 (no cold starts, headroom for spikes) |
Configure your deployment to scale based on concurrent request pressure rather than CPU:
```bash
nexus deploy scale ai-api \
  --min 1 \
  --max 10 \
  --target-concurrency 5
```

--target-concurrency 5 means: add a replica whenever a single instance is handling more than 5 concurrent requests. For LLM inference where each request may take 2–10 seconds, a concurrency target of 3–8 is typically right.
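Concurrency-based scaling boils down to dividing the current load by the per-replica target and clamping the result between the minimum and maximum. The following is a minimal sketch of that calculation; it is a generic formulation rather than NEXUS AI's documented algorithm, and counting queued requests as load is an assumption:

```python
import math


def desired_replicas(
    in_flight: int,
    queued: int,
    target_concurrency: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replicas needed so each handles at most target_concurrency requests."""
    load = in_flight + queued
    needed = math.ceil(load / target_concurrency)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # 12 requests being served plus 8 waiting, with a target of 5 per replica
    print(desired_replicas(12, 8, target_concurrency=5, min_replicas=1, max_replicas=10))
```

Including the queue in the load figure is what makes the signal react to backlog rather than only to requests already being served.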
Create .github/workflows/deploy.yml in your repo:
```yaml
name: Deploy to NEXUS AI
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # lets GITHUB_TOKEN push images to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to container registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # Image names must be lowercase; adjust if your org or repo name has capitals
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - name: Install NEXUS AI CLI
        run: npm install -g nexusapp-cli
      - name: Deploy to production
        env:
          NEXUSAI_TOKEN: ${{ secrets.NEXUSAI_TOKEN }}
        run: |
          nexus deploy create \
            --name ai-api \
            --image ghcr.io/${{ github.repository }}:${{ github.sha }} \
            --port 8000 \
            --provider gcp_cloud_run \
            --wait
```

In your GitHub repository: Settings → Secrets and variables → Actions → New repository secret
Add NEXUSAI_TOKEN with your token from:

```bash
nexus auth whoami # shows token ID
# Or generate a CI token:
nexus auth login --token nxk_your_ci_token_here
```

NEXUS AI uses a rolling deployment strategy by default: new instances start and pass health checks before old instances are terminated. To trigger a redeploy of the same image (e.g., after a secrets change):

```bash
nexus deploy redeploy ai-api --wait
```

To roll back to the previous version:

```bash
nexus deploy rollback ai-api
```

Standard infrastructure metrics (CPU, memory, request rate) don't tell the full story. Track these AI-specific metrics:
| Metric | What it measures | Good target |
|---|---|---|
| TTFT (Time to First Token) | Latency until first token streams | < 500ms at p95 |
| Throughput | Tokens per second per GPU | Model-dependent |
| Queue depth | Waiting requests | < 10 for real-time APIs |
| GPU utilization | Active inference time | 60–80% for cost efficiency |
| Error rate | 4xx + 5xx responses | < 0.1% |
| Token budget exceeded | Requests hitting context limit | Track separately |
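Targets such as "p95 under 500ms" are percentiles over a window of measurements. Here is a minimal sketch of computing one with the nearest-rank method; the sample values are made up for illustration, and a metrics backend would normally do this calculation for you:

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


ttft_ms = [120, 135, 150, 180, 210, 240, 260, 310, 420, 900]  # illustrative data
p95 = percentile(ttft_ms, 95)
print(f"p95 TTFT: {p95} ms, within 500 ms target: {p95 < 500}")
```

Note how a single slow request dominates the p95 in a small window, which is why alerts are usually evaluated over several minutes of traffic rather than a handful of requests.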
Add a /health endpoint that NEXUS AI uses to determine if an instance is ready to serve traffic:
```python
from fastapi import HTTPException

@app.get("/health")
async def health():
    # Report ready only once the model has finished loading
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ok", "model": MODEL_NAME}
```

The platform polls this endpoint every 10 seconds. Instances that fail health checks are replaced automatically.
Log requests and responses in structured JSON so you can query and alert on them:
```python
import json
import logging
import time

logger = logging.getLogger(__name__)

@app.post("/chat")
async def chat(request: ChatRequest):
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    response = await run_inference(request)
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info(json.dumps({
        "event": "inference",
        "model": MODEL_NAME,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_ms": round(latency_ms, 2),
        "finish_reason": response.choices[0].finish_reason,
    }))
    return response
```

Stream logs in real time:

```bash
nexus deploy logs ai-api --follow
```

AI workloads are expensive if left unmanaged. The four main cost drivers:
GPU compute — billed per second of active use. An H100 runs ~$2.74/hr; an A100 ~$1.80/hr; an L40S ~$1.32/hr.
Idle replicas — minimum replica counts keep GPUs warm but cost money even at zero traffic.
Egress fees — data transferred out of your cloud region. Large model responses and streaming add up.
Cold start overhead — models reloading from scratch wastes GPU time and user patience.
| App size | Traffic | Config | Est. monthly cost |
|---|---|---|---|
| Prototype | < 100 req/day | 1× L40S, min 0, max 1 | ~$50–$120 |
| Growing product | 10K req/day | 2× A100, min 1, max 5 | ~$800–$1,500 |
| Production scale | 100K req/day | 4× H100, min 2, max 20, semantic cache | ~$3,000–$6,000 |
These are estimates for GPU compute only. Add ~10–15% for egress, storage, and DNS.
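Estimates like these come from multiplying an hourly GPU rate by the hours each replica stays up. Below is a minimal sketch of that arithmetic using the per-hour prices quoted earlier; the 730 hours-per-month figure and the function name are assumptions, and real bills depend on how often you scale above the minimum:

```python
HOURLY_RATE_USD = {"H100": 2.74, "A100": 1.80, "L40S": 1.32}  # rates quoted above
HOURS_PER_MONTH = 730  # assumption: average month


def monthly_gpu_cost(gpu: str, avg_replicas: float, gpus_per_replica: int = 1) -> float:
    """Compute-only monthly cost for an average number of running replicas."""
    return HOURLY_RATE_USD[gpu] * HOURS_PER_MONTH * avg_replicas * gpus_per_replica


if __name__ == "__main__":
    # One always-warm A100 replica, and an L40S that is up about 10% of the time
    print(round(monthly_gpu_cost("A100", avg_replicas=1.0), 2))
    print(round(monthly_gpu_cost("L40S", avg_replicas=0.1), 2))
```

Using an average replica count rather than the configured minimum or maximum lets the same function model scale-to-zero setups, where the average can be well below one.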
Many real-world AI apps receive semantically similar queries repeatedly. A semantic cache stores embeddings of past queries and returns cached responses for near-matches, bypassing the model entirely.
```python
import hashlib
import json
import os

from redis.asyncio import Redis

cache = Redis.from_url(os.environ["REDIS_URL"])

async def cached_inference(prompt: str, ttl_seconds: int = 3600):
    # Exact-match cache: identical prompts hash to the same key
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()
    if cached := await cache.get(cache_key):
        return json.loads(cached)
    # Cache miss: run inference and store the result for ttl_seconds
    result = await run_inference(prompt)
    await cache.setex(cache_key, ttl_seconds, json.dumps(result))
    return result
```

For semantic (fuzzy) caching, libraries like GPTCache or LangChain's RedisSemanticCache handle embedding comparison automatically.
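To make the semantic variant concrete, here is a minimal in-memory sketch: it stores one embedding per cached prompt and returns the stored response when cosine similarity clears a threshold. The class name is hypothetical and the embedding function is supplied by the caller; a real deployment would use an embedding model and a vector index rather than a linear scan:

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        """Return the best cached response at or above the threshold, else None."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score >= self.threshold and score > best_score:
                best_score, best_response = score, response
        return best_response

    def put(self, prompt: str, response: str) -> None:
        """Store the response under the prompt's embedding."""
        self.entries.append((self.embed(prompt), response))
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, too high and the cache rarely hits.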
For non-latency-sensitive work — batch embeddings, document processing, fine-tuning runs — use spot/preemptible GPU instances at 60–80% lower cost:
```bash
nexus deploy create \
  --name embeddings-worker \
  --image your-org/embeddings:latest \
  --provider gcp_cloud_run \
  --spot \
  --env BATCH_MODE=true
```

Spot instances can be interrupted. Design batch workers to checkpoint their state and resume after preemption.
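Checkpoint-and-resume can be as simple as recording the index of the next unprocessed item. This is a minimal sketch, assuming items are processed in order and the checkpoint path lives on storage that survives preemption; the file layout and function names are hypothetical:

```python
import json
import os
from typing import Callable, Sequence


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path: str, next_index: int) -> None:
    """Write to a temp file and rename, so a preemption mid-write cannot corrupt it."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_batch(items: Sequence[str], process: Callable[[str], None], path: str) -> None:
    """Process items in order, resuming from the checkpoint after a restart."""
    for index in range(load_checkpoint(path), len(items)):
        process(items[index])
        save_checkpoint(path, index + 1)
```

Because the checkpoint is written after each item, an interruption repeats at most the item that was in progress, so `process` should be safe to run twice on the same input.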
AI agents have different infrastructure requirements than stateless inference endpoints. They're worth treating as a separate deployment pattern.
A typical inference endpoint receives a request, runs the model, and returns a response, usually within a few seconds. An agent loop is different:
Long-running — a single agent run may take 30 seconds to 10 minutes
Stateful — the agent maintains context across tool calls
Event-driven — agents may be triggered by webhooks, scheduled events, or messages from a queue
Tool-calling — agents call external APIs, databases, and other services mid-run
A standard HTTP request/response pattern doesn't fit. Agent deployments need a background worker architecture.
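A background worker architecture separates accepting a job from executing it. The pattern can be sketched in a single process with `asyncio.Queue` standing in for a real queue service. This is an illustration only: the function names are hypothetical, the agent loop is a placeholder, and a real deployment would use a durable queue and state store so that jobs survive restarts:

```python
import asyncio
import uuid

jobs: dict[str, dict] = {}  # stand-in for a persistent state store


async def enqueue(queue: asyncio.Queue, task: str) -> str:
    """API receiver: record the job, enqueue it, and return the ID immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "result": None}
    await queue.put((job_id, task))
    return job_id


async def run_agent(task: str) -> str:
    """Placeholder for a long-running agent loop."""
    await asyncio.sleep(0.01)
    return f"done: {task}"


async def worker(queue: asyncio.Queue) -> None:
    """Worker: pull jobs forever and write results back to the state store."""
    while True:
        job_id, task = await queue.get()
        jobs[job_id]["status"] = "running"
        jobs[job_id]["result"] = await run_agent(task)
        jobs[job_id]["status"] = "finished"
        queue.task_done()


async def main() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    consumer = asyncio.create_task(worker(queue))
    job_id = await enqueue(queue, "summarize report")
    await queue.join()  # wait until the worker has finished the job
    consumer.cancel()
    return jobs[job_id]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The caller gets a job ID back before any agent work starts, and polls or subscribes for the result later, which is what keeps long runs from tying up HTTP connections.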
```
[Trigger]            [Queue]              [Worker]              [State]
Webhook    ──────►   Redis/SQS   ──────►  Agent Loop   ──────►  Redis/Postgres
API call             (task queue)         (background job)      (conversation state)
Scheduler
```

Deploy two separate services:
1. API receiver: accepts incoming triggers, enqueues jobs, returns a job ID immediately:

```bash
nexus deploy create \
  --name agent-api \
  --image your-org/agent-api:latest \
  --port 3000 \
  --provider gcp_cloud_run
```

2. Worker: long-running process that pulls from the queue and executes agent loops:

```bash
nexus deploy create \
  --name agent-worker \
  --image your-org/agent-worker:latest \
  --provider gcp_cloud_run \
  --env QUEUE_URL=redis://... \
  --min 1 \
  --max 5
```

Agent tool calls fail. External APIs go down, rate limits hit, and network errors happen. Build retries into every tool:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
async def call_tool(tool_name: str, args: dict):
    # Up to 3 attempts, with exponential backoff bounded between 2 and 10 seconds
    return await TOOLS[tool_name](**args)
```

Set a hard timeout on the entire agent run to prevent runaway jobs from consuming GPU time indefinitely:

```python
import asyncio

async def run_agent_with_timeout(task: str, timeout_seconds: int = 300):
    try:
        return await asyncio.wait_for(run_agent(task), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return {"error": "Agent run exceeded time limit", "partial": get_partial_result()}
```

Standard request logging doesn't capture agent behavior. Log each step of the agent loop as a structured trace:
```python
logger.info(json.dumps({
    "event": "agent_step",
    "run_id": run_id,
    "step": step_number,
    "action": tool_name,
    "input": tool_input,
    "output_summary": str(result)[:200],
    "elapsed_ms": elapsed,
}))
```

This gives you a complete trace of every agent run that you can query in your log aggregator.
Before going live, verify every item:
No secrets in environment variables — use the encrypted secrets vault for all API keys, tokens, and database credentials
Health check endpoint — /health returns 200 only when the model is fully loaded and ready
Graceful shutdown — handle SIGTERM to finish in-flight requests before the container exits
Minimum 1 replica for latency-sensitive APIs — never accept cold starts on a user-facing endpoint
Structured JSON logging — every request and response logged with timestamps, token counts, and latency
CI/CD pipeline — no manual deploys; every push to main runs tests then deploys automatically
Rollback tested — run nexus deploy rollback in staging before you ever need it in production
Semantic caching for repeated queries — implement even a simple exact-match cache before launch
TTFT monitored and alerted — set an alert if p95 TTFT exceeds your SLA (typically 500ms–2s)
Egress costs estimated — calculate expected monthly egress based on average response size × daily request volume before you're surprised by the bill
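The graceful shutdown item from the checklist can be sketched as a flag that the SIGTERM handler sets and the work loop checks between jobs. This is a minimal sketch with hypothetical names; servers such as uvicorn already handle SIGTERM themselves, so it mainly applies to custom background workers:

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame) -> None:
    """Stop accepting new work; the job in progress is allowed to finish."""
    shutdown_requested.set()


def install_signal_handler() -> None:
    """Call once at startup, from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def worker_loop(get_job, process) -> int:
    """Process jobs until SIGTERM arrives or the source is exhausted."""
    handled = 0
    while not shutdown_requested.is_set():
        job = get_job()
        if job is None:
            break
        process(job)  # the current job always runs to completion
        handled += 1
    return handled
```

The platform's termination grace period needs to be longer than your slowest job, otherwise the container is killed before the loop gets a chance to exit cleanly.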
With the NEXUS AI CLI, the time from a tagged container image to a live production URL is typically 2–5 minutes. That includes image pull, container start, model loading, and health check confirmation. First-time deploys take slightly longer; redeployments of the same image take under 2 minutes.
No, you don't need Kubernetes expertise. Kubernetes is powerful but operationally expensive, and running it safely usually takes dedicated platform engineering. NEXUS AI handles container orchestration, autoscaling, service mesh, and load balancing without exposing Kubernetes primitives. If your organization already runs a Kubernetes cluster and has the expertise to manage it, BYOC (Bring Your Own Cloud) lets you deploy into your own cluster through the same CLI.
Costs depend on GPU tier, traffic volume, and caching strategy. A prototype running an open-source 7B model on an L40S GPU with scale-to-zero runs roughly $50–$120/month at low traffic. A production API serving 10,000 requests/day on two A100 replicas costs approximately $800–$1,500/month before optimizations. Semantic caching can reduce inference costs by 40–70% when a large share of queries repeat; the saving scales with your cache hit rate.
Llama 3 70B in fp16 requires approximately 140GB of VRAM for the weights alone. That means at least two H100 80GB GPUs or four A100 40GB GPUs in tensor-parallel mode, with whatever memory remains shared by the KV cache. For most production use cases, running Llama 3 8B (which fits in a single A100 or L40S) is the better choice: the 8B model handles the majority of real-world tasks well, at a fraction of the infrastructure cost.
MLOps is the set of practices for training, versioning, and deploying traditional ML models (classifiers, regressors, recommendation systems). LLMOps extends these practices for large language models, adding concerns that don't exist in classical ML: prompt versioning, context window management, TTFT optimization, hallucination monitoring, and the operational complexity of serving multi-billion parameter models at scale. The two overlap significantly in CI/CD, monitoring, and infrastructure patterns.
TTFT (Time to First Token) is the latency between when a user submits a request and when the first token of the response begins streaming. It's the primary latency metric for LLM APIs because users experience it directly — a low TTFT makes the app feel responsive even when total generation takes several seconds. A good TTFT target for a user-facing product is under 500ms at p95. For internal APIs where streaming isn't exposed, p95 total latency under 3 seconds is a reasonable goal.
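TTFT can be measured around any streaming response: start a timer, then record the elapsed time when the first chunk arrives. Here is a minimal sketch using a fake token stream; `fake_stream` is a stand-in for your real streaming client, and the delays are arbitrary:

```python
import time
from typing import Iterable, Iterator


def measure_stream(stream: Iterable[str]) -> tuple[float, float, str]:
    """Consume a token stream; return (ttft_seconds, total_seconds, full_text)."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token has arrived
        chunks.append(chunk)
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total, "".join(chunks)


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response (illustrative delays)."""
    time.sleep(0.05)  # simulated prefill before the first token
    for token in ["Hello", ",", " world"]:
        yield token
        time.sleep(0.01)


if __name__ == "__main__":
    ttft, total, text = measure_stream(fake_stream())
    print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, text {text!r}")
```

Measuring on the client side like this captures network and queueing time as well as model prefill, which is the latency users actually experience.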
With the NEXUS AI CLI:
```bash
nexus deploy rollback ai-api
```

This immediately routes traffic back to the previous container image. The failed deployment is stopped but not deleted, so you can re-examine its logs with nexus deploy logs ai-api --revision previous. NEXUS AI keeps the last three deployment revisions available for rollback.
You can deploy without a GPU for certain workloads. Embedding models, small classifiers, and lightweight text processing run well on CPU. For LLM inference, CPU-only deployment is practical for models up to ~3B parameters if you can tolerate 5–30 seconds per response. Anything larger or any latency-sensitive application needs a GPU. The NEXUS AI CLI defaults to CPU instances; add --provider gcp_cloud_run with a GPU-enabled machine type for GPU workloads.
AI app deployment in 2026 doesn't have to mean weeks of Kubernetes configuration, IAM policy debugging, and container registry setup. The tools exist to go from a working container to a scaled, monitored, CI/CD-enabled production deployment in an afternoon.
Install the NEXUS AI CLI, push your first deployment, and get a live URL in minutes:
```bash
curl -fsSL https://nexusai.run/install.sh | bash
nexus auth login
nexus deploy create --name my-ai-app --image your-org/app:latest --port 8000
```

The full CLI reference, autoscaling configuration guide, and CI/CD templates are in the NEXUS AI docs.