# How to Build a Private AI Inference Platform on a VPS in 2026: Ubuntu 24.04 + Docker + Ollama + Nginx Complete Tutorial
Have you ever worried about data privacy every time you call the ChatGPT or Claude API, or felt the sting of a monthly bill that keeps climbing? Over 18 months of running my own AI inference service, I've cut per-query costs by 92%. This article shares the complete journey—from every pitfall I hit to the working setup I landed on.
This guide is for developers and data scientists with Linux experience who want to run AI inference on their own servers. If you're already running production-grade AI services, you may find this covers the basics and would need to dig deeper into GPU configuration for larger models.
## Why Build a Private AI Inference Platform?
I built my first self-hosted AI inference platform out of pure necessity—the bills were killing me. In 2024, my peak monthly spending on third-party APIs exceeded $400, and whenever I pushed the volume up, I hit rate limits. After three weeks of migrating core inference tasks to a self-managed setup, my Q1 2026 average stabilized around $35/month (VPS included).
Core advantages of going self-hosted:
- **Cost**: Llama 3 8B quantized inference costs ~$0.0002/1K tokens ($0.20/1M) vs GPT-4o's $5/1M tokens, roughly a 25x difference
- **Privacy**: Sensitive data never leaves your own servers, zero third-party exposure
- **No rate limits**: Perfect for batch processing and training data generation
- **Full control**: Deploy any open-source model, including your own fine-tuned versions
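The cost comparison in the list above mixes two units (per 1K tokens and per 1M tokens), so it helps to normalise both before comparing. The sketch below does only that arithmetic; the helper name `per_million` is made up for illustration, and the prices are the ones quoted in the cost bullet, not independently verified figures.

```python
def per_million(price: float, per_tokens: int) -> float:
    """Normalise a price quoted per `per_tokens` tokens to dollars per 1M tokens."""
    return price * (1_000_000 / per_tokens)


# Prices exactly as quoted in the cost bullet above.
self_hosted = per_million(0.0002, 1_000)   # $0.0002 per 1K tokens
gpt4o = per_million(5.0, 1_000_000)        # $5 per 1M tokens

# How many times more expensive the hosted API is per token.
ratio = gpt4o / self_hosted
print(f"self-hosted: ${self_hosted:.2f}/1M, GPT-4o: ${gpt4o:.2f}/1M, ratio: {ratio:.0f}x")
```

Keeping both prices in the same unit makes it harder to misplace a factor of 1,000 when quoting the difference.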
I should be honest about the drawbacks too: you won't get GPT-4o or Claude 3.5-level reasoning and multimodal capabilities. GPU servers add significant cost, and you're responsible for all maintenance.
## My Hardware Selection (18 Months of Testing)
I tested five VPS providers and settled on Hetzner's CPX31 (4 vCPU, 8GB RAM, 160GB NVMe, €6.15/month) as my primary inference node. The no-GPU config handles day-to-day inference for Llama 3 8B, Qwen 2.5, and similar mid-sized models just fine.
For running 70B+ models, I'd suggest GPU instances:
- Vultr GPU Cloud (NVIDIA A100, $2.50/hour, metered billing)
- RunPod (billed by the second, ideal for intermittent heavy workloads)
- Lambda Labs (pre-installed frameworks, ready to go)
Pure CPU inference limits to know: an M3 MacBook Pro runs Llama 3 70B at Q4 quantization smoothly, while a 4-core/8GB VPS can in theory run a 7B model at Q8 quantization, with average response times of 8–15 seconds.
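To get a feel for which model sizes fit on which machine, a rough rule of thumb is parameters times bits per weight divided by eight. The sketch below applies that rule; `weights_gb` is a hypothetical helper, and the nominal 8 and 4 bits per weight are assumptions. Real quantization formats store extra scaling data, so actual files come out somewhat larger, and the estimate ignores the KV cache and the operating system entirely.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    Counts only the weights: no KV cache, no runtime overhead.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bit widths are an assumption; real quant formats are slightly larger.
for label, params, bits in [
    ("7B at Q8", 7, 8),
    ("8B at Q4", 8, 4),
    ("70B at Q4", 70, 4),
]:
    print(f"{label}: ~{weights_gb(params, bits):.1f} GB of weights")
```

By this estimate a 7B model at Q8 already takes most of an 8GB machine, which is why the figure above is a theoretical limit rather than a comfortable configuration.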
## Step 1: Set Up Ubuntu 24.04 LTS Server
Ubuntu 24.04 LTS is the most recommended server OS in 2026, with official support through April 2029 and the latest OpenSSH and kernel security patches. I upgraded on release day in April 2024 and haven't had a single critical security incident since.
SSH into your VPS and update everything first:
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install core dependencies
sudo apt install -y curl wget git ufw fail2ban nginx certbot python3-certbot-nginx
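One pitfall here: Ubuntu 24.04 ships Python 3.12 but no pip. You need to add it manually:
sudo apt install -y python3-pip
# Ubuntu 24.04 marks the system Python as externally managed (PEP 668), so a bare
# "pip3 install docker-compose" is refused outside a virtual environment.
# Compose v2 is installed as the docker-compose-plugin package in Step 2 instead.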
## Step 2: Install Docker 29.x (Latest Stable)
Docker's latest stable release as of April 2026 is 29.4.0. The official Docker APT repository has a new signing key format on Ubuntu 24.04—you need the updated install procedure:
# Remove old versions if present
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do
sudo apt remove -y $pkg 2>/dev/null
done
# Install Docker GPG key (new path since 2024)
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add Docker repository (architecture-aware)
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Verify version
docker --version
# Should output: Docker version 29.4.0, build 8f3bd8225e
# Enable on boot and start now
sudo systemctl enable --now docker
sudo systemctl enable --now containerd
⚠️ Lesson learned: docker.io is the Ubuntu-maintained package, not Docker's own build, and it follows a different release schedule than docker-ce. Installing docker.io before docker-ce causes version conflicts. The fix is the exact sequence above: remove first, then install.
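If you provision more than one server, it can be handy to check the installed Docker version from a script instead of reading it by eye. The following is a minimal sketch: `parse_docker_version` and `installed_docker_version` are illustrative names, and the sample string simply reuses the expected output quoted in the verification step above.

```python
import re
import subprocess


def parse_docker_version(output: str) -> tuple:
    """Extract (major, minor, patch) from the output of `docker --version`."""
    match = re.search(r"Docker version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognised version string: {output!r}")
    return tuple(int(part) for part in match.groups())


def installed_docker_version() -> tuple:
    """Run `docker --version` on this machine and parse the result."""
    completed = subprocess.run(
        ["docker", "--version"], capture_output=True, text=True, check=True
    )
    return parse_docker_version(completed.stdout)


# Demonstration on the sample string from the verification step.
print(parse_docker_version("Docker version 29.4.0, build 8f3bd8225e"))
```

On a provisioned server, `installed_docker_version()[0] >= 29` would be one way to confirm that docker-ce, rather than an older distribution package, is the version on the PATH.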
## Step 3: Install and Configure Ollama (Open-Source AI Inference Engine)
Ollama is the most popular open-source local AI inference tool, supporting one-command deployment of Llama 3, Qwen, Mistral, and other leading open models. I've been running it for 14 months—stability and ease of use have been solid.
# Install Ollama (Linux one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# Verify it's running
systemctl status ollama
# Should show: active (running)
# Pull your first model (Llama 3.1 8B, quantized ~4.7GB)
ollama pull llama3.1:8b
# Quick test
ollama run llama3.1:8b "Hello, introduce yourself in one sentence"
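⚠️ Lesson learned: Ollama listens on localhost:11434 by default, so remote access is blocked by design. That default is exactly what the Nginx reverse proxy in Step 4 needs, because Nginx runs on the same machine and proxies to 127.0.0.1:11434. Set OLLAMA_HOST=0.0.0.0 only if another machine must reach Ollama directly, and keep port 11434 firewalled in that case. Environment variables such as OLLAMA_HOST and OLLAMA_MODELS go in a systemd override: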
sudo systemctl edit ollama
# Under [Service] add:
# Environment="OLLAMA_HOST=0.0.0.0"
# Environment="OLLAMA_MODELS=/mnt/data/ollama-models"
sudo systemctl daemon-reload && sudo systemctl restart ollama
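After a restart it is worth confirming that the service answers and that the pulled model is still visible, especially if you moved the model directory. Ollama's HTTP API lists local models at `/api/tags`; the sketch below parses that response. The function names are illustrative, and the sample payload is a hand-written stand-in with only the fields the parser needs, not real server output.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Return the model names from an Ollama /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_local_models(base_url: str = "http://127.0.0.1:11434") -> list:
    """Ask the local Ollama service which models it has (run this on the VPS)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return model_names(json.load(response))


# Hand-written sample payload for illustration only.
sample = {"models": [{"name": "llama3.1:8b"}]}
print(model_names(sample))
```

Calling `fetch_local_models()` on the server itself should include `llama3.1:8b` once the pull has finished; an empty list after changing `OLLAMA_MODELS` usually means the service is looking at a different directory than the one the models were pulled into.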
## Step 4: Configure Nginx Reverse Proxy (Critical Security Step)
Exposing Ollama's port 11434 directly to the internet is a serious security risk. I've seen cases of publicly exposed AI services being scanned and abused. Always use Nginx as a reverse proxy with HTTP Basic Auth protection.
# Create Nginx config
sudo nano /etc/nginx/sites-available/ollama-proxy
Write this configuration:
server {
    listen 80;
    server_name your-domain-or-IP;
    # Limit body size (AI prompts can be very long)
    client_max_body_size 10M;
    location / {
        # HTTP Basic Auth
        auth_basic "Ollama AI Portal";
        auth_basic_user_file /etc/nginx/.htpasswd;
        # Proxy to local Ollama
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection keep-alive;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
        # Ollama streams output, so increase the timeouts
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }
}
Generate the HTTP Basic Auth password file:
# apache2-utils provides htpasswd
sudo apt install -y apache2-utils
# Create user (replace your-username with actual username)
sudo htpasswd -c /etc/nginx/.htpasswd your-username
# Enter password twice when prompted
# Enable and test
sudo ln -s /etc/nginx/sites-available/ollama-proxy /etc/nginx/sites-enabled/
sudo nginx -t # Syntax check
sudo systemctl reload nginx
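It helps to know what HTTP Basic Auth actually puts on the wire, because it explains why the HTTPS step matters. The sketch below builds the `Authorization` header a client sends; `basic_auth_header` is an illustrative helper, and the credentials are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value used by HTTP Basic Auth."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Placeholder credentials; Base64 is an encoding, not encryption.
print(basic_auth_header("user", "pass"))
```

Since the token is trivially reversible, anyone who can read the traffic can recover the password, so Basic Auth over plain port 80 should only be a short-lived intermediate state before the certificate is issued.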
## Step 5: Request a Let's Encrypt Free SSL Certificate
# Install Certbot Nginx plugin
sudo apt install -y certbot python3-certbot-nginx
# Issue certificate (requires DNS pointing to this server's IP)
sudo certbot --nginx -d your-domain.com
# Test auto-renewal
sudo certbot renew --dry-run
⚠️ Lesson learned: Let's Encrypt certificates are valid for 90 days. Certbot adds a systemd timer for auto-renewal, but I've seen Ubuntu 24.04 have compatibility issues between systemd-timer and cron that break renewals. Always verify manually:
sudo systemctl status certbot.timer
# Should show: active (waiting)
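Checking the timer tells you renewal is scheduled, not that it succeeded. A complementary check is to read the expiry date from the certificate the server actually presents. The sketch below uses only the standard library; the function names are illustrative, and the dates in the demonstration are made-up values in the format that `getpeercert()` returns.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now_epoch: float) -> int:
    """Whole days between `now_epoch` and a certificate's notAfter timestamp."""
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return int((expires_epoch - now_epoch) // 86400)


def fetch_not_after(hostname: str, port: int = 443) -> str:
    """Connect over TLS and return the served certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


# Demonstration with made-up dates, no network access needed.
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2026 GMT")
print(days_until_expiry("Jan 31 00:00:00 2026 GMT", now))
```

In a cron job you might call `days_until_expiry(fetch_not_after("your-domain.com"), time.time())` and alert when the result drops below some threshold of your choosing, which catches a silently broken renewal well before the certificate lapses.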
## Step 6: Configure UFW Firewall (Principle of Least Privilege)
Follow least-privilege firewall rules—only open what's absolutely necessary:
# Check current status
sudo ufw status verbose
# Only allow SSH(22), HTTPS(443), HTTP(80)
# Port 11434 should NEVER be exposed!
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 'Nginx Full' # Covers both 80 and 443
# Enable firewall (only do this AFTER confirming SSH is allowed!)
sudo ufw enable
sudo ufw status numbered
Hard lesson: In early 2025, I accidentally deleted SSH rule #1 with ufw delete 1 and locked myself out of SSH. I had to recover through VNC rescue mode. Now I always open a second SSH session before modifying firewall rules.
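Firewall rules are easiest to trust once they have been tested from the outside. The sketch below is a small TCP probe you could run from a different machine; `is_port_open` is an illustrative helper, and it only reports whether a TCP connection succeeds, nothing more.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Run from outside the VPS, `is_port_open("your-domain.com", 443)` should be `True` and `is_port_open("your-domain.com", 11434)` should be `False`. Running it on the server itself proves little, since local connections to 127.0.0.1 do not pass through the same inbound path as internet traffic.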
## Step 7: Set Up Automatic Security Updates (Security Is Ongoing Work)
Security isn't a one-time configuration—it's continuous maintenance. I use unattended-upgrades for automatic security patches:
sudo apt install -y unattended-upgrades
sudo dpkg-reconfigure -plow unattended-upgrades
# Select "Yes" to enable
# Configure update frequency (default: daily check)
sudo nano /etc/apt/apt.conf.d/50unattended-upgrades
# Consider adjusting Updates-frequency to "ma
## Step 8: Performance Monitoring and Resource Limits
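Llama 3 8B inference uses approximately 4–6GB RAM, so an 8GB VPS has little headroom before the kernel's OOM killer steps in. Note that Ollama as installed in Step 3 runs as a native systemd service, not in a container, so Docker settings do not cap its memory; a systemd MemoryMax= line in the override file from Step 3 is the place for that. The Docker daemon config below applies to any containers you run alongside it:
# Edit Docker daemon config
sudo nano /etc/docker/daemon.json
This configuration rotates container logs and lifts the memlock ulimit. It does not set a memory limit by itself: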
{
"storage-driver": "overlay2",
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
},
"default-ulimits": {
"memlock": {
"Name": "memlock",
"Soft": -1,
"Hard": -1
}
}
}
sudo systemctl restart docker
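For the monitoring half of this step, a lightweight option is to read `MemAvailable` from `/proc/meminfo` before kicking off a batch job. The sketch below parses that file; the function names are illustrative, and the sample text is a made-up excerpt in the kernel's usual format rather than a reading from a real server.

```python
def mem_available_gb(meminfo_text: str) -> float:
    """Parse MemAvailable from /proc/meminfo content and return it in GB (1e9 bytes)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib * 1024 / 1e9
    raise ValueError("MemAvailable not found in meminfo text")


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the live meminfo file (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return handle.read()


# Made-up excerpt for illustration.
sample = "MemTotal:        8000000 kB\nMemAvailable:    6000000 kB\n"
print(f"{mem_available_gb(sample):.2f} GB available")
```

Comparing `mem_available_gb(read_meminfo())` against the 4–6GB that Llama 3 8B needs gives an early warning before a job is started on a box that is already short on memory.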
## Step 9: API Calls in Practice (Connect to Your AI Workflow)
Your private AI service is now accessible via HTTPS + Auth. Test with curl:
curl -X POST https://your-domain.com/api/generate \
-H "Content-Type: application/json" \
-u your-username:your-password \
-d '{
"model": "llama3.1:8b",
"prompt": "Summarize the core advantages of Docker in 5 bullet points",
"stream": false
}'
Calling from Python (ideal for automation scripts):
import requests
from requests.auth import HTTPBasicAuth
# Send a non-streaming generation request through the Nginx proxy
response = requests.post(
    "https://your-domain.com/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Summarize the core advantages of Docker in 5 bullet points",
        "stream": False,
    },
    auth=HTTPBasicAuth("your-username", "your-password"),
    timeout=60,
)
# Fail loudly on 401 (bad credentials) or 5xx (proxy or Ollama errors)
response.raise_for_status()
result = response.json()
print(result["response"])
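The examples above set `stream` to false and wait for the whole answer. With streaming enabled, Ollama instead sends one JSON object per line, each carrying a fragment in its `response` field, with `done` set to true on the last one. The sketch below assembles such a stream; `collect_stream` is an illustrative helper, and the sample lines are hand-written stand-ins that include only those two fields.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the text fragments of a newline-delimited JSON stream from /api/generate."""
    parts = []
    for raw in lines:
        if not raw:  # skip keep-alive blank lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written sample lines for illustration.
sample_lines = [
    b'{"response": "Hello", "done": false}',
    b'{"response": " world", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(sample_lines))
```

With `requests`, you would pass `stream=True` to `requests.post` and feed `response.iter_lines()` into `collect_stream`, or print each fragment as it arrives to show progress during slow CPU inference.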
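Integrating with LangChain is straightforward. Keep in mind that the endpoint sits behind Basic Auth, so the client also has to send credentials; recent langchain-ollama releases appear to accept HTTP client options through a client_kwargs parameter, but check the version you have installed: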
from langchain_ollama import OllamaLLM
llm = OllamaLLM(
model="llama3.1:8b",
base_url="https://your-domain.com",
callbacks=[...],
tags=["production", "vps-hosted"]
)
## Step 10: MiniMax Token Plan as a Complementary Layer
My actual workflow splits tasks: self-hosted Ollama handles 80% of daily inference and batch jobs (low cost, no rate limits), while MiniMax Token Plan covers the remaining 20%: complex reasoning and creative tasks that genuinely need GPT-4o-level capability. This combination keeps costs down without sacrificing the upper bound of quality.
👉 If you want top-tier AI capability without the premium price tag, MiniMax Token Plan is worth exploring:
👉 Get started: https://platform.minimaxi.com/subscribe/token-plan?code=E5yur9NOub&source=link
## Cost Comparison and When to Use Each
| Solution | Monthly Cost | Best For | Not Suitable For |
|---|---|---|---|
| Self-hosted Ollama (CPX31 €6.15/mo) | $7–15 (incl. traffic) | Daily inference, batch tasks, sensitive data | 70B+ models, complex multimodal |
| MiniMax Token Plan | Pay-as-you-go | Complex reasoning, creative generation, code | Ultra high-volume (>1M tokens/day) |
| OpenAI GPT-4o API | $15–500+ | Maximum quality requirements | Cost-sensitive, privacy-sensitive |
My actual split: Llama 3 8B handles 80% of daily tasks (technical docs, code review, data transformation); MiniMax handles the remaining 20% (long-form analysis, creative writing).
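The split described above can be expressed as a small routing rule. The sketch below is one hypothetical way to encode it: the task labels mirror the categories named in the text, while the function name, the label spelling, and the `sensitive` flag are illustrative assumptions rather than part of the author's actual setup.

```python
# Task categories taken from the split described above (label spelling is illustrative).
LOCAL_TASKS = {"technical_docs", "code_review", "data_transformation"}
HOSTED_TASKS = {"long_form_analysis", "creative_writing"}


def choose_backend(task_type: str, sensitive: bool = False) -> str:
    """Pick "ollama" (self-hosted) or "hosted" (third-party API) for a task."""
    # Sensitive data stays on the self-hosted node regardless of task type.
    if sensitive or task_type in LOCAL_TASKS:
        return "ollama"
    if task_type in HOSTED_TASKS:
        return "hosted"
    # Unknown task types default to the cheaper local model.
    return "ollama"


print(choose_backend("code_review"))
print(choose_backend("creative_writing"))
print(choose_backend("creative_writing", sensitive=True))
```

Defaulting unknown tasks to the local model keeps the cost and privacy properties as the baseline, with the hosted API as an explicit opt-in.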
## Common Questions FAQ
Q: Can an 8GB RAM VPS run Llama 3 70B?
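A: Only with Q4_K_M quantization (~40GB) and heavy swap use, since the model is roughly five times larger than the available RAM. Response times typically exceed 30 seconds, which makes for poor UX. Treat 16GB RAM as the bare minimum, and prefer a machine whose memory exceeds the model size.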
Q: Docker install fails with a signature error—what to do?
A: Ubuntu 24.04 changed apt repository signing formats. Follow the exact procedure in "Step 2"—remove old docker.io first, then add the new GPG key.
Q: Nginx proxy times out on API calls—how to fix?
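A: Check your proxy_read_timeout setting. Nginx defaults to 60 seconds, which may not be enough (the config in Step 4 raises it to 300s). The first request to a quantized Llama 8B takes 5–15 seconds, largely because the model has to be loaded into memory; subsequent calls reuse the loaded model and are faster. If timeouts persist, try reducing Ollama's context size (num_ctx) or keeping the model loaded with OLLAMA_KEEP_ALIVE, since a larger context makes each request slower, not faster.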
Q: How to back up Ollama models regularly?
A: Model files live in the directory set by OLLAMA_MODELS (default for the Linux systemd install: /usr/share/ollama/.ollama/models). Use restic or rsync for periodic backups. Models are typically 4–8GB and don't change often, so frequent backups aren't necessary.
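Because model files are large and rarely change, a checksum manifest is a cheap way to tell whether a backup is still current and intact. The sketch below hashes every file under a directory; `sha256_file` and `build_manifest` are illustrative helpers, independent of whichever backup tool you use.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte models never sit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        str(path.relative_to(root)): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Building a manifest for the model directory and another for the backup copy, then comparing the two dictionaries, shows whether anything is missing or corrupted without restoring the backup first.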
## Conclusion
Building a private AI inference platform isn't about replacing all third-party APIs—it's about giving developers another option. For me, it solved three real pain points: cost, privacy, and rate limits. 18 months of production use proves this approach works.
If you want to run high-volume inference cheaply, work on embedded AI, or handle sensitive data you can't upload to third parties, a self-hosted platform is worth the 3–5 hour setup investment. If you need the absolute best model capability, don't want server maintenance, or need strong multimodal features, sticking with commercial APIs like MiniMax makes more sense.
Action step: Start today with a Hetzner CPX31 (€6.15/month) minimal setup, try llama3.1:8b first, and your total starting cost is under $10.