GLM-5 Programming Model on Ollama
GLM-5 (from Z.ai) is one of the strongest open-source coding models available on Ollama right now: 744B total parameters, 40B active, and 77.8% on SWE-bench Verified. I ran the full installation on a Hetzner Ubuntu 22.04 server and documented every step.
First: Check If Your Hardware Makes the Cut
GLM-5 activates 40B parameters per token. With Q4 quantization, it needs about 24GB of GPU VRAM. Minimum requirements: an NVIDIA GPU with CUDA 12.1+ and ≥24GB VRAM, or an AMD GPU with ROCm 7+.
CPU-only is possible but painfully slow. For reference, qwen2.5-coder:7b running on CPU generates roughly 3-5 tokens/second, and GLM-5 would be slower still. Fine for demos, not for real work.
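The 24GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below is only an estimate: the function name and the 20% overhead allowance are my own assumptions, and it counts active parameters only, as the paragraph above does. An MoE runtime generally also has to keep the inactive experts somewhere, so the last line shows the same arithmetic over all 744B parameters for comparison.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# 40B active parameters at a nominal 4 bits per weight
active_only = weight_memory_gb(40, 4)

# Hypothetical 20% allowance for KV cache and runtime buffers (an assumption)
with_overhead = active_only * 1.2

# Same arithmetic over the full 744B parameter count, for comparison
all_parameters = weight_memory_gb(744, 4)

print(f"active weights: {active_only:.0f} GB")
print(f"active weights plus overhead: {with_overhead:.0f} GB")
print(f"all parameters: {all_parameters:.0f} GB")
```

Treat the active-only number as a lower bound and check the download size of the tag you actually pull.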
# Check NVIDIA driver and CUDA version
nvidia-smi
# Confirm CUDA version ≥12.1
nvcc --version
# Check VRAM
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
If nvidia-smi fails, install the driver first: sudo apt install nvidia-driver-550 then reboot.
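If you want to script the VRAM check, the CSV output of the query above is easy to parse. This is a minimal sketch: the function name, the sample text, and the 24GB threshold constant are illustrative assumptions, and the sample stands in for real nvidia-smi output.

```python
import csv
import io

MIN_VRAM_MIB = 24 * 1024  # the 24GB minimum from this guide, expressed in MiB


def parse_vram_csv(text: str) -> list[tuple[int, int]]:
    """Return (total_mib, free_mib) per GPU from nvidia-smi CSV output."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    gpus = []
    for row in rows[1:]:  # the first row is the header
        # Each cell looks like "24576 MiB"; keep only the number
        total, free = (int(cell.strip().split()[0]) for cell in row)
        gpus.append((total, free))
    return gpus


# Sample text shaped like the output of the nvidia-smi query above
sample = """memory.total [MiB], memory.free [MiB]
24576 MiB, 24100 MiB
"""

for index, (total, free) in enumerate(parse_vram_csv(sample)):
    verdict = "ok" if total >= MIN_VRAM_MIB else "below the 24GB minimum"
    print(f"GPU {index}: {total} MiB total, {free} MiB free -> {verdict}")
```

To use it on a real machine, pass in the captured output of the nvidia-smi command instead of the sample string.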
Install Ollama (Skip If Already Installed)
Ollama's install is a single command:
curl -fsSL https://ollama.com/install.sh | sh
Verify:
ollama --version
# Output like: ollama version is 0.5.4
Start the service (default port 11434):
ollama serve
# Or manage it with systemd (enable at boot and start it now):
sudo systemctl enable --now ollama
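To confirm the service is actually listening, you can query Ollama's version endpoint. A small sketch using only the standard library; the helper name is my own, and the base URL assumes the default port mentioned above.

```python
import json
import urllib.error
import urllib.request


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None


print(ollama_version() or "Ollama is not reachable on port 11434")
```

A None result usually means the service is not running or is bound to a different address.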
Download the GLM-5 Model
Pull and run GLM-5 on Ollama:
ollama run glm-5
This downloads roughly 24GB (Q4_K_M quantization). On Hetzner's 1Gbps port, it took about 3-5 minutes. If you don't have enough VRAM (e.g. RTX 3090's 24GB is tight), try a smaller variant:
# Inspect the Modelfile of the model you pulled
ollama show glm-5 --modelfile
# Try a smaller variant if one is published (the tag below is an example)
ollama run glm-5:4b # Much lower VRAM, noticeably weaker
Note: GLM-5 tags on Ollama's official library may change with updates. Check https://ollama.com/library/glm-5 for the latest available tags.
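The download time quoted above is consistent with simple link arithmetic. A quick sketch; the function name and the 70% efficiency figure are assumptions for illustration, not measurements.

```python
def download_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Transfer time for size_gb decimal gigabytes over a link_gbps link."""
    size_gigabits = size_gb * 8
    return size_gigabits / (link_gbps * efficiency)


ideal = download_seconds(24, 1.0)
# Hypothetical 70% effective throughput (an assumption, not a measurement)
realistic = download_seconds(24, 1.0, efficiency=0.7)

print(f"ideal: {ideal / 60:.1f} min, at 70% throughput: {realistic / 60:.1f} min")
```

Both results fall inside the 3-5 minute range, so a much slower pull points at the network or the registry rather than the server.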
First Run: Real Coding Test
Once downloaded, run it directly:
ollama run glm-5 "Implement an LRU cache with expiry support in Python"
Response speed depends on your GPU. On an NVIDIA A10G (24GB), the first token arrives in 2-3 seconds, and generation then sustains 15-30 tokens/second. With no network round trip, that can feel faster than calling the GPT-4o API.
Try a slightly harder task, fixing buggy code:
# Original (buggy)
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)  # no guard for negative n: fibonacci(-1) returns -1
# Expected: fibonacci(0) returns 0, and negative input is rejected
Send this to GLM-5. The model correctly identified the boundary condition bug and provided a fix. This is what 77.8% on SWE-bench Verified looks like in practice.
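For reference, here is one possible fix. It is my own sketch, not the model's actual output: it makes the boundary behavior explicit (fibonacci(0) returns 0, negative input is rejected) and iterates instead of recursing, which also avoids the exponential blow-up of the original.

```python
def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number, with fibonacci(0) == 0."""
    if not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    a, b = 0, 1
    for _ in range(n):
        # After i iterations, a holds fibonacci(i)
        a, b = b, a + b
    return a


print([fibonacci(i) for i in range(8)])
```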
The Pitfall I Hit: CUDA Version Mismatch
One error I encountered during setup:
Error: CUDA version mismatch. Ollama requires CUDA 12.1+, but found 11.4
Fix by upgrading CUDA Toolkit:
# Upgrade the CUDA Toolkit on Ubuntu 22.04 (assumes NVIDIA's CUDA apt repository is already configured)
sudo apt update
sudo apt install cuda-toolkit-12-4
export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
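Add those two exports to ~/.bashrc or ~/.profile to make them permanent. A cleaner approach is to run Ollama in Docker, which isolates the CUDA user-space libraries from the host (the host still needs the NVIDIA driver and the NVIDIA Container Toolkit for --gpus to work):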
docker run -d \
--gpus '"device=0"' \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama:latest
# Then run inside the container:
docker exec -it ollama ollama run glm-5
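Whichever fix you choose, it helps to check the toolkit version programmatically before retrying. The sketch below parses the release line that nvcc --version prints and compares it against the 12.1 minimum; the function name and the sample line are illustrative. Comparing integer tuples rather than strings matters, because "12.10" sorts before "12.2" as text.

```python
import re

REQUIRED = (12, 1)  # minimum CUDA version quoted in this guide


def parse_cuda_release(nvcc_output: str):
    """Extract (major, minor) from nvcc --version output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Sample line shaped like nvcc output on the machine that hit the error
sample = "Cuda compilation tools, release 11.4, V11.4.152"

found = parse_cuda_release(sample)
status = "ok" if found is not None and found >= REQUIRED else "upgrade needed"
print(found, status)
```

Note that nvcc reports the installed toolkit, while the CUDA version shown by nvidia-smi is the highest version the driver supports, so the two numbers can differ.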
GLM-5 vs Qwen2.5-Coder: Which One?
I compared GLM-5 and Qwen2.5-Coder 32B on several coding tasks:
| Dimension | GLM-5 | Qwen2.5-Coder 32B |
|---|---|---|
| SWE-bench | 77.8% (Verified) | ~65% (reference) |
| Multilingual | English + Chinese | 40+ languages |
| VRAM needed | 24GB (Q4) | 20GB (Q4) |
| Speed (A10G) | 15-25 tok/s | 20-35 tok/s |
| License | MIT | Apache 2.0 |
Bottom line: for English-Chinese bilingual coding tasks, GLM-5's 77.8% SWE-bench score is the clear advantage. For more exotic languages or tighter VRAM budgets, Qwen2.5-Coder is more flexible. The Ollama installation experience is similar for both.
Run GLM-5 as a Persistent API Service
For production deployment, expose GLM-5 as an API instead of using ollama run interactively:
# Start the API server
ollama serve &
# API call example
curl http://localhost:11434/api/generate -d '{
"model": "glm-5",
"prompt": "Explain what HTTP/3 QUIC protocol is",
"stream": false
}'
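The same call from Python, using only the standard library. This is a sketch under the same assumptions as the curl example (default port, model tag glm-5); the helper names are my own, and the non-streaming reply is read from the response field of the returned JSON.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "glm-5",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 300.0, **kwargs) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]


# Example call, commented out because it requires a running Ollama server:
# print(generate("Explain what HTTP/3 QUIC protocol is"))
```

The generous timeout is deliberate: with stream set to false, nothing comes back until the whole answer has been generated.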
Put Nginx in front as a reverse proxy and you get a GPT-4o-style API experience with all data staying local.
Bottom Line
The biggest value of running GLM-5 locally is data privacy and zero API costs. Code snippets never leave your server, which matters significantly when working with proprietary codebases. Ollama support for GLM-5 is solid now — much better than a year ago. Worth trying directly.