GLM-5 Programming Model on Ollama

Tags: GLM-5, Ollama, local deployment, open-source coding model, Z.ai

GLM-5 (from Z.ai) is one of the strongest open-source coding models available on Ollama right now — 744B total parameters, 40B active, scoring 77.8% on SWE-bench Verified. I ran the full installation on a Hetzner Ubuntu 22.04 server and documented every step I went through.

First: Check If Your Hardware Makes the Cut

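GLM-5 is a mixture-of-experts model with 40B active parameters per token. With Q4 quantization, this guide budgets about 24GB of GPU VRAM, a figure that matches the active parameters; the full 744B weights are much larger, so check the actual download size on the model page before committing to hardware. Minimum requirements: an NVIDIA GPU with CUDA 12.1+ and ≥24GB VRAM, or an AMD GPU with ROCm 7+.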
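A back-of-envelope check on these numbers, as a rough sketch: the size of the weights is approximately parameters × bits per weight ÷ 8. The helper name `weight_memory_gb` is my own, 4 bits per weight is an idealized Q4 (real Q4_K_M files run somewhat larger), and the estimate ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Parameter counts come from the article: 744B total, 40B active per token.
    for label, params in (("active (40B)", 40), ("total (744B)", 744)):
        print(f"{label}: ~{weight_memory_gb(params, 4):.0f} GB at 4 bits/weight")
```

The active-parameter figure works out to about 20 GB, which lines up with the ~24GB budget once overhead is added. The total-parameter figure works out to about 372 GB at the same precision, and a mixture-of-experts runtime generally needs every expert available, not only the active ones.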

CPU-only is possible but painfully slow. A qwen2.5-coder:7b running on CPU generates roughly 3-5 tokens/second, and GLM-5 would be slower still. Fine for demos, not for real work.

# Check NVIDIA driver and CUDA version
nvidia-smi
# Confirm the CUDA Toolkit version is ≥12.1 (nvcc is only present if the toolkit is installed)
nvcc --version

# Check VRAM
nvidia-smi --query-gpu=memory.total,memory.free --format=csv

If nvidia-smi fails, install the driver first with sudo apt install nvidia-driver-550, then reboot.
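If you want the VRAM check in a script, here is a minimal sketch that parses the CSV output of the query above. The function names and the threshold are my own choices; the threshold sits a little under 24 GiB because cards sold as 24GB may report slightly less than 24576 MiB.

```python
import subprocess

# Loose threshold (assumption): a bit under 24 GiB to tolerate reserved memory.
MIN_VRAM_MIB = 22 * 1024


def parse_vram_csv(csv_text: str) -> list[tuple[int, int]]:
    """Parse `nvidia-smi --query-gpu=memory.total,memory.free --format=csv` output.

    Returns one (total_mib, free_mib) tuple per GPU.
    """
    rows = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the CSV header row
        if not line.strip():
            continue
        # Each field looks like "24576 MiB"; keep only the number.
        total, free = (field.strip().split()[0] for field in line.split(","))
        rows.append((int(total), int(free)))
    return rows


def query_vram() -> list[tuple[int, int]]:
    """Run nvidia-smi and return per-GPU memory; raises if nvidia-smi is unavailable."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.free", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_csv(out)


def enough_vram(rows: list[tuple[int, int]], minimum_mib: int = MIN_VRAM_MIB) -> bool:
    """True if at least one GPU reports total memory at or above the threshold."""
    return any(total >= minimum_mib for total, _free in rows)
```

On the server, `enough_vram(query_vram())` gives a single yes/no answer you can use in a setup script.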

Install Ollama (Skip If Already Installed)

Ollama's install is a single command:

curl -fsSL https://ollama.com/install.sh | sh

Verify:

ollama --version
# Output like: ollama version 0.5.4

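On Linux the install script normally registers a systemd service, so the server may already be running. Otherwise, start it yourself (default port 11434):

ollama serve
# Or manage it with systemd (--now starts it immediately as well as enabling it at boot):
sudo systemctl enable --now ollama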
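To confirm the server is actually listening before pulling a large model, a small probe like the following can help. It is a sketch that assumes the default address http://localhost:11434 and uses Ollama's `/api/version` endpoint; the function names are mine.

```python
import json
import urllib.error
import urllib.request


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version JSON response body."""
    return json.loads(body)["version"]


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return the server version, or None if nothing answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all as "not running".
        return None
```

A return value of `None` means the service is not reachable, which is usually a sign that neither `ollama serve` nor the systemd unit is running.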

Download the GLM-5 Model

Pull and run GLM-5 on Ollama:

ollama run glm-5

This downloads roughly 24GB (Q4_K_M quantization). On Hetzner's 1Gbps port, it took about 3-5 minutes. If you don't have enough VRAM (e.g. RTX 3090's 24GB is tight), try a smaller variant:

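# List models already downloaded locally
ollama list

# Inspect the Modelfile of a pulled model (this does not list remote tags)
ollama show glm-5 --modelfile

# Try a smaller variant if one is published (the 4b tag is an example; confirm it exists first)
ollama run glm-5:4b  # Much lower VRAM, noticeably weaker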

Note: GLM-5 tags on Ollama's official library may change with updates. Check https://ollama.com/library/glm-5 for the latest available tags.
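To check from a script whether a given tag has already been pulled, you can read the server's `/api/tags` endpoint, which lists local models. This is a sketch under the assumption that the server runs on the default port; the helper names are hypothetical, and the endpoint only reports what is on disk, not what the remote library offers.

```python
import json
import urllib.request


def local_model_names(tags_payload: dict) -> list[str]:
    """Names (including tags) of locally available models from an /api/tags payload."""
    return [model["name"] for model in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` is present locally; a bare name matches any tag of that model."""
    for name in local_model_names(tags_payload):
        if name == wanted:
            return True
        if ":" not in wanted and name.split(":")[0] == wanted:
            return True
    return False


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of local models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

For example, `has_model(fetch_tags(), "glm-5")` tells you whether any GLM-5 tag is already downloaded.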

First Run: Real Coding Test

Once downloaded, run it directly:

ollama run glm-5 "Implement an LRU cache with expiry support in Python"
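To judge the model's answer to this prompt, it helps to have a baseline in mind. The following is a hand-written reference sketch, not GLM-5's output: an LRU cache built on `OrderedDict` where each entry also expires a fixed number of seconds after it is written. The class name and the injectable `clock` parameter are my own choices.

```python
import time
from collections import OrderedDict


class ExpiringLRUCache:
    """LRU cache whose entries also expire `ttl` seconds after being written."""

    def __init__(self, capacity: int, ttl: float, clock=time.monotonic):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._data = OrderedDict()  # key -> (value, expires_at)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # drop the stale entry lazily, on access
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, self._clock() + self.ttl)
        self._data.move_to_end(key)
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
```

Things worth checking in a model's answer against this baseline: whether reads refresh recency, whether expired entries are ever returned, and whether eviction picks the least recently used key.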

Response speed depends on your GPU. On an NVIDIA A10G (24GB), the first token arrived in 2-3 seconds, and generation then sustained 15-30 tokens/second. In my use that felt faster than GPT-4o API calls, since there is no network round trip.
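To measure these numbers on your own hardware, you can compute them from the timing fields Ollama returns with a non-streaming `/api/generate` response (`eval_count`, `eval_duration`, `load_duration`, `prompt_eval_duration`; durations are in nanoseconds). A minimal sketch, with function names of my own:

```python
def tokens_per_second(response: dict) -> float:
    """Generation throughput from a final /api/generate response."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def seconds_before_generation(response: dict) -> float:
    """Model load time plus prompt evaluation time, in seconds.

    This approximates the wait before the first generated token.
    """
    nanoseconds = response.get("load_duration", 0) + response.get("prompt_eval_duration", 0)
    return nanoseconds / 1e9
```

For interactive sessions, `ollama run glm-5 --verbose` prints similar timing statistics after each reply.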

Try a slightly harder task — fixing buggy code:

# Original (buggy)
def fibonacci(n):
    if n <= 1:
        return n  # handles n=0 and n=1, but also returns negative n unchanged
    return fibonacci(n-1) + fibonacci(n-2)
# Expected: fibonacci(0) returns 0; a negative n should be rejected, not echoed back

Send this to GLM-5. In my run, the model correctly identified the boundary-condition bug and provided a fix. It is a small example, but it gives a feel for the model behind the 77.8% SWE-bench Verified score.
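For reference, here is a hand-written version (not the model's output) that makes the boundary behavior explicit: it returns 0 for n = 0, rejects negative input, and runs in linear time instead of the exponential time of the naive recursion.

```python
def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number, with fibonacci(0) == 0."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b  # advance one step along the sequence
    return a
```

Comparing the model's fix against a version like this is a quick way to see whether it handled the edge cases or only rewrote the comment.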

The Pitfall I Hit: CUDA Version Mismatch

One error I encountered during setup:

Error: CUDA version mismatch. Ollama requires CUDA 12.1+, but found 11.4

Fix by upgrading CUDA Toolkit:

# Upgrade the CUDA Toolkit on Ubuntu 22.04 (requires NVIDIA's CUDA apt repository to be configured)
sudo apt update
sudo apt install cuda-toolkit-12-4
export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH

Add those two exports to ~/.bashrc or ~/.profile to make them permanent. A cleaner approach is to use Docker to isolate the CUDA environment (the host still needs the NVIDIA driver and the NVIDIA Container Toolkit for --gpus to work):

docker run -d \
  --gpus '"device=0"' \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:latest

# Then run inside the container:
docker exec -it ollama ollama run glm-5
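Whichever route you take, a quick pre-flight check can catch the version mismatch described in this section before Ollama does. The sketch below parses the "CUDA Version" field from the header of plain `nvidia-smi` output and compares it with the 12.1 minimum quoted in the error message. Note that this field reports the highest CUDA version the installed driver supports, not the toolkit version; the function names are my own.

```python
import re

REQUIRED_CUDA = (12, 1)  # minimum quoted in the error message above


def parse_cuda_version(nvidia_smi_output: str):
    """Return (major, minor) from nvidia-smi's header, or None if not found."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_requirement(version, required=REQUIRED_CUDA) -> bool:
    """True if a parsed version tuple is at least the required one."""
    return version is not None and version >= required
```

Feeding it the text of `nvidia-smi` (for example via `subprocess.run`) gives a yes/no answer you can act on before starting a large download.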

GLM-5 vs Qwen2.5-Coder: Which One?

I compared GLM-5 and Qwen2.5-Coder 32B on several coding tasks:

Dimension	GLM-5	Qwen2.5-Coder 32B
SWE-bench	77.8% (Verified)	~65% (reference)
Multilingual	English + Chinese	40+ languages
VRAM needed	24GB (Q4)	20GB (Q4)
Speed (A10G)	15-25 tok/s	20-35 tok/s
License	MIT	Apache 2.0

Bottom line: for English-Chinese bilingual coding tasks, GLM-5's 77.8% SWE-bench score is the clear advantage. For more exotic languages or tighter VRAM budgets, Qwen2.5-Coder is more flexible. The Ollama installation experience is similar for both.

Run GLM-5 as a Persistent API Service

For production deployment, expose GLM-5 over the HTTP API instead of using ollama run interactively:

# Start the API server (skip this if the systemd service is already running)
ollama serve &

# API call example
curl http://localhost:11434/api/generate -d '{
  "model": "glm-5",
  "prompt": "Explain what HTTP/3 QUIC protocol is",
  "stream": false
}'
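The same call from Python, using only the standard library, might look like the sketch below. It mirrors the curl request above (same endpoint, same three fields); the function names and the 300-second timeout are my own assumptions.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "glm-5",
                           base_url: str = "http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    # Generous timeout (assumption): large models can take a while to load.
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)["response"]
```

With the server running, `generate("Explain what HTTP/3 QUIC protocol is")` returns the model's reply as a string.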

Put Nginx in front as a reverse proxy and you get a hosted-style HTTP API, comparable in day-to-day use to calling GPT-4o, with all data staying local.

Bottom Line

The biggest value of running GLM-5 locally is data privacy and zero API costs. Code snippets never leave your server, which matters significantly when working with proprietary codebases. Ollama support for GLM-5 is solid now — much better than a year ago. Worth trying directly.

