Docker Containerized Deployment: GPU Configuration Errors and Solutions
This article contains affiliate links. I earn a small commission at no extra cost to you.
📌 This article was AI-assisted generated and human-reviewed | TechPassive — An AI-driven content testing site focused on real tool reviews
Running deep learning models inside Docker containers requires GPU access. Sounds simple, but the configuration has traps everywhere — it took me 3 full days to get the entire pipeline working. Here's everything I learned, with reproducible commands and verification steps.
---
Understand the Full Stack First
For a Docker container to reach a GPU, the request passes through six layers, and any one of them can fail:
Application layer (docker run --gpus)
→ Docker CLI (parses --gpus flag)
→ Docker Daemon (configures runtime)
→ nvidia-container-toolkit (translates Docker request to NVIDIA call)
→ NVIDIA Driver (kernel-level GPU driver)
→ Physical GPU (CUDA computation)
If any layer is missing or misconfigured, you get errors like could not select device driver.
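As a rough illustration of reading an error against this stack, the sketch below maps an error message to the layer that most likely produced it. The function name `guess_layer` and the `LAYER_HINTS` table are hypothetical, and the substrings are simple heuristics based on the kinds of errors quoted in this article, not an exhaustive list.

```python
# Hypothetical helper: guess which layer of the GPU stack produced an error.
# The substrings below are heuristics, not an official list of messages.
LAYER_HINTS = [
    ("could not select device driver",
     "nvidia-container-toolkit (not installed or not registered with Docker)"),
    ("nvml", "NVIDIA driver or /dev/nvidia* device nodes"),
    ("cuda", "CUDA image version vs. host driver version"),
]


def guess_layer(error_text: str) -> str:
    """Return a short description of the layer that most likely failed."""
    lowered = error_text.lower()
    # Order matters: more specific hints are checked first.
    for needle, layer in LAYER_HINTS:
        if needle in lowered:
            return layer
    return "unknown (run the layer-by-layer checks)"
```

A guess like this only narrows the search; the layer-by-layer verification later in the article is what confirms it.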
---
Trap 1: nvidia-container-toolkit Not Installed
**Symptom**: docker run --gpus all nvidia-smi fails with:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
Diagnostic:
# Check if toolkit is installed
which nvidia-ctk nvidia-container-runtime
nvidia-ctk --version
# If no output, it's not installed
Solution (Ubuntu/Debian):
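# Add the NVIDIA Container Toolkit repository (apt-key and the old nvidia-docker repo are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list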
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verification:
docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
If GPU info appears, you're good.
---
Trap 2: daemon.json Configuration Conflicts
Symptom: Toolkit installed but still failing, or Docker fails to start.
**Common mistake**: Adding "exec-opts": ["native.cgroupdriver=cgroupfs"] to /etc/docker/daemon.json, which conflicts with cgroup v2 systems.
Diagnostic:
cat /etc/docker/daemon.json
docker info | grep -i cgroup
Correct configuration (cgroup v2 systems, Ubuntu 22.04+ default):
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "args": []
    }
  }
}
For cgroup v1 systems that use the cgroupfs driver (Docker's default on cgroup v1, so the exec-opts line only makes it explicit):
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "args": []
    }
  }
}
Critical: After editing daemon.json, restart Docker:
sudo systemctl restart docker
Check if runtime is active:
docker info | grep -A10 Runtimes
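Because a malformed daemon.json can stop Docker from starting at all, it can help to lint the file before restarting the daemon. The following is a minimal sketch, assuming the only things worth flagging are invalid JSON, a missing nvidia runtime entry, and an explicit cgroupfs setting; `check_daemon_json` is a hypothetical helper, not part of Docker or the NVIDIA toolkit.

```python
import json


def check_daemon_json(text: str) -> list[str]:
    """Return human-readable warnings for the contents of a daemon.json file."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        # Docker refuses to start if the file is not valid JSON.
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    warnings = []
    runtimes = config.get("runtimes", {})
    if "nvidia" not in runtimes:
        warnings.append("no 'nvidia' entry under 'runtimes'")
    elif not runtimes["nvidia"].get("path"):
        warnings.append("'nvidia' runtime has no 'path'")

    # Flag the explicit cgroupfs setting so it gets a second look on cgroup v2 hosts.
    if "native.cgroupdriver=cgroupfs" in config.get("exec-opts", []):
        warnings.append("cgroupfs driver set explicitly; double-check on cgroup v2 hosts")
    return warnings
```

To use it, read /etc/docker/daemon.json into a string and pass it in; an empty list means none of these checks fired, not that the configuration is guaranteed to work.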
---
Trap 3: Runtime Priority Conflicts
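**Symptom**: The nvidia runtime is registered, but containers started without an explicit GPU request (for example by tools that never pass --gpus) cannot see the GPU.
**Problem**: Docker's default runtime is runc, so GPU support is only injected when a container asks for it.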
Solutions:
# Method 1: Use --gpus flag (recommended, most flexible)
docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
# Method 2: Set nvidia as default runtime
# Edit /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime"
    }
  }
}
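**Note**: Docker 19.03 added the --gpus flag, so prefer --gpus all. --runtime=nvidia still works when the nvidia runtime is registered in daemon.json; what is deprecated is the older nvidia-docker2 wrapper package.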
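If you do set nvidia as the default runtime, the edit should preserve whatever else is already in daemon.json. Here is a minimal sketch of that merge, assuming the configuration has already been loaded into a dict; `with_nvidia_default` is a hypothetical helper name.

```python
import copy


def with_nvidia_default(config: dict,
                        runtime_path: str = "nvidia-container-runtime") -> dict:
    """Return a copy of a daemon.json dict with nvidia set as the default runtime."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    runtimes = updated.setdefault("runtimes", {})
    # Keep an existing nvidia entry as-is; only add one if it is missing.
    runtimes.setdefault("nvidia", {"path": runtime_path})
    updated["default-runtime"] = "nvidia"
    return updated
```

Write the result back with json.dump and restart Docker afterwards. Working on a copy makes it easy to diff the old and new configuration before committing to the change.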
---
Trap 4: CUDA Version / Driver Version Mismatch
**Symptom**: nvidia-smi works fine on the host, but container reports CUDA incompatibility.
Common mismatches:
| Minimum Driver Version (Linux) | Supported CUDA Version | Example Image |
|---|---|---|
| >= 525.60.13 | CUDA 12.x | `nvidia/cuda:12.0.1-base-ubuntu22.04` |
| >= 450.80.02 | CUDA 11.x | `nvidia/cuda:11.8.0-base-ubuntu20.04` |
| >= 440.33 | CUDA 10.2 | `nvidia/cuda:10.2-base-ubuntu18.04` |
Diagnostic:
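# Check host driver version and the highest CUDA version it supports (same header line)
nvidia-smi | grep "Driver Version"
# Check the host CUDA toolkit version (only if the toolkit is installed; containers don't need it)
nvcc --version
# Check the CUDA version an image ships (base images don't include nvcc, so read the CUDA_VERSION variable the image sets)
docker run --rm nvidia/cuda:12.0.1-base-ubuntu22.04 printenv CUDA_VERSION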
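**Fix**: The safe choice is an image whose CUDA version is no higher than the "CUDA Version" shown in the host's nvidia-smi header. With a newer CUDA image on an older driver, the container typically refuses to start with an nvidia-container-cli requirement error such as unsatisfied condition: cuda>=12.6.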
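The comparison between host and image can be scripted. The sketch below is a conservative check under two assumptions: the host's nvidia-smi header contains a "CUDA Version: X.Y" field, and the image tag starts with the CUDA version, as the nvidia/cuda tags do. The function names are hypothetical, and NVIDIA's minor-version compatibility may allow some combinations this check rejects.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def host_cuda_version(nvidia_smi_output: str) -> Optional[Version]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def image_cuda_version(image: str) -> Optional[Version]:
    """Extract (major, minor) from a tag such as nvidia/cuda:11.8.0-base-ubuntu20.04."""
    match = re.search(r":(\d+)\.(\d+)", image)
    return (int(match.group(1)), int(match.group(2))) if match else None


def image_fits_host(nvidia_smi_output: str, image: str) -> Optional[bool]:
    """True if the image's CUDA version does not exceed the host's; None if unknown."""
    host = host_cuda_version(nvidia_smi_output)
    wanted = image_cuda_version(image)
    if host is None or wanted is None:
        return None
    return wanted <= host  # tuples compare major first, then minor
```

Feed it the captured output of nvidia-smi from the host together with the image name you plan to pull.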
---
Trap 5: GPU Not Visible Inside Container (Device Node Permission)
**Symptom**: nvidia-smi inside the container fails with Failed to initialize NVML: Insufficient Permissions.
**Cause**: GPU device nodes (/dev/nvidia*) have insufficient permissions.
Diagnostic:
# Check device nodes on host
ls -la /dev/nvidia*
# Check current user groups
groups
Solutions:
# Method 1: --privileged (testing only, not for production)
docker run --rm --gpus all --privileged nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
# Method 2: Pass the device nodes explicitly (recommended)
# --gpus all already injects nvidia-smi and the driver libraries, so no bind mount of the binary is needed
docker run --rm \
  --gpus all \
  --device=/dev/nvidia0:/dev/nvidia0 \
  --device=/dev/nvidia-uvm:/dev/nvidia-uvm \
  --device=/dev/nvidiactl:/dev/nvidiactl \
  nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
# Method 3: Ensure user is in correct groups (recommended for production)
sudo usermod -aG video $USER
sudo usermod -aG docker $USER
# Then re-login
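To read the output of ls -la /dev/nvidia* against your own user and groups, the standard Unix permission rule can be written down explicitly. This is a minimal sketch of that rule only: it ignores ACLs, capabilities, and container user namespaces, and `can_read_write` is a hypothetical helper.

```python
import stat


def can_read_write(mode: int, owner_uid: int, owner_gid: int,
                   uid: int, gids: set[int]) -> bool:
    """Apply the classic owner/group/other check for read+write access."""
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == owner_uid:
        needed = stat.S_IRUSR | stat.S_IWUSR
    elif owner_gid in gids:
        needed = stat.S_IRGRP | stat.S_IWGRP
    else:
        needed = stat.S_IROTH | stat.S_IWOTH
    return (mode & needed) == needed
```

On a real host, the arguments would come from os.stat("/dev/nvidia0") (st_mode, st_uid, st_gid) together with os.getuid() and set(os.getgroups()). If the check only passes once the video group is included, Method 3 is the relevant fix.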
---
Quick Verification: Layer-by-Layer
# 1. Host layer: NVIDIA Driver
nvidia-smi
# Expected: shows GPU model, temperature, memory
# 2. Container Toolkit layer
nvidia-ctk --version
# Expected: version number output
# 3. Docker runtime layer
docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
# Expected: GPU info displayed inside container
# 4. Application layer (PyTorch)
# The nvidia/cuda base image ships neither Python nor PyTorch, so use an image that does (the tag below is one example)
docker run --rm --gpus all pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime \
  python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None')"
# Expected: CUDA available: True, GPU model name
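The first three checks above can be chained so that the first failure names the layer to investigate. The sketch below assumes each check counts as passed when the command exits with status 0; `first_failing_layer`, `run_command`, and the `IMAGE` tag are illustrative choices, so adjust the tag to one that matches your driver.

```python
import subprocess
from typing import Callable, Optional

# Assumed image tag; replace with one that matches your driver.
IMAGE = "nvidia/cuda:12.0.1-base-ubuntu22.04"

# Ordered from the host outwards, mirroring the checks above.
CHECKS = [
    ("NVIDIA driver (host)", ["nvidia-smi"]),
    ("Container toolkit", ["nvidia-ctk", "--version"]),
    ("Docker runtime", ["docker", "run", "--rm", "--gpus", "all", IMAGE, "nvidia-smi"]),
]


def run_command(cmd: list[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=300).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False  # missing binary or hung command counts as a failure


def first_failing_layer(run: Callable[[list[str]], bool] = run_command) -> Optional[str]:
    """Return the name of the first layer whose check fails, or None if all pass."""
    for name, cmd in CHECKS:
        if not run(cmd):
            return name
    return None
```

Calling first_failing_layer() with no arguments runs the real commands. The run parameter exists so the ordering logic can be exercised without a GPU by passing a stub.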
---
Configuration Order Summary
1. Confirm host NVIDIA Driver works → nvidia-smi
2. Install nvidia-container-toolkit
3. Configure daemon.json (don't blindly add cgroupfs)
4. Restart Docker
5. Test with --gpus all (not --runtime=nvidia)
6. Match CUDA image version to driver version
When you hit an error, use the layer-by-layer verification above to pinpoint which component is failing — don't randomly reinstall things.