Getting GPU-Accelerated PaddlePaddle Working in Docker on QNAP NAS
If you’ve landed here, you’re probably staring at paddle device: cpu when you expected gpu:0, wondering why your NVIDIA GPU isn’t being recognized inside a Docker container on your QNAP NAS. This guide documents exactly what it takes to get full CUDA acceleration working — including PaddlePaddle, PaddleOCR, PaddleX, and onnxruntime-gpu — on a QNAP running QuTS hero h5.3.x.
This is not a simple setup. QNAP’s GPU driver architecture has some quirks that aren’t well documented, and the path to a working setup requires understanding a few key pieces. But once it’s running, it’s rock solid.
My Setup
- NAS: QNAP TS-h1277AXU-RP
- GPU: NVIDIA RTX 2000 Ada Generation (16GB VRAM)
- OS: QuTS hero h5.3.3
- Driver: 575.64.05, CUDA 12.9
- Container Runtime: Container Station (Docker)
The Problem: Why cuInit Returns Error 3
The root cause of almost every GPU acceleration failure on QNAP comes down to this error code: cuInit: 3 — which means CUDA_ERROR_NOT_INITIALIZED.
On a normal Linux system, the NVIDIA driver loads once and stays loaded. On QNAP, the GPU driver stack is split into two separate components that must both be running:
- NvKernelDriver: loads the kernel modules (nvidia.ko, nvidia-uvm.ko, etc.)
- NVIDIA_GPU_DRV: sets up the userspace libraries (libcuda.so, ldcache, symlinks)
The kernel modules load reliably at boot. The userspace libraries are the problem. If NVIDIA_GPU_DRV.sh start hasn’t run successfully, no CUDA context can be created — even if nvidia-smi works fine on the host.
There’s a second compounding issue: the root tmpfs on QuTS hero h5.3.x defaults to a small size (~442MB). The GPU needs to allocate kernel memory for its MMU fault buffer, and if the root filesystem doesn’t have enough headroom, every CUDA context attempt fails silently.
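To see which failure you are actually hitting, it can help to call cuInit directly and translate the return code. The sketch below does that with ctypes. The helper names (describe_cu_result, probe_cuinit) are made up for illustration, and the table covers only a handful of CUresult values from the CUDA driver API header.

```python
import ctypes

# A few CUresult values from the CUDA driver API header (cuda.h); not a complete list.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    2: "CUDA_ERROR_OUT_OF_MEMORY",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cu_result(code):
    """Map a numeric CUresult to its symbolic name where this sketch knows it."""
    return CU_RESULT_NAMES.get(code, f"unrecognized CUresult {code}")


def probe_cuinit(library="libcuda.so.1"):
    """Return the cuInit(0) result code, or None if the library cannot be loaded."""
    try:
        cuda = ctypes.CDLL(library)
    except OSError:
        return None
    return cuda.cuInit(0)


if __name__ == "__main__":
    code = probe_cuinit()
    if code is None:
        print("libcuda.so.1 could not be loaded (userspace driver not set up?)")
    else:
        print(f"cuInit: {code} ({describe_cu_result(code)})")
```

Running this inside the container and on the host separates "libcuda is missing" from "libcuda is present but cuInit fails", which are the two cases described above.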
The Fix: NvKernelDriver Start Script
The solution is to modify the NvKernelDriver start script to handle everything automatically on boot. On h5.3.3 the script lives at:
/share/ZFS1_DATA/.qpkg.local/NvKernelDriver/qpkg_NvKernelDriver.sh
Edit the start section to do the following in order:
1. Expand the root tmpfs
mount -o remount,size=800M /
This gives the kernel enough space to allocate GPU fault buffers. Without this, cuInit fails even with the driver fully loaded.
2. Load kernel modules
The standard insmod sequence for the open kernel modules:
insmod ${QPKG_ROOT}/nvidia.ko
insmod ${QPKG_ROOT}/nvidia-modeset.ko
insmod ${QPKG_ROOT}/nvidia-uvm.ko
3. Create the firmware symlink
The GPU firmware (gsp_ga10x.bin) is 74MB and needs to be accessible at /lib/firmware/nvidia. Create a symlink:
ln -sf ${QPKG_ROOT}/kernel-open/firmware/nvidia /lib/firmware/nvidia
4. Start the GPU userspace driver
sleep 2
/share/ZFS530_DATA/.qpkg/NVIDIA_GPU_DRV/NVIDIA_GPU_DRV.sh start
This step sets up libcuda.so, rebuilds the ldcache, and creates all the symlinks the container runtime needs to inject GPU libraries into containers.
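After the start script has run, a quick pre-flight check can confirm that the first and third steps took effect. This is only a sketch: the function names are hypothetical, the 800 MiB threshold simply mirrors the remount size above, and searching recursively for gsp_*.bin is an assumption about how the firmware directory is laid out.

```python
import glob
import os


def total_size_mib(path="/"):
    """Total size of the filesystem mounted at `path`, in MiB."""
    stats = os.statvfs(path)
    return stats.f_blocks * stats.f_frsize / (1024 * 1024)


def find_gsp_firmware(link="/lib/firmware/nvidia"):
    """Return GSP firmware files reachable through `link` (empty list if none)."""
    if not os.path.isdir(link):  # also False when the symlink is dangling
        return []
    pattern = os.path.join(link, "**", "gsp_*.bin")
    return sorted(glob.glob(pattern, recursive=True))


def preflight(min_root_mib=800, root="/", firmware_link="/lib/firmware/nvidia"):
    """Collect human-readable problems; an empty list means both checks passed."""
    problems = []
    size = total_size_mib(root)
    if size < min_root_mib:
        problems.append(f"root filesystem is {size:.0f} MiB, expected at least {min_root_mib}")
    if not find_gsp_firmware(firmware_link):
        problems.append(f"no GSP firmware found under {firmware_link}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

An empty result does not prove CUDA will initialize; it only rules out the two host-side causes this guide fixes in the start script.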
Container Station Runtime Configuration
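In /share/ZFS530_DATA/.qpkg/container-station/etc/nvidia-container-runtime/config.toml, ensure the following is set (in NVIDIA’s stock config layout this key sits in the [nvidia-container-cli] section):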
no-cgroups = true
This is required because QNAP’s container environment doesn’t use standard cgroup device access.
Docker Compose Configuration for GPU Containers
Your GPU container needs these settings:
runtime: nvidia-runtime
privileged: true
environment:
  NVIDIA_VISIBLE_DEVICES: all
  NVIDIA_DRIVER_CAPABILITIES: compute,utility
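privileged: true is required because, with no-cgroups = true, the NVIDIA runtime no longer sets up cgroup device rules for the GPU device nodes. Running the container privileged gives it access to those devices instead.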
Building a Working GPU Container Image
Base Image
Use NVIDIA’s official CUDA image with cuDNN:
FROM nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04
System Dependencies
PaddlePaddle and onnxruntime-gpu need several system libraries that aren’t in the base image:
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y \
python3 python3-pip \
libgl1 \
libglib2.0-0 \
libgomp1 \
libjpeg-dev \
zlib1g-dev \
libpng-dev \
libtiff-dev \
tesseract-ocr \
tesseract-ocr-eng && \
rm -rf /var/lib/apt/lists/*
Installing PaddlePaddle GPU
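Install from the Paddle index pinned to cu126. Use this even if you’re running CUDA 12.9: it’s the most recent stable GPU build, and a driver that supports CUDA 12.9 also runs applications built against 12.6. Ubuntu 24.04 marks its system Python as externally managed (PEP 668), so pip refuses to install into it unless you pass --break-system-packages, which is acceptable inside a single-purpose container:
RUN pip install --no-cache-dir --break-system-packages \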
paddlepaddle-gpu==3.0.0 \
-i https://www.paddlepaddle.org.cn/packages/stable/cu126/
Installing PaddleOCR / PaddleX and onnxruntime-gpu
RUN pip install --no-cache-dir --break-system-packages \
paddleocr==3.1.1 \
paddlex==3.1.1 \
onnxruntime-gpu \
rembg
Pinning Conflicting Dependencies
This is one of the less obvious parts. PaddleX pulls in langchain, and recent langchain versions conflict with PaddleX’s internal API calls. You need to pin these:
RUN pip install --no-cache-dir --break-system-packages \
PyYAML==6.0.2 \
langchain==0.3.28 \
langchain-core==0.3.83 \
langchain-text-splitters==0.3.11
Without these pins, PaddleX pipeline calls will fail at runtime with confusing import errors even though the packages appear installed correctly.
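Because a later pip install can quietly replace one of these versions, it may be worth checking the pins inside the built image. The sketch below uses the standard library's importlib.metadata; find_pin_mismatches and EXPECTED_PINS are illustrative names, and the version numbers are copied from the pins above.

```python
from importlib import metadata

# Versions copied from the pin list in this guide.
EXPECTED_PINS = {
    "PyYAML": "6.0.2",
    "langchain": "0.3.28",
    "langchain-core": "0.3.83",
    "langchain-text-splitters": "0.3.11",
}


def find_pin_mismatches(expected, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_pin_mismatches(EXPECTED_PINS).items():
        print(f"{name}: expected {EXPECTED_PINS[name]}, found {installed}")
```

Run it with python3 inside the container; no output means every pin is in place.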
The Critical LD_LIBRARY_PATH Fix
This is the most important piece that most guides miss. When PaddlePaddle GPU is installed via pip, it brings its own NVIDIA CUDA libraries as Python packages under nvidia/. These need to be on LD_LIBRARY_PATH or the GPU won’t initialize — the system CUDA libraries in the base image alone are not sufficient.
ENV LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cudnn/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cusolver/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cusparse/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_runtime/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_nvrtc/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/nvtx/lib
The symptom when this is missing is paddle device: cpu even though cuInit succeeds. Paddle loads but can’t find its required CUDA math libraries, so it silently falls back to CPU.
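The hard-coded list above depends on the Python version and on which nvidia packages pip actually installed. As a cross-check, the following sketch discovers every nvidia/*/lib directory under a site-packages path and prints the resulting LD_LIBRARY_PATH value. The function names are hypothetical, and the default path is the one used in this guide.

```python
import glob
import os

# Location used in this guide; adjust if your image uses another Python version.
DEFAULT_SITE_PACKAGES = "/usr/local/lib/python3.12/dist-packages"


def find_nvidia_lib_dirs(site_packages=DEFAULT_SITE_PACKAGES):
    """List every nvidia/<package>/lib directory that exists, sorted by path."""
    pattern = os.path.join(site_packages, "nvidia", "*", "lib")
    return sorted(path for path in glob.glob(pattern) if os.path.isdir(path))


def build_ld_library_path(site_packages=DEFAULT_SITE_PACKAGES, existing=""):
    """Join the discovered directories, keeping any existing value at the end."""
    parts = find_nvidia_lib_dirs(site_packages)
    if existing:
        parts.append(existing)
    return ":".join(parts)


if __name__ == "__main__":
    print(build_ld_library_path(existing=os.environ.get("LD_LIBRARY_PATH", "")))
```

The dynamic loader reads LD_LIBRARY_PATH when a process starts, so this is meant for comparing against the ENV line in the Dockerfile (or exporting the value in an entrypoint before Python launches), not for setting the variable from inside the already-running Paddle process.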
Complete Dockerfile
Putting it all together:
FROM nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04
WORKDIR /srv/app
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y \
python3 python3-pip \
build-essential \
libgl1 \
libglib2.0-0 \
libgomp1 \
libjpeg-dev \
zlib1g-dev \
libpng-dev \
libtiff-dev \
tesseract-ocr \
tesseract-ocr-eng \
pciutils && \
rm -rf /var/lib/apt/lists/*
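# Install PaddlePaddle GPU first from the Paddle index
# (--break-system-packages: Ubuntu 24.04's system Python is externally managed, PEP 668)
RUN pip install --no-cache-dir --break-system-packages \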
paddlepaddle-gpu==3.0.0 \
-i https://www.paddlepaddle.org.cn/packages/stable/cu126/
# Install PaddleOCR, PaddleX, onnxruntime-gpu
RUN pip install --no-cache-dir --break-system-packages \
paddleocr==3.1.1 \
paddlex==3.1.1 \
onnxruntime-gpu \
rembg
# Pin conflicting langchain/PyYAML versions pulled in by PaddleX
RUN pip install --no-cache-dir --break-system-packages \
PyYAML==6.0.2 \
langchain==0.3.28 \
langchain-core==0.3.83 \
langchain-text-splitters==0.3.11
# Critical: expose Paddle's bundled NVIDIA libraries
ENV LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cudnn/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cusolver/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cusparse/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_runtime/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_nvrtc/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/nvtx/lib
Handling CUDA Initialization Timing
On reboot, there’s a race between the NvKernelDriver finishing GPU initialization and your containers starting. The GPU driver can take 30–90 seconds to fully initialize after the kernel modules load.
Add a CUDA wait loop to your container startup command before launching your application:
for i in $(seq 1 30); do
python3 -c "import ctypes; c=ctypes.CDLL('libcuda.so.1'); r=c.cuInit(0); exit(0 if r==0 else 1)" \
&& echo "CUDA ready" && break \
|| echo "Waiting for CUDA attempt $i/30" && sleep 10
done
This retries every 10 seconds for up to 5 minutes, then proceeds regardless. On a clean boot you’ll typically see CUDA ready on the first or second attempt.
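If your entrypoint is already a Python script, the same wait can live there instead of in shell. The sketch below is one way to write it, with hypothetical names throughout. It runs the probe in a fresh interpreter for each attempt, mirroring the shell loop, because it is not obvious that a failed cuInit can be retried successfully within one process.

```python
import subprocess
import sys
import time

# Same check as the shell loop: exit status 0 only when cuInit(0) returns CUDA_SUCCESS.
PROBE_SNIPPET = (
    "import ctypes, sys; "
    "sys.exit(0 if ctypes.CDLL('libcuda.so.1').cuInit(0) == 0 else 1)"
)


def cuda_ready():
    """Run the cuInit probe in a separate Python process and report success."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE_SNIPPET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_cuda(attempts=30, delay_seconds=10, probe=cuda_ready, sleep=time.sleep):
    """Poll until the probe succeeds; return False if every attempt fails."""
    for attempt in range(1, attempts + 1):
        if probe():
            print("CUDA ready")
            return True
        print(f"Waiting for CUDA attempt {attempt}/{attempts}")
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

Calling wait_for_cuda() at the top of the entrypoint and then starting the application regardless of the result reproduces the shell behaviour; the boolean return value lets you log or exit instead if you prefer to fail fast.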
Verifying Everything Works
After a clean reboot, verify with:
# Check kernel modules loaded
lsmod | grep nvidia
# Check driver version
cat /proc/driver/nvidia/version
# Test inside your container
docker exec your-container python3 -c "
import paddle
import onnxruntime as ort
print('paddle device:', paddle.device.get_device())
print('onnxruntime:', ort.get_available_providers())
"
Expected output:
paddle device: gpu:0
onnxruntime: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
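For an automated health check, the same two values can be turned into a pass/fail result instead of being read by eye. This is a minimal sketch with made-up function names; it relies only on paddle.device.get_device() and onnxruntime.get_available_providers(), the two calls already used above, and imports both libraries lazily so the comparison logic can be exercised without them.

```python
def gpu_stack_problems(paddle_device, ort_providers):
    """Compare reported values against the expected output shown above."""
    problems = []
    if not paddle_device.startswith("gpu"):
        problems.append(f"paddle is using {paddle_device!r}, expected gpu:0")
    if "CUDAExecutionProvider" not in ort_providers:
        problems.append("onnxruntime has no CUDAExecutionProvider")
    return problems


def main():
    """Query the installed libraries and return a process exit code."""
    import paddle  # imported here so the helper above works without the GPU stack
    import onnxruntime as ort

    problems = gpu_stack_problems(
        paddle.device.get_device(), ort.get_available_providers()
    )
    for problem in problems:
        print("FAIL:", problem)
    return 1 if problems else 0
```

Saved as a script with a final line of raise SystemExit(main()), it exits non-zero whenever either library has fallen back to CPU, which makes it usable as a container healthcheck command.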
Mid-Session Container Restarts
If you restart a container without rebooting the NAS, you’ll need to manually reinitialize the GPU userspace driver:
sudo /share/ZFS530_DATA/.qpkg/NVIDIA_GPU_DRV/NVIDIA_GPU_DRV.sh start
docker restart your-gpu-container
This is needed because the GPU userspace state can go stale after a container restart. A full NAS reboot remains the only reliable way to get a completely clean initialization; after a reboot, the CUDA wait loop in your container startup handles the timing automatically.
Common Pitfalls
cuInit returns 3 after driver start: The root tmpfs hasn’t been expanded. Run mount -o remount,size=800M / and restart your container.
paddle device: cpu despite CUDA ready: Either the container started before the GPU driver had fully initialized this session (restart the container), or Paddle’s bundled NVIDIA libraries are missing from LD_LIBRARY_PATH (see The Critical LD_LIBRARY_PATH Fix).
nvidia-smi works but CUDA fails: nvidia-smi uses the management library (libnvidia-ml), which doesn’t require a CUDA context, so a cuInit failure never shows up in nvidia-smi output.
Firmware symlink missing: If you see GSP firmware errors in dmesg, the symlink at /lib/firmware/nvidia is missing. It’s recreated on each boot by the NvKernelDriver script, but verify with ls /lib/firmware/nvidia.
Container runtime ldcache stale: If you upgrade your GPU driver QPKG, the ldcache at /share/ZFS530_DATA/.qpkg/container-station/opt/nvidia/etc/ld.so.cache may point to old library versions. A reboot after the driver upgrade resolves this.
Summary
Getting GPU acceleration working on QNAP requires understanding that it’s a two-part driver system, that the userspace component needs explicit initialization, and that the root tmpfs needs more headroom than it gets by default. Once those three things are handled in the NvKernelDriver start script, the system boots cleanly and CUDA works reliably without any manual intervention.
It took a lot of troubleshooting to get here — hopefully this saves you the same journey.