Getting GPU-Accelerated PaddlePaddle Working in Docker on QNAP NAS

mjellybaby · April 30, 2026, 7:03pm

If you’ve landed here, you’re probably staring at paddle device: cpu when you expected gpu:0, wondering why your NVIDIA GPU isn’t being recognized inside a Docker container on your QNAP NAS. This guide documents exactly what it takes to get full CUDA acceleration working — including PaddlePaddle, PaddleOCR, PaddleX, and onnxruntime-gpu — on a QNAP running QuTS hero h5.3.x.

This is not a simple setup. QNAP’s GPU driver architecture has some quirks that aren’t well documented, and the path to a working setup requires understanding a few key pieces. But once it’s running, it’s rock solid.

My Setup

NAS: QNAP TS-h1277AXU-RP
GPU: NVIDIA RTX 2000 Ada Generation (16GB VRAM)
OS: QuTS hero h5.3.3
Driver: 575.64.05, CUDA 12.9
Container Runtime: Container Station (Docker)

The Problem: Why `cuInit` Returns Error 3

The root cause of almost every GPU acceleration failure on QNAP comes down to this error code: cuInit: 3 — which means CUDA_ERROR_NOT_INITIALIZED.

On a normal Linux system, the NVIDIA driver loads once and stays loaded. On QNAP, the GPU driver stack is split into two separate components that must both be running:

NvKernelDriver — loads the kernel modules (nvidia.ko, nvidia-uvm.ko, etc.)
NVIDIA_GPU_DRV — sets up the userspace libraries (libcuda.so, ldcache, symlinks)

The kernel modules load reliably at boot. The userspace libraries are the problem. If NVIDIA_GPU_DRV.sh start hasn’t run successfully, no CUDA context can be created — even if nvidia-smi works fine on the host.

There’s a second compounding issue: the root tmpfs on QuTS hero h5.3.x defaults to a small size (~442MB). The GPU needs to allocate kernel memory for its MMU fault buffer, and if the root filesystem doesn’t have enough headroom, every CUDA context attempt fails silently.
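To see which state you are in, it can help to call cuInit directly and translate the numeric result. The sketch below is a minimal probe using ctypes; the helper names (describe_cu_result, probe_cuinit) are made up for illustration, and the table only covers a handful of CUresult values from the CUDA driver API.

```python
import ctypes

# Small subset of CUresult values from the CUDA driver API (cuda.h).
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    1: "CUDA_ERROR_INVALID_VALUE",
    2: "CUDA_ERROR_OUT_OF_MEMORY",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cu_result(code):
    """Map a numeric CUresult to its symbolic name, if it is in the table."""
    return CU_RESULT_NAMES.get(code, f"unrecognized CUresult {code}")


def probe_cuinit(library="libcuda.so.1"):
    """Call cuInit(0); return its result code, or None if libcuda cannot be loaded."""
    try:
        cuda = ctypes.CDLL(library)
    except OSError:
        return None  # the userspace driver library is not visible at all
    return int(cuda.cuInit(0))


result = probe_cuinit()
if result is None:
    print("libcuda.so.1 not found: the userspace driver is not set up")
else:
    print(f"cuInit returned {result}: {describe_cu_result(result)}")
```

Running this on the host and again inside the container shows whether the failure is on the driver side or in the container's library injection.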

The Fix: NvKernelDriver Start Script

The solution is to modify the NvKernelDriver start script to handle everything automatically on boot. On h5.3.3 the script lives at:

/share/ZFS1_DATA/.qpkg.local/NvKernelDriver/qpkg_NvKernelDriver.sh

Edit the start section to do the following in order:

1. Expand the root tmpfs

mount -o remount,size=800M /

This gives the kernel enough space to allocate GPU fault buffers. Without this, cuInit fails even with the driver fully loaded.
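A quick way to confirm the remount took effect is to read the filesystem size back. This is a small sketch using os.statvfs; the 800 MiB threshold is simply the value from the command above, and the function names are illustrative.

```python
import os

TARGET_MIB = 800  # the size used in the remount command in this guide


def filesystem_size_mib(path="/"):
    """Total size of the filesystem containing `path`, in MiB."""
    stats = os.statvfs(path)
    return stats.f_frsize * stats.f_blocks / (1024 * 1024)


def needs_expansion(size_mib, target_mib=TARGET_MIB):
    """True if the filesystem is smaller than the target size."""
    return size_mib < target_mib


size = filesystem_size_mib("/")
print(f"root filesystem: {size:.0f} MiB, needs expansion: {needs_expansion(size)}")
```

Run it on the NAS host, not in a container, since a container's / is a different filesystem.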

2. Load kernel modules

The standard insmod sequence for the open kernel modules:

insmod ${QPKG_ROOT}/nvidia.ko
insmod ${QPKG_ROOT}/nvidia-modeset.ko
insmod ${QPKG_ROOT}/nvidia-uvm.ko

3. Create the firmware symlink

The GPU firmware (gsp_ga10x.bin) is 74MB and needs to be accessible at /lib/firmware/nvidia. Create a symlink:

# -n replaces an existing symlink instead of creating a nested link inside it
ln -sfn ${QPKG_ROOT}/kernel-open/firmware/nvidia /lib/firmware/nvidia
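To check that the link resolves to real firmware files, something like the following can be run on the host. It is a sketch only: list_gsp_firmware is a made-up helper, and it searches recursively in case the firmware files sit in a subdirectory on your system.

```python
import os


def list_gsp_firmware(firmware_dir="/lib/firmware/nvidia"):
    """Return paths of gsp_*.bin files under firmware_dir, following symlinks."""
    found = []
    for root, _dirs, files in os.walk(firmware_dir, followlinks=True):
        for name in files:
            if name.startswith("gsp_") and name.endswith(".bin"):
                found.append(os.path.join(root, name))
    return sorted(found)


firmware = list_gsp_firmware()
print(firmware if firmware else "no GSP firmware found under /lib/firmware/nvidia")
```

An empty result means the symlink is missing or dangling, which matches the GSP firmware errors you would see in dmesg.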

4. Start the GPU userspace driver

sleep 2
/share/ZFS530_DATA/.qpkg/NVIDIA_GPU_DRV/NVIDIA_GPU_DRV.sh start

This step sets up libcuda.so, rebuilds the ldcache, and creates all the symlinks the container runtime needs to inject GPU libraries into containers.

Container Station Runtime Configuration

In /share/ZFS530_DATA/.qpkg/container-station/etc/nvidia-container-runtime/config.toml, ensure:

no-cgroups = true

This is required because QNAP’s container environment doesn’t use standard cgroup device access.

Docker Compose Configuration for GPU Containers

Your GPU container needs these settings:

runtime: nvidia-runtime
privileged: true
environment:
  NVIDIA_VISIBLE_DEVICES: all
  NVIDIA_DRIVER_CAPABILITIES: compute,utility

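The privileged: true setting is required because, with no-cgroups = true, the NVIDIA runtime no longer adds device cgroup rules for the container; privileged mode is what lets the container open the GPU device nodes.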

Building a Working GPU Container Image

Base Image

Use NVIDIA’s official CUDA image with cuDNN:

FROM nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04

System Dependencies

PaddlePaddle and onnxruntime-gpu need several system libraries that aren’t in the base image:

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
        python3 python3-pip \
        libgl1 \
        libglib2.0-0 \
        libgomp1 \
        libjpeg-dev \
        zlib1g-dev \
        libpng-dev \
        libtiff-dev \
        tesseract-ocr \
        tesseract-ocr-eng && \
    rm -rf /var/lib/apt/lists/*

Installing PaddlePaddle GPU

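Install from the Paddle index pinned to cu126. Use this even if your driver reports CUDA 12.9: a 12.9 driver runs binaries built against CUDA 12.6, and at the time of writing cu126 was the most recent stable GPU build.

# Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
# so system-wide pip installs need --break-system-packages
RUN pip install --no-cache-dir --break-system-packages \
    paddlepaddle-gpu==3.0.0 \
    -i https://www.paddlepaddle.org.cn/packages/stable/cu126/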

Installing PaddleOCR / PaddleX and onnxruntime-gpu

RUN pip install --no-cache-dir --break-system-packages \
    paddleocr==3.1.1 \
    paddlex==3.1.1 \
    onnxruntime-gpu \
    rembg

Pinning Conflicting Dependencies

This is one of the less obvious parts. PaddleX pulls in langchain, and recent langchain versions conflict with PaddleX’s internal API calls. You need to pin these:

RUN pip install --no-cache-dir --break-system-packages \
    PyYAML==6.0.2 \
    langchain==0.3.28 \
    langchain-core==0.3.83 \
    langchain-text-splitters==0.3.11

Without these pins, PaddleX pipeline calls will fail at runtime with confusing import errors even though the packages appear installed correctly.
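Because a later pip install can quietly move these versions again, it may be worth checking the pins inside the finished image. The sketch below compares installed versions against the pins listed above using importlib.metadata; find_pin_mismatches is an illustrative name, not part of any package.

```python
from importlib import metadata

# The pins from this guide.
PINS = {
    "PyYAML": "6.0.2",
    "langchain": "0.3.28",
    "langchain-core": "0.3.83",
    "langchain-text-splitters": "0.3.11",
}


def find_pin_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for each unsatisfied pin."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


print(find_pin_mismatches(PINS) or "all pins satisfied")
```

An empty result means every pin holds; anything else lists what was expected next to what is actually installed.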

The Critical LD_LIBRARY_PATH Fix

This is the most important piece that most guides miss. When PaddlePaddle GPU is installed via pip, it brings its own NVIDIA CUDA libraries as Python packages under nvidia/. These need to be on LD_LIBRARY_PATH or Paddle won’t use the GPU; the system CUDA libraries in the base image alone are not sufficient. The paths below assume Python 3.12, the version Ubuntu 24.04 ships.

ENV LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cudnn/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cusolver/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cusparse/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_runtime/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_nvrtc/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/nvtx/lib

The symptom when this is missing is paddle device: cpu even though cuInit succeeds. Paddle loads but can’t find its required CUDA math libraries so silently falls back to CPU.
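If the Python version or the set of bundled packages changes, the hard-coded list goes stale. One possible alternative is to discover the directories at build or start time. This is a sketch under the assumption that the wheels land in the dist-packages path used above; the function names are illustrative.

```python
import glob
import os

# Path taken from the ENV line in this guide; adjust for other Python versions.
SITE_PACKAGES = "/usr/local/lib/python3.12/dist-packages"


def nvidia_lib_dirs(site_packages):
    """Return every nvidia/<component>/lib directory under site_packages, sorted."""
    pattern = os.path.join(site_packages, "nvidia", "*", "lib")
    return sorted(path for path in glob.glob(pattern) if os.path.isdir(path))


def build_ld_library_path(site_packages, existing=""):
    """Join the discovered directories, keeping any existing value at the end."""
    parts = nvidia_lib_dirs(site_packages)
    if existing:
        parts.append(existing)
    return ":".join(parts)


print(build_ld_library_path(SITE_PACKAGES, os.environ.get("LD_LIBRARY_PATH", "")))
```

The dynamic loader reads LD_LIBRARY_PATH when a process starts, so the printed value has to be exported by an entrypoint script before Python launches your application; setting it from inside the already-running interpreter is too late.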

Complete Dockerfile

Putting it all together:

FROM nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04

WORKDIR /srv/app

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
        python3 python3-pip \
        build-essential \
        libgl1 \
        libglib2.0-0 \
        libgomp1 \
        libjpeg-dev \
        zlib1g-dev \
        libpng-dev \
        libtiff-dev \
        tesseract-ocr \
        tesseract-ocr-eng \
        pciutils && \
    rm -rf /var/lib/apt/lists/*

# Install PaddlePaddle GPU first from the Paddle index.
# Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
# so system-wide pip installs need --break-system-packages.
RUN pip install --no-cache-dir --break-system-packages \
    paddlepaddle-gpu==3.0.0 \
    -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

# Install PaddleOCR, PaddleX, onnxruntime-gpu
RUN pip install --no-cache-dir --break-system-packages \
    paddleocr==3.1.1 \
    paddlex==3.1.1 \
    onnxruntime-gpu \
    rembg

# Pin conflicting langchain/PyYAML versions pulled in by PaddleX
RUN pip install --no-cache-dir --break-system-packages \
    PyYAML==6.0.2 \
    langchain==0.3.28 \
    langchain-core==0.3.83 \
    langchain-text-splitters==0.3.11

# Critical: expose Paddle's bundled NVIDIA libraries
ENV LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cudnn/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cublas/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cusolver/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cusparse/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_runtime/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/cuda_nvrtc/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib:\
/usr/local/lib/python3.12/dist-packages/nvidia/nvtx/lib

Handling CUDA Initialization Timing

On reboot, there’s a race between the NvKernelDriver finishing GPU initialization and your containers starting. The GPU driver can take 30–90 seconds to fully initialize after the kernel modules load.

Add a CUDA wait loop to your container startup command before launching your application:

for i in $(seq 1 30); do
    if python3 -c "import ctypes, sys; sys.exit(0 if ctypes.CDLL('libcuda.so.1').cuInit(0) == 0 else 1)"; then
        echo "CUDA ready"
        break
    fi
    echo "Waiting for CUDA, attempt $i/30"
    sleep 10
done

This retries every 10 seconds for up to 5 minutes, then proceeds regardless. On a clean boot you’ll typically see CUDA ready on the first or second attempt.
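If your entrypoint is already a Python script, the same wait can live there. The sketch below keeps the retry logic separate from the probe so it is easy to test; wait_for and cuda_ready are illustrative names. The probe runs in a fresh interpreter on each attempt, mirroring the shell loop, rather than calling cuInit repeatedly in one process.

```python
import subprocess
import sys
import time

# One-liner run in a child interpreter: exit 0 only if cuInit(0) succeeds.
PROBE = (
    "import ctypes, sys; "
    "sys.exit(0 if ctypes.CDLL('libcuda.so.1').cuInit(0) == 0 else 1)"
)


def cuda_ready():
    """Run the cuInit check in a fresh interpreter, as the shell loop does."""
    completed = subprocess.run([sys.executable, "-c", PROBE], capture_output=True)
    return completed.returncode == 0


def wait_for(probe, attempts=30, delay_seconds=10, sleep=time.sleep):
    """Call probe() until it returns True or attempts run out.

    Returns the 1-based attempt number that succeeded, or None on timeout.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)
    return None
```

Calling wait_for(cuda_ready) at the top of the entrypoint blocks until CUDA is usable or the attempts are exhausted; unlike the shell loop, the None return value lets you decide whether to continue on CPU or exit so the container restarts.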

Verifying Everything Works

After a clean reboot, verify with:

# Check kernel modules loaded
lsmod | grep nvidia

# Check driver version
cat /proc/driver/nvidia/version

# Test inside your container
docker exec your-container python3 -c "
import paddle
import onnxruntime as ort
print('paddle device:', paddle.device.get_device())
print('onnxruntime:', ort.get_available_providers())
"

Expected output:

paddle device: gpu:0
onnxruntime: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
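One caveat: get_available_providers() reports the providers the installed onnxruntime package was built with, so the list alone does not prove a session actually runs on the GPU. A slightly stronger check, sketched below, runs a small computation in Paddle and inspects the providers of a real ONNX Runtime session. The function names are illustrative and model_path is a placeholder for any ONNX model you already have.

```python
def paddle_gpu_smoke_test():
    """Run a 2x2 matmul on gpu:0 and return the result as nested lists."""
    import paddle  # imported lazily so this file loads without Paddle installed

    paddle.set_device("gpu:0")
    x = paddle.ones([2, 2], dtype="float32")
    return paddle.matmul(x, x).numpy().tolist()


def ort_session_providers(model_path):
    """Create a session that prefers CUDA and return the providers it ended up with."""
    import onnxruntime as ort

    session = ort.InferenceSession(
        model_path,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    return session.get_providers()


def cuda_active(session_providers):
    """True if the session is using the CUDA execution provider."""
    return "CUDAExecutionProvider" in session_providers
```

Inside the container, paddle_gpu_smoke_test() should return a matrix of 2.0 values without raising, and cuda_active(ort_session_providers(model_path)) should be True; if ONNX Runtime cannot load its CUDA libraries, the session falls back to the CPU provider only.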

Mid-Session Container Restarts

If you restart a container without rebooting the NAS, you’ll need to manually reinitialize the GPU userspace driver:

sudo /share/ZFS530_DATA/.qpkg/NVIDIA_GPU_DRV/NVIDIA_GPU_DRV.sh start
docker restart your-gpu-container

This is because the GPU userspace state can become stale after container restarts. A full NAS reboot is the only reliable way to get a completely clean initialization; after a reboot, the CUDA wait loop in your container startup handles the timing automatically.

Common Pitfalls

cuInit: 3 after driver start — The root tmpfs hasn’t been expanded. Run mount -o remount,size=800M / and restart your container.

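paddle device: cpu despite CUDA ready — Either the container started before the GPU driver fully initialized this session, in which case restarting the container fixes it, or Paddle’s bundled NVIDIA libraries are missing from LD_LIBRARY_PATH (see The Critical LD_LIBRARY_PATH Fix above).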

nvidia-smi works but CUDA fails — nvidia-smi uses the management library (libnvidia-ml), which doesn’t require a CUDA context, so a cuInit failure never shows up in nvidia-smi output.

Firmware symlink missing — If you see GSP firmware errors in dmesg, the symlink at /lib/firmware/nvidia is missing. It’s recreated on each boot by the NvKernelDriver script, but verify with ls /lib/firmware/nvidia.

Container runtime ldcache stale — If you upgrade your GPU driver QPKG, the ldcache at /share/ZFS530_DATA/.qpkg/container-station/opt/nvidia/etc/ld.so.cache may point to old library versions. A reboot after the driver upgrade resolves this.
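After a driver QPKG upgrade it helps to confirm which version the kernel module actually reports, since that is what the userspace libraries have to match. The sketch below parses /proc/driver/nvidia/version; the function names are illustrative and the sample line in the code is only an example of the general format, not output captured from this NAS.

```python
import re


def parse_nvrm_version(text):
    """Extract the driver version from the 'NVRM version:' line, or None."""
    match = re.search(r"NVRM version:.*?\b(\d+\.\d+(?:\.\d+)?)\b", text)
    return match.group(1) if match else None


def driver_branch(version):
    """Major driver branch as an integer, e.g. 575 for '575.64.05'."""
    return int(version.split(".")[0])


def read_kernel_module_version(path="/proc/driver/nvidia/version"):
    """Version reported by the loaded kernel module, or None if unavailable."""
    try:
        with open(path) as handle:
            return parse_nvrm_version(handle.read())
    except OSError:
        return None  # module not loaded, or not running on the NAS host


# Illustrative sample of the line format (an assumption, not real output).
sample = "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  575.64.05  Release Build"
print(parse_nvrm_version(sample), driver_branch(parse_nvrm_version(sample)))
print(read_kernel_module_version())
```

Comparing driver_branch of the kernel module against the version nvidia-smi prints makes a kernel/userspace mismatch visible at a glance.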

Summary

Getting GPU acceleration working on QNAP requires understanding that it’s a two-part driver system, that the userspace component needs explicit initialization, and that the root tmpfs needs more headroom than it gets by default. Once those three things are handled in the NvKernelDriver start script, the system boots cleanly and CUDA works reliably without any manual intervention.

It took a lot of troubleshooting to get here — hopefully this saves you the same journey.

Lucas · May 3, 2026, 12:12am

Hi @mjellybaby ,

Thank you for sharing such a detailed write-up.

This is a very valuable contribution for users who are exploring GPU-accelerated AI workloads in Docker on QNAP NAS, especially for use cases involving PaddlePaddle, PaddleOCR, PaddleX, and onnxruntime-gpu.

We appreciate the time you spent documenting your environment, the issues you encountered, and the steps you used to verify that GPU acceleration was working inside the container. Practical experience like this is very helpful for other advanced users who may be trying similar AI or OCR workflows on their NAS.

We will share this with the relevant team for internal reference and further evaluation. While actual configurations may vary depending on the NAS model, GPU, driver version, QTS/QuTS hero version, and container setup, your findings provide a useful technical reference for the community.

Thank you again for taking the time to document and share this.

mjellybaby · May 3, 2026, 2:53am

Hi Lucas, thank you for the response and for passing it along to the team — that’s genuinely appreciated.

Since you mentioned internal evaluation, I wanted to take the opportunity to be specific about what would make GPU accelerator support in Container Station meaningfully better for developers. These aren’t complaints — they’re concrete things that would eliminate the entire class of problems documented in this guide:

1. Ensure NvKernelDriver and NVIDIA_GPU_DRV start in the correct order on every boot, automatically. Right now, the userspace driver (NVIDIA_GPU_DRV.sh start) doesn’t always complete before containers launch, and there’s no native mechanism to sequence this. Containers using nvidia-runtime should have a guaranteed-ready GPU before they start.

2. Expand the root tmpfs by default when a GPU QPKG is installed. The 442MB default root filesystem is too small for GPU fault buffer allocation on Ada Lovelace and newer architectures. This causes silent cuInit failures that are extremely difficult to diagnose. Automatically sizing the tmpfs to at least 800MB when NvKernelDriver is installed would eliminate this entirely.

3. Keep NvKernelDriver and NVIDIA_GPU_DRV version-locked to each other. The h5.3.3 OS ships with a 550.x kernel module but the GPU Driver QPKG installs 575.x userspace libraries. This mismatch is the source of a lot of pain. They should always be the same version and updated together.

4. Publish stable, documented paths for GPU libraries. The paths to libcuda.so, the ldcache, and the container runtime config vary between QuTS hero versions and aren’t documented. A stable API — even just a /opt/nvidia/ symlink that always points to the active driver — would make it possible to write container configurations that survive driver upgrades.

5. Add a GPU health check to Container Station. A simple status indicator showing whether cuInit succeeds from the host, which driver versions are loaded, and whether the kernel/userspace versions match would save hours of debugging for users.

None of these require changing the fundamental architecture — they’re mostly sequencing, sizing, and documentation improvements. The QNAP hardware is genuinely capable; it’s the gap between the driver install and a working container environment that’s the hard part right now.

Thanks again for engaging with this — happy to provide more detail on any of the above.
