
Docker With GPU

March 18, 2025 · 3 min read
Tutorial
Docker | GPU

NVIDIA Container Toolkit Architecture

If you have any questions, feel free to comment below.

This post introduces how to use a GPU in Docker, and also covers the basic usage of a GPU on the host machine.

Basic GPU components #

  • GPU Driver
  • CUDA Toolkit

The CUDA driver and the kernel-mode components are delivered together in the NVIDIA display driver package (the CUDA Toolkit is generally optional)[1]:

The relationship between the CUDA Toolkit and the GPU driver

You can check the device with:

lspci | grep NVIDIA

Install the driver on your device #

Download your corresponding driver from NVIDIA Driver Downloads. Then run the .run file directly.

sh NVIDIA-Linux-x86_64-550.54.14.run

# check the driver
nvidia-smi

ATTENTION: the CUDA version shown here only represents the maximum CUDA version that the driver supports.
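
If you only want the driver version and the GPU model without the full table, nvidia-smi can print them in a machine-readable form (a small sketch using standard nvidia-smi query fields):

# print the driver version and GPU name only
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader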

Install CUDA Toolkit #

Download the corresponding CUDA Toolkit from CUDA Toolkit Downloads, and again select the runfile installer.

We have already installed the driver, so we only need to install the CUDA Toolkit here.

wget xxxxxx
sudo sh cuda_xxxx.run
===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-12.2/

Please make sure that
 -   PATH includes /usr/local/cuda-12.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.2/lib64, or, add /usr/local/cuda-12.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

Follow the instructions:

# add the CUDA to the PATH
export PATH=/usr/local/cuda-12.2/bin:$PATH
# add the CUDA lib64 to the LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH

# check the CUDA version
nvcc -V
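
Note that export only affects the current shell. To make the change permanent, a common approach is to append the two lines to your shell profile (a sketch assuming bash and the cuda-12.2 install path from above):

# persist the CUDA paths for future shells (bash assumed)
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc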

Test the CUDA environment #

We can test it by running a simple program:

# pip install torch
# python3 check_cuda_pytorch.py
import torch

def check_cuda_with_pytorch():
    """Check if the PyTorch CUDA environment is working correctly"""
    try:
        print("Checking PyTorch CUDA environment:")
        if torch.cuda.is_available():
            print(f"CUDA device is available, the current CUDA version is: {torch.version.cuda}")
            print(f"PyTorch version is: {torch.__version__}")
            print(f"Detected {torch.cuda.device_count()} CUDA devices.")
            for i in range(torch.cuda.device_count()):
                print(f"Device {i}: {torch.cuda.get_device_name(i)}")
                print(f"Device {i} total memory: {torch.cuda.get_device_properties(i).total_memory / (1024 ** 3):.2f} GB")
                print(f"Device {i} current memory usage: {torch.cuda.memory_allocated(i) / (1024 ** 3):.2f} GB")
                print(f"Device {i} max memory usage: {torch.cuda.memory_reserved(i) / (1024 ** 3):.2f} GB")
        else:
            print("CUDA device is not available.")
    except Exception as e:
        print(f"Error when checking PyTorch CUDA environment: {e}")

if __name__ == "__main__":
    check_cuda_with_pytorch()

The call process [2]

Docker with GPU #

Install nvidia-container-toolkit #

The main purpose of this component is to mount the GPU devices into the container.[3]

You can follow the latest version of the official documentation to install it.
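
On Ubuntu/Debian this usually comes down to adding NVIDIA's apt repository and then installing the nvidia-container-toolkit package; the repository-setup commands change between releases, so copy them from the official guide. A minimal sketch of the final step:

# after adding NVIDIA's repository as described in the official docs
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit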

Old version: register the nvidia runtime manually in Docker's daemon.json:

    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }

New version: use nvidia-ctk to configure the runtime automatically:

sudo nvidia-ctk runtime configure --runtime=docker
# Restart the docker service
sudo systemctl restart docker
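
To confirm that the nvidia runtime was registered, you can check the Docker daemon info (a quick sanity check; the exact output format depends on your Docker version):

# the nvidia runtime should show up in the list of runtimes
docker info | grep -i runtimes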

Start the container with the --gpus parameter #

The whole call process is as follows:

The call process of nvidia-container-toolkit [4]

The call chain changes from containerd -> runc to containerd -> nvidia-container-runtime -> runc.

nvidia-container-runtime intercepts the container spec, adds the GPU-related configuration to it, and then passes the modified spec, which now contains the GPU information, on to runc.

NVIDIA Container Toolkit [5]

From the figure above, we can see that the CUDA Toolkit is already inside the container. So we only need to use an image that ships with CUDA; there is no need to install the CUDA Toolkit on the host machine.

Then we can start the container with the --gpus parameter:

  • all: allocate all GPUs to the container
  • device=<id>[,<id>...]: allocate specific GPUs to the container (check the IDs via nvidia-smi)
  • 'all,"capabilities=compute,utility,video"': allocate all GPUs to the container and set the corresponding capabilities. For example, to use nvidia-smi you need the utility capability, and video is needed for NVDEC and so on.

docker run --rm --gpus all nvidia/cuda:12.0.1-runtime-ubuntu22.04 nvidia-smi
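
For example, to pass only specific GPUs or to restrict the capabilities, note that the quoting matters because the value contains commas (sketches reusing the same public nvidia/cuda image as above):

# only GPU 0 and GPU 1
docker run --rm --gpus '"device=0,1"' nvidia/cuda:12.0.1-runtime-ubuntu22.04 nvidia-smi
# all GPUs, but only the compute and utility capabilities
docker run --rm --gpus 'all,"capabilities=compute,utility"' nvidia/cuda:12.0.1-runtime-ubuntu22.04 nvidia-smi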

Reference #
