Selecting GPUs in PyTorch.
14 Apr 2021

I have a weird configuration: one older GPU that is unsupported by PyTorch, and one newer GPU that is supported. With Docker, I was able to specify the correct GPU, and it worked.
Now I am using PyTorch directly, without the Docker interface, and ran into some snags specifying the GPU. This is not hard, but to lay it out super clearly for other newbies in this area, I'll go through it.
First, you can use nvidia-smi to view the GPUs on your system, and the purported assignment of numbers to GPUs (more on that later).
Wed Apr 14 09:28:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K2000        On   | 00000000:03:00.0  On |                  N/A |
| 30%   40C    P8    N/A /  N/A |    731MiB /  1998MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:04:00.0 Off |                  N/A |
| 40%   28C    P8    15W / 260W |   2043MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1540      G   /usr/lib/xorg/Xorg                 58MiB |
|    0   N/A  N/A      1778      G   /usr/bin/gnome-shell               51MiB |
|    0   N/A  N/A      2826      G   /usr/lib/xorg/Xorg                298MiB |
|    0   N/A  N/A      2942      G   /usr/bin/gnome-shell              249MiB |
|    0   N/A  N/A      5790      G   /usr/lib/firefox/firefox            0MiB |
|    0   N/A  N/A      8840      G   /usr/lib/firefox/firefox            0MiB |
|    0   N/A  N/A      9053      G   /usr/lib/firefox/firefox           58MiB |
|    1   N/A  N/A      9750      C   /usr/bin/python3                 2039MiB |
+-----------------------------------------------------------------------------+
(Side note: past me would have used a screenshot here. Why not set it as code, bash style? Much better for readability, I think.)
So nvidia-smi is indicating that GPU 1 is the supported GPU. OK.
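(If you just want the index-to-name mapping without the full table, nvidia-smi -L prints one line per GPU, with its index, name, and UUID.)
nvidia-smi -L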
Instructions from various forums, e.g. the PyTorch forums, say to specify the GPU from the command line, such as
CUDA_VISIBLE_DEVICES=1
which I was aware of. BUT! You actually need to prepend it to the command you run:
CUDA_VISIBLE_DEVICES=1 python test.py
That environment variable will not persist through the session unless you export it,
export CUDA_VISIBLE_DEVICES=0
and then you can run your test code with python test.py (or jupyter notebook). To make this variable stick across bash sessions, you can set it in your shell startup file (for instance, ~/.bashrc), but I'll not go into that right now. I knew this in one part of my brain; I blame the pandemic.
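If you ever wonder whether the variable actually reached a given process, you can read it back from inside Python; here is a tiny standard-library check (my addition, not from the forums):

import os

# CUDA_VISIBLE_DEVICES is an ordinary environment variable; this prints
# whatever the current process inherited, or None if it was never set.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))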
OK, so next topic: some good test code, my test.py, with many similarities to Chris Albon's.
import torch

# Report CUDA and cuDNN status for this process.
print("Is cuda available?", torch.cuda.is_available())
print("Is cuDNN version:", torch.backends.cudnn.version())
print("cuDNN enabled? ", torch.backends.cudnn.enabled)
print("Device count?", torch.cuda.device_count())
print("Current device?", torch.cuda.current_device())
print("Device name? ", torch.cuda.get_device_name(torch.cuda.current_device()))

# Create a small tensor as a smoke test.
x = torch.rand(5, 3)
print(x)
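As an aside, and not part of my original test: you can also pick a device inside PyTorch code with the standard torch.device mechanism, instead of (or on top of) CUDA_VISIBLE_DEVICES. A minimal sketch:

import torch

# Device indices here refer to the devices CUDA exposes, so they are
# still affected by whatever CUDA_VISIBLE_DEVICES is set to.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.rand(5, 3, device=device)  # allocate the tensor on that device
print(x.device)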
When the numbering in nvidia-smi is wonky

Running the test code helps you find where things are going weird, and I got an indication from messages in the forum that sometimes the numbering in nvidia-smi does not match the numbering needed in CUDA_VISIBLE_DEVICES. In my case, the numbering in CUDA_VISIBLE_DEVICES is flipped from nvidia-smi. YES.
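(Side note, and my best guess at the cause rather than something I pulled from the forum thread: by default, CUDA orders devices fastest-first, while nvidia-smi numbers them by PCI bus location. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID tells CUDA to use PCI ordering instead, so the two numberings should then agree:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 python test.py
I have not tested that on this machine, so treat it as a hint.)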
First try:
$ CUDA_VISIBLE_DEVICES=1 python test.py
Is cuda available? True
Is cuDNN version: 8005
cuDNN enabled? True
Device count? 1
/home/atabb/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:81: UserWarning:
Found GPU0 Quadro K2000 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability that we support is 3.5.
warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
Current device? 0
Device name? Quadro K2000
tensor([[0.1904, 0.3988, 0.5989],
        [0.5658, 0.7823, 0.9238],
        [0.8135, 0.3541, 0.9398],
        [0.6298, 0.7443, 0.5831],
        [0.0502, 0.8443, 0.5911]])
Oh no, that’s the old GPU.
Second try:
$ CUDA_VISIBLE_DEVICES=0 python test.py
Is cuda available? True
Is cuDNN version: 8005
cuDNN enabled? True
Device count? 1
Current device? 0
Device name? GeForce RTX 2080 Ti
tensor([[0.7686, 0.0573, 0.3836],
        [0.1975, 0.9561, 0.8107],
        [0.9169, 0.3892, 0.6475],
        [0.2461, 0.6731, 0.5082],
        [0.4824, 0.3800, 0.9623]])
CUDA_VISIBLE_DEVICES=0 for me, then! If you need to specify more than one GPU, use a comma:
$ CUDA_VISIBLE_DEVICES=0,1 python test.py
Is cuda available? True
Is cuDNN version: 8005
cuDNN enabled? True
Device count? 2
/home/atabb/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:81: UserWarning:
Found GPU1 Quadro K2000 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability that we support is 3.5.
warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
Current device? 0
Device name? GeForce RTX 2080 Ti
tensor([[0.3730, 0.6497, 0.8243],
        [0.8727, 0.8672, 0.2272],
        [0.2291, 0.4075, 0.7580],
        [0.9754, 0.8914, 0.3489],
        [0.0208, 0.9947, 0.0432]])
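With both devices visible, a short loop prints the whole index-to-name mapping in one shot, which beats the trial and error above (my addition; it uses only standard torch.cuda calls):

import torch

# Print every device PyTorch can currently see, so you can match CUDA
# indices to physical cards without re-running the test script.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))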