
Segmentation fault when GPUs are already used #152

Closed
nouiz opened this issue Nov 11, 2015 · 42 comments
Labels
stat:contribution welcome Status - Contributions welcome

Comments

@nouiz (Contributor) commented Nov 11, 2015

When I set the nvidia driver to exclusive mode and one of the GPUs is already used by another process, I get a segmentation fault:

$ python -c "import tensorflow as tf;tf.InteractiveSession()"
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
Segmentation fault (core dumped)

If I limit the visible GPUs to only those that have nothing running on them, it doesn't segfault:

$ CUDA_VISIBLE_DEVICES=1 python -c "import tensorflow as tf;tf.InteractiveSession()"
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:09:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 12105628263
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 12

Here is my output of nvidia-smi:

$ nvidia-smi 
Wed Nov 11 16:48:27 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750     Off  | 0000:05:00.0      On |                  N/A |
| N/A   48C    P8     0W /  38W |     25MiB /  2047MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:06:00.0     Off |                  N/A |
| 42%   82C    P2   127W / 250W |    262MiB / 12287MiB |     39%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:09:00.0     Off |                  N/A |
| 22%   46C    P8    17W / 250W |     23MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:0A:00.0     Off |                  N/A |
| 22%   37C    P8    15W / 250W |    361MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1171    G   /usr/bin/X                                      17MiB |
|    1     32740    C   python                                         209MiB |
|    3      9429    C   python                                         336MiB |
+-----------------------------------------------------------------------------+

I suppose the code that checks the available GPUs doesn't correctly handle the case where one of the GPUs can't be used.

@zheng-xq (Contributor)

@nouiz, you are correct in your analysis. TensorFlow tries to grab every GPU it sees on the system that passes its criteria. CUDA_VISIBLE_DEVICES is the solution I would suggest. Please let us know if that is not enough.

@skearnes

@zheng-xq +1 on CUDA_VISIBLE_DEVICES; this fixes the issue for me.

@vrv vrv closed this as completed Nov 13, 2015
@vrv commented Nov 13, 2015

Re-open if CUDA_VISIBLE_DEVICES isn't good enough

@noisychannel

@vrv This is still a problem when there are multiple GPUs (in exclusive mode) on the machine and some of them have processes running on them. Setting CUDA_VISIBLE_DEVICES is not ideal if you do not actually know which GPU is available: nvidia-smi only shows the current state of the GPUs, and there is no way to ensure that, between checking and spawning your process, no other process has occupied the GPU you selected. Is there a way of polling the GPUs and assigning the process to the first available, compatible GPU on the machine instead of having to specify the GPU id manually?

PS: I cannot reopen this issue.

@vrv vrv reopened this Dec 11, 2015
@vrv commented Dec 11, 2015

@zheng-xq: is it possible to query the status of the exclusive mode bit?

@zheng-xq (Contributor)

@noisychannel, even if we could query which GPUs are available, by the time we actually try to assign one it could already be taken by another process. This is not very different from having a script query the GPU status through nvidia-smi and set the GPU id accordingly.

The only way I can think of that would make your case easier is to actually create the context and, if that fails, ignore the failure and move on to the next GPU. However, this is undesirable for the more common cases, where such context-creation errors are signs of bigger problems.

@noisychannel

The established way to do this seems to be:

  1. Set the GPUs to operate in exclusive mode.
  2. Do not call cudaSetDevice().
  3. When you run your program, it will try to create the context on the first GPU and fail if that GPU is busy. Assuming you're using CUDART, it should then silently try to create the context on the next GPUs, and fail only if all GPUs are busy.

As for exception handling while creating the context, this scenario can be handled with the errors returned by cuCtxCreate().

A solution like this is optimal when you delegate the scheduling of your jobs to something like SGE and cannot trust the script-based querying approach, which I think is common on large clusters.
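
As a rough illustration of that fallback behaviour, here is a minimal sketch against the CUDA driver API via ctypes. It assumes libcuda.so.1 is on the loader path, and it is illustrative only; it is not how TensorFlow itself enumerates devices.

import ctypes

# Sketch: walk the devices and keep the first one on which a context can be
# created. In exclusive mode, cuCtxCreate fails on GPUs that another process
# already holds, so we simply move on to the next ordinal.
cuda = ctypes.CDLL("libcuda.so.1")
CUDA_SUCCESS = 0

assert cuda.cuInit(0) == CUDA_SUCCESS
count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == CUDA_SUCCESS

context = ctypes.c_void_p()
chosen = None
for ordinal in range(count.value):
    device = ctypes.c_int()
    if cuda.cuDeviceGet(ctypes.byref(device), ordinal) != CUDA_SUCCESS:
        continue
    if cuda.cuCtxCreate_v2(ctypes.byref(context), 0, device) == CUDA_SUCCESS:
        chosen = ordinal
        break

print("usable GPU:", chosen)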

@zheng-xq (Contributor)

This would work fine if the framework only wanted to use a single GPU. TensorFlow is designed to use multiple GPUs seamlessly at the same time, so at the initialization stage it grabs all the visible devices that are compatible, since the client program might use all of them. The actual devices used are only known at graph construction/execution time, which is a much later stage.

For most of our existing users, the GPUs that are available to a particular job are known when it starts, so it is okay for TensorFlow to take all of them. A few options to think about:

  1. Is it possible for the job scheduler to reserve the GPUs in your clusters? That would work best with TensorFlow.
  2. If you only know the list of candidate GPUs and the number of GPUs to use, we can add a special mode where we try to create contexts from the candidate list and stop once the desired number of GPUs has been taken. This requires more plumbing in both TensorFlow and the underlying StreamExecutor that actually manages GPUs. We would much prefer option 1 if that is possible for you.

@noisychannel

@zheng-xq For 1: AFAIK, the only responsibility of most job schedulers is to ensure that you will have access to the number of GPUs you asked for on the machine assigned to you. The actual CPU/GPU scheduling of processes on that machine is left to the OS/CUDA driver. Also, hypothetically, if the scheduler were to attempt something of this sort, it would not be able to block the GPU without creating something like a context.

For 2: I understand that this is probably painful. However, it adds a more general set of features. The case where I want k out of n GPUs on the machine is the most prominent: e.g., it would handle the case where the candidate list is the set of all GPUs on the machine and I want only 2 available, compatible GPUs from that list. Having the ability to specify how many GPUs you want to use would be a great addition, and it would tie in perfectly with most job schedulers: I tell TensorFlow that I want to run this thing on 2 GPUs, and I tell the job scheduler that I need 2 GPUs.

@skearnes

Drive-by comment: the sysadmin for my local SGE cluster was able to have an environment variable set that tells me which GPUs my jobs have been assigned. I then use this variable to set CUDA_VISIBLE_DEVICES. I would guess that most common schedulers have similar capabilities, although they may require root-level tweaking.
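
For example, a minimal sketch of that mapping (the SGE_GPU variable name is just a stand-in for whatever your scheduler actually exports):

import os

# Sketch: copy the scheduler-assigned GPU list (variable name assumed)
# into CUDA_VISIBLE_DEVICES before TensorFlow touches CUDA.
gpu_ids = os.environ.get("SGE_GPU")       # e.g. "2" or "1,3"
if gpu_ids is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids

import tensorflow as tf                   # import only after the variable is set
sess = tf.InteractiveSession()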

@noisychannel

This is useful to know. However, something like this requires all jobs on the cluster to respect that environment variable when using GPUs, and it takes away some of the scheduling responsibility that should ideally belong to the CUDA driver, adding another layer of complexity. In fact, most other GPU-capable toolkits do not require anything like this: as long as the GPUs are set to exclusive mode, they work just fine.

@jtrmal commented Dec 11, 2015

Another drive-by comment: when the allocation fails, the app actually segfaults on our machines. That does not give the user any possibility to recover from it at the Python level.
A good-enough solution would be to
a) throw an exception, so that the user can catch it and try (for example) another device;
b) ideally, add /gpu:auto, which would just iterate through the devices and allocate one that is available.

@zheng-xq (Contributor)

@noisychannel, please see if you can work around this issue at the moment. Meanwhile, I will investigate if we can officially support the soft-try-and-device-limit approach in TensorFlow.

@noisychannel

Okay. Thanks. Keep me posted.

@girving (Contributor) commented Mar 8, 2016

@zheng-xq: Is this still an issue?

@noisychannel

I'm not aware of a "soft-try-and-assign-device" implementation in Tensorflow.

@aselle (Contributor) commented Aug 15, 2016

Closing this automatically due to lack of recent activity. Please reopen when new information becomes available. @zheng-xq, please re-open if you still plan to work on this.

@aselle aselle closed this as completed Aug 15, 2016
@AndreasMadsen (Contributor)

@aselle I'm on a shared multi-GPU machine and this doesn't appear to have been fixed.

@aselle (Contributor) commented Oct 21, 2016

@AndreasMadsen, what are your exact hardware/software configuration and steps to reproduce (see the new issue template)?

@AndreasMadsen (Contributor)

The steps are the same as @nouiz describes: python -c "import tensorflow as tf;tf.InteractiveSession()" fails with a segmentation fault. Adding CUDA_VISIBLE_DEVICES=1 fixes it.

I have a hunch that it might be related to mixing GPUs with different CUDA capabilities, but I have no way to test that.

Sat Oct 22 16:07:37 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 0000:02:00.0     Off |                    0 |
| 23%   23C    P8    18W / 235W |     22MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          On   | 0000:03:00.0     Off |                    0 |
| 30%   20C    P8    16W / 225W |     13MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          On   | 0000:83:00.0     Off |                    0 |
| 30%   22C    P8    16W / 225W |     13MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 0000:84:00.0     Off |                  N/A |
| 22%   31C    P8    29W / 250W |    161MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

@aselle aselle reopened this Oct 24, 2016
@yaroslavvb (Contributor)

@cbquillen do I understand it correctly that the problem arises only when you use CUDA_VISIBLE_DEVICES mechanism to select the GPU device used by TensorFlow, and some other mechanism (which?) to select GPU device used by another program?

@vrv commented Jan 18, 2017

Indeed, visible_device_list deals with what is provided at runtime via the cuGetDevice API -- how external drivers and processes remap the physical ID to the 'visible id' is out of TensorFlow's control, and it sounds like you agree, that visible_device_list must use f(N), not N, since f is not something TF controls.

I believe our documentation is pretty clear on this point: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L48 . -- let us know if there's any further documentation that might help?
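
For reference, a minimal sketch of selecting devices through the config (TF 1.x-style API; the ids are CUDA's runtime ordinals, not nvidia-smi's PCI ordering):

import tensorflow as tf

# Sketch: restrict this process to CUDA devices "1" and "2" as CUDA
# enumerates them at runtime, i.e. f(N), which need not match nvidia-smi.
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(visible_device_list="1,2"))
sess = tf.Session(config=config)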

@cbquillen

@yaroslavvb

I am running into this in the case where exclusive GPU access is turned on and other processes are using some of the GPUs on the machine. If I find a free GPU subset using nvidia-smi (or the NVML library) and try to limit TensorFlow to that subset via CUDA_VISIBLE_DEVICES or gpu_options.visible_device_list in the TensorFlow config proto, the result is a request for busy GPUs, unless I use the mapping derived by looking at the PCI ids.

@yaroslavvb (Contributor) commented Jan 18, 2017

Hm, that's kind of annoying actually. I've also been parsing nvidia-smi to generate an export CUDA_VISIBLE_DEVICES=... command on shared computers (along the lines of setup_one_gpu). I haven't run into problems yet, but it would be nice to have a solution that works reliably.

@cbquillen

@vrv

The problem is that users will see the ordering provided by nvidia-smi (sorted by PCI id) and try to use it, not realizing it has nothing to do with what CUDA_VISIBLE_DEVICES actually selects. A lot of people think this is an NVIDIA bug, and it's possible that they agree and fix it in newer drivers. In the meantime, it might be worth adding a prominent warning to the documentation that this problem exists; otherwise you will continue to get queries about it.

@vrv commented Jan 18, 2017

I agree that this is a problem. I'm not sure if NVidia would prefer people to not use the cuGetDevice API in favor of whatever nvidia-smi is using. Perhaps @benbarsdell or @cliffwoolley might know?

In lieu of that, would you like to add the warning to the documentation somewhere? You probably know best where others like you would go looking for it :)

@cbquillen

@vrv
A short warning in tensorflow.Session about using CUDA_VISIBLE_DEVICES, with a link, would probably help. The visible_device_list documentation in config.proto would be the place to describe the complete problem and the target of that link.

stracing nvidia-smi reveals that it loads libnvidia-ml.so, so it is using NVML, and TensorFlow could too.
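
For illustration, roughly what an NVML query looks like from Python (using the third-party pynvml bindings; a sketch, not something TensorFlow does today):

import pynvml

# Sketch: list GPUs that currently have no compute processes. The indices
# follow NVML/nvidia-smi enumeration, which may differ from CUDA's default
# ordering, so combine this with CUDA_DEVICE_ORDER=PCI_BUS_ID.
pynvml.nvmlInit()
free_gpus = []
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    if not pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        free_gpus.append(i)
pynvml.nvmlShutdown()

print(",".join(str(i) for i in free_gpus))   # candidate CUDA_VISIBLE_DEVICES value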

@vrv commented Jan 18, 2017

I was thinking documentation rather than source code might be better: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/using_gpu/index.md#using-gpus as a possible area to write a short section on it?

cc @zheng-xq: I know he had reservations about adding visible_device_list for this exact reason of confusion -- perhaps he can weigh in if he wants.

@noisychannel commented Jan 18, 2017

This may be helpful to some users who are on a cluster: often you don't have information about which GPU has been allocated to you, or which free one to occupy via CUDA_VISIBLE_DEVICES.

I use this script (https://gist.github.com/noisychannel/cdf57e2f177e98ae653230323a093d1e) to return the first free GPU, and then use something like:

CUDA_VISIBLE_DEVICES=`scripts/free-gpu` python 

Edit: this is the parsed-nvidia-smi approach that @yaroslavvb mentioned earlier in this thread.
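
In case the gist is unreachable, a minimal sketch of the same idea (parsing nvidia-smi's CSV query output; the 100 MiB idle threshold is an assumption, since idle GPUs still report a few MiB used, as in the listings above):

import subprocess

# Sketch: print the index of the first GPU whose used memory is below a
# small threshold (idle GPUs in the outputs above report ~13-25 MiB, never 0).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.used",
     "--format=csv,noheader,nounits"]).decode()

for line in out.strip().splitlines():
    index, mem_used = (field.strip() for field in line.split(","))
    if int(mem_used) < 100:
        print(index)          # e.g. CUDA_VISIBLE_DEVICES=`python free_gpu.py`
        break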

@cbquillen commented Jan 19, 2017

@yaroslavvb
Looking at the NVIDIA docs more, I think you can get your setup_one_gpu script to work reliably if you set the CUDA_DEVICE_ORDER environment variable to 'PCI_BUS_ID'.

@zheng-xq A cheesy solution that would probably reduce the likelihood of complaints (but create some less likely new problems) would be to force-set CUDA_DEVICE_ORDER='PCI_BUS_ID' within TensorFlow.
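
For example, a user-level sketch of the same idea (setdefault keeps any value the user already exported):

import os

# Sketch: pin CUDA's enumeration to PCI bus order so indices match
# nvidia-smi, then pick a device, both before TensorFlow initializes CUDA.
os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1")   # index as shown by nvidia-smi

import tensorflow as tf
sess = tf.InteractiveSession()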

tarasglek pushed a commit to tarasglek/tensorflow that referenced this issue Jun 20, 2017
lukeiwanski pushed a commit to codeplaysoftware/tensorflow that referenced this issue Oct 26, 2017
@tensorflowbutler (Member)

It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@zheng-xq zheng-xq added the stat:contribution welcome Status - Contributions welcome label Dec 22, 2017
@zheng-xq zheng-xq removed their assignment Dec 22, 2017
@zheng-xq (Contributor)

If someone is interested in adding CUDA_DEVICE_ORDER='PCI_BUS_ID' support in TensorFlow, a contribution is welcome. Note that we can only override it if it is not already set, and it has to happen before any potential CUDA calls are made.

@cliffwoolley (Contributor)

I don't think I've actually seen an application that sets CUDA_DEVICE_ORDER on its own. Usually this would be set in the environment before launching the application, perhaps in combination with CUDA_VISIBLE_DEVICES.

@cbquillen

You can certainly set environment variables within an application before the first call to tf.Session().

@cliffwoolley (Contributor)

Of course it's possible. Just as in C you'd use setenv(). I was just saying that for those particular environment variables, the choice is a bit specific to the combination of application and machine the application is run on. So usually the user (or script) invoking the application would set those env vars rather than having the application set them unilaterally. I'd argue for example that it's unwise for TF itself to set CUDA_DEVICE_ORDER, because it would undo the CUDA feature that this variable sits on top of, which is to, by default, have the "best" device be device 0; this is handy when you don't know much about the system you're running on and are just trying things out. More sophisticated users just pick a GPU or, if really fancy, pick one based on NVML output as we're discussing here. But we shouldn't, IMO, assume all users are so sophisticated.

@imagefun

export CUDA_VISIBLE_DEVICES=0 solves my issue!

@wt-huang

Closing as this is resolved; feel free to reopen if the problem persists.

@saikrishnadas commented Apr 4, 2020

Python 3.6.9
TensorFlow 2.1.0 (GPU)
CUDA 10.0
TensorRT 6.0
GPUs: 1
I still get the error even after export CUDA_VISIBLE_DEVICES=0 or export CUDA_VISIBLE_DEVICES=1.

Error log :

Can't identify the cuda device. Running on device 0
Segmentation fault (core dumped)
