
Segmentation fault when GPUs are already used #152

Closed
nouiz opened this issue Nov 11, 2015 · 42 comments
Labels
stat:contribution welcome Status - Contributions welcome

Comments

@nouiz (Contributor) commented Nov 11, 2015

When I set the nvidia driver to exclusive mode and one of the GPUs is already used by another process, I get a segmentation fault:

$ python -c "import tensorflow as tf;tf.InteractiveSession()"
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
Segmentation fault (core dumped)

If I limit the visible GPUs to only those that have nothing running on them, it doesn't segfault:

$ CUDA_VISIBLE_DEVICES=1 python -c "import tensorflow as tf;tf.InteractiveSession()"
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:09:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 12105628263
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 12

Here is my output of nvidia-smi:

$ nvidia-smi 
Wed Nov 11 16:48:27 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750     Off  | 0000:05:00.0      On |                  N/A |
| N/A   48C    P8     0W /  38W |     25MiB /  2047MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:06:00.0     Off |                  N/A |
| 42%   82C    P2   127W / 250W |    262MiB / 12287MiB |     39%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:09:00.0     Off |                  N/A |
| 22%   46C    P8    17W / 250W |     23MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:0A:00.0     Off |                  N/A |
| 22%   37C    P8    15W / 250W |    361MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1171    G   /usr/bin/X                                      17MiB |
|    1     32740    C   python                                         209MiB |
|    3      9429    C   python                                         336MiB |
+-----------------------------------------------------------------------------+

I suppose the code that checks the available GPUs doesn't correctly handle the case where one of the GPUs can't be used.

@zheng-xq (Contributor)

@nouiz, you are correct in your analysis. TensorFlow tries to grab every GPU it sees on the system that passes its criteria. CUDA_VISIBLE_DEVICES is the solution I would suggest. Please let us know if that is not enough.

@skearnes

@zheng-xq +1 on CUDA_VISIBLE_DEVICES; this fixes the issue for me.

@vrv vrv closed this as completed Nov 13, 2015
@vrv commented Nov 13, 2015

Re-open if CUDA_VISIBLE_DEVICES isn't good enough

@noisychannel

@vrv This is still a problem when there are multiple GPUs (in exclusive mode) on the machine and some of them have processes running on them. Setting CUDA_VISIBLE_DEVICES is not ideal if you do not actually know which GPU is available: nvidia-smi only shows the current state of the GPUs, and there is no way to ensure that, between checking and spawning your process, no other process has occupied the GPU you selected. Is there a way of polling the GPUs and assigning the process to the first available, compatible GPU on the machine instead of having to specify the GPU id manually?

PS: I cannot reopen this issue.

@vrv vrv reopened this Dec 11, 2015
@vrv commented Dec 11, 2015

@zheng-xq: is it possible to query the status of the exclusive mode bit?

@zheng-xq (Contributor)

@noisychannel, even if we could query which GPUs are available, by the time we actually try to assign one it could already be taken by another process. This is not very different from having a script query the GPU status through nvidia-smi and set the GPU id accordingly.

The only way I can think of that would make your case easier is to actually create the context and, if that fails, ignore the failure and move on to the next GPU. However, this is undesirable for the more common cases, where such context-creation errors are signs of bigger problems.

@noisychannel

The established way to do this seems to be:

  1. Set the GPUs to operate in exclusive mode.
  2. Do not call cudaSetDevice().
  3. When you run your program, it will try to create the context on the first GPU and fail if that GPU is busy. Assuming you're using CUDART, it should then silently try to create the context on the next GPUs, and fail only if all GPUs are busy.

As for exception handling while creating the context, this scenario can be handled with the errors returned by cuCtxCreate().

A solution like this is optimal when you delegate the scheduling of your jobs to something like SGE and cannot trust the script-based querying approach, which I think is common on large clusters.
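
As a rough illustration of that fallback behaviour, here is a minimal sketch against the CUDA driver API via ctypes. It assumes libcuda.so.1 is on the loader path, and it is illustrative only; it is not how TensorFlow itself enumerates devices.

import ctypes

# Sketch: walk the devices and keep the first one on which a context can be
# created. In exclusive mode, cuCtxCreate fails on GPUs that another process
# already holds, so we simply move on to the next ordinal.
cuda = ctypes.CDLL("libcuda.so.1")
CUDA_SUCCESS = 0

assert cuda.cuInit(0) == CUDA_SUCCESS
count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == CUDA_SUCCESS

context = ctypes.c_void_p()
chosen = None
for ordinal in range(count.value):
    device = ctypes.c_int()
    if cuda.cuDeviceGet(ctypes.byref(device), ordinal) != CUDA_SUCCESS:
        continue
    if cuda.cuCtxCreate_v2(ctypes.byref(context), 0, device) == CUDA_SUCCESS:
        chosen = ordinal
        break

print("usable GPU:", chosen)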

@zheng-xq (Contributor)

This would work fine if the framework only wanted to use a single GPU. TensorFlow is designed to use multiple GPUs seamlessly at the same time, so at the initialization stage it grabs all the visible devices that are compatible, since the client program might use all of them. The actual devices used are only known at graph construction/execution time, which is a much later stage.

For most of our existing users, the GPUs that are available to a particular job are known when it starts, so it is okay for TensorFlow to take all of them. A few options to think about:

  1. Is it possible for the job scheduler to reserve the GPUs in your clusters? That would work best with TensorFlow.
  2. If you only know the list of candidate GPUs and the number of GPUs to use, we can add a special mode where we try to create contexts from the candidate list and stop once the desired number of GPUs has been taken. This requires more plumbing in both TensorFlow and the underlying StreamExecutor that actually manages GPUs. We would much prefer option 1 if that is possible for you.

@noisychannel

@zheng-xq For 1: AFAIK, the only responsibility of most job schedulers is to ensure that you will have access to the number of GPUs you asked for on the machine assigned to you. The actual CPU/GPU scheduling of processes on that machine is left to the OS/CUDA driver. Also, hypothetically, if the scheduler were to attempt something of this sort, it would not be able to block the GPU without creating something like a context.

For 2: I understand that this is probably painful. However, it adds a more general set of features. The case where I want k out of n GPUs on the machine is the most prominent: e.g., it would handle the case where the candidate list is the set of all GPUs on the machine and I want only 2 available, compatible GPUs from that list. Having the ability to specify how many GPUs you want to use would be a great addition, and it would tie in perfectly with most job schedulers: I tell TensorFlow that I want to run this thing on 2 GPUs, and I tell the job scheduler that I need 2 GPUs.

@skearnes

Drive-by comment: the sysadmin for my local SGE cluster was able to have an environment variable set that tells me which GPUs my jobs have been assigned. I then use this variable to set CUDA_VISIBLE_DEVICES. I would guess that most common schedulers have similar capabilities, although they may require root-level tweaking.
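
For example, a minimal sketch of that mapping (the SGE_GPU variable name is just a stand-in for whatever your scheduler actually exports):

import os

# Sketch: copy the scheduler-assigned GPU list (variable name assumed)
# into CUDA_VISIBLE_DEVICES before TensorFlow touches CUDA.
gpu_ids = os.environ.get("SGE_GPU")       # e.g. "2" or "1,3"
if gpu_ids is not None:
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids

import tensorflow as tf                   # import only after the variable is set
sess = tf.InteractiveSession()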

@noisychannel

This is useful to know. However, something like this requires all jobs on the cluster to respect that environment variable when using GPUs, and it takes away some of the scheduling responsibility that should ideally belong to the CUDA driver, adding another layer of complexity. In fact, most other GPU-capable toolkits do not require anything like this: as long as the GPUs are set to exclusive mode, they work just fine.

@jtrmal commented Dec 11, 2015

Another drive-by comment: when the allocation fails, the app actually segfaults on our machines. That does not give the user any possibility to recover from it at the Python level.
A good-enough solution would be to
a) throw an exception, so that the user can catch it and try (for example) another device;
b) ideally, add /gpu:auto, which would just iterate through the devices and allocate one that is available.

@zheng-xq (Contributor)

@noisychannel, please see if you can work around this issue at the moment. Meanwhile, I will investigate if we can officially support the soft-try-and-device-limit approach in TensorFlow.

@noisychannel

Okay. Thanks. Keep me posted.

@girving (Contributor) commented Mar 8, 2016

@zheng-xq: Is this still an issue?

@noisychannel

I'm not aware of a "soft-try-and-assign-device" implementation in Tensorflow.

@aselle (Contributor) commented Aug 15, 2016

Closing this automatically due to lack of recent activity. Please reopen when new information becomes available. @zheng-xq, please re-open if you still plan to work on this.

@aselle aselle closed this as completed Aug 15, 2016
@AndreasMadsen (Contributor)

@aselle I'm on a shared multi-GPU machine and this doesn't appear to have been fixed.

@aselle (Contributor) commented Oct 21, 2016

@AndreasMadsen, what are your exact hardware/software configuration and steps to reproduce (see the new issue template)?

@AndreasMadsen (Contributor)

The steps are the same as @nouiz describes: python -c "import tensorflow as tf;tf.InteractiveSession()" fails with a segmentation fault. Adding CUDA_VISIBLE_DEVICES=1 fixes it.

I have a hunch that it might be related to mixing GPUs with different CUDA capabilities, but I have no way to test that.

Sat Oct 22 16:07:37 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 0000:02:00.0     Off |                    0 |
| 23%   23C    P8    18W / 235W |     22MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          On   | 0000:03:00.0     Off |                    0 |
| 30%   20C    P8    16W / 225W |     13MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          On   | 0000:83:00.0     Off |                    0 |
| 30%   22C    P8    16W / 225W |     13MiB /  4799MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  On   | 0000:84:00.0     Off |                  N/A |
| 22%   31C    P8    29W / 250W |    161MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

@aselle aselle reopened this Oct 24, 2016
@yaroslavvb (Contributor)

@cbquillen do I understand it correctly that the problem arises only when you use CUDA_VISIBLE_DEVICES mechanism to select the GPU device used by TensorFlow, and some other mechanism (which?) to select GPU device used by another program?

@vrv commented Jan 18, 2017

Indeed, visible_device_list deals with what is provided at runtime via the cuGetDevice API -- how external drivers and processes remap the physical ID to the 'visible id' is out of TensorFlow's control, and it sounds like you agree, that visible_device_list must use f(N), not N, since f is not something TF controls.

I believe our documentation is pretty clear on this point: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L48 . -- let us know if there's any further documentation that might help?
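
For reference, a minimal sketch of selecting devices through the config (TF 1.x-style API; the ids are CUDA's runtime ordinals, not nvidia-smi's PCI ordering):

import tensorflow as tf

# Sketch: restrict this process to CUDA devices "1" and "2" as CUDA
# enumerates them at runtime, i.e. f(N), which need not match nvidia-smi.
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(visible_device_list="1,2"))
sess = tf.Session(config=config)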

@cbquillen

@yaroslavvb

I am running into this in the case where exclusive GPU access is turned on and other processes are using some of the GPUs on the machine. If I find a free GPU subset using nvidia-smi (or the NVML library) and try to limit TensorFlow to that subset via CUDA_VISIBLE_DEVICES or gpu_options.visible_device_list in the TensorFlow config proto, the result is a request for busy GPUs, unless I use the mapping derived by looking at the PCI ids.

@yaroslavvb (Contributor) commented Jan 18, 2017

Hm, that's kind of annoying actually. I've also been parsing nvidia-smi to generate an export CUDA_VISIBLE_DEVICES=... command on shared computers (along the lines of setup_one_gpu). I haven't run into problems yet, but it would be nice to have a solution that works reliably.

@cbquillen

@vrv

The problem is that users will see the ordering provided by nvidia-smi (sorted by PCI id) and try to use it, not realizing it has nothing to do with what CUDA_VISIBLE_DEVICES actually selects. A lot of people think this is an NVIDIA bug, and it's possible that they agree and fix it in newer drivers. In the meantime, it might be worth adding a prominent warning to the documentation that this problem exists; otherwise you will continue to get queries about it.

@vrv commented Jan 18, 2017

I agree that this is a problem. I'm not sure if NVidia would prefer people to not use the cuGetDevice API in favor of whatever nvidia-smi is using. Perhaps @benbarsdell or @cliffwoolley might know?

In lieu of that, would you like to add the warning to the documentation somewhere? You probably know best where others like you would go looking for it :)

@cbquillen

@vrv
A short warning in tensorflow.Session about using CUDA_VISIBLE_DEVICES, with a link, would probably help. The visible_device_list documentation in config.proto would be the place to describe the complete problem and the target of that link.

stracing nvidia-smi reveals that it loads libnvidia-ml.so, so it is using NVML, and TensorFlow could too.
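
For illustration, roughly what an NVML query looks like from Python (using the third-party pynvml bindings; a sketch, not something TensorFlow does today):

import pynvml

# Sketch: list GPUs that currently have no compute processes. The indices
# follow NVML/nvidia-smi enumeration, which may differ from CUDA's default
# ordering, so combine this with CUDA_DEVICE_ORDER=PCI_BUS_ID.
pynvml.nvmlInit()
free_gpus = []
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    if not pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        free_gpus.append(i)
pynvml.nvmlShutdown()

print(",".join(str(i) for i in free_gpus))   # candidate CUDA_VISIBLE_DEVICES value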

@vrv commented Jan 18, 2017

I was thinking documentation rather than source code might be better: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/using_gpu/index.md#using-gpus as a possible area to write a short section on it?

cc @zheng-xq: I know he had reservations about adding visible_device_list for this exact reason of confusion -- perhaps he can weigh in if he wants.

@noisychannel commented Jan 18, 2017

This may be helpful to some users who are on a cluster: often you don't have information about which GPU has been allocated to you, or which free one to occupy via CUDA_VISIBLE_DEVICES.

I use this script (https://gist.github.com/noisychannel/cdf57e2f177e98ae653230323a093d1e) to return the first free GPU, and then use something like:

CUDA_VISIBLE_DEVICES=`scripts/free-gpu` python 

Edit: this is the parsed-nvidia-smi approach that @yaroslavvb mentioned earlier in this thread.
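
In case the gist is unreachable, a minimal sketch of the same idea (parsing nvidia-smi's CSV query output; the 100 MiB idle threshold is an assumption, since idle GPUs still report a few MiB used, as in the listings above):

import subprocess

# Sketch: print the index of the first GPU whose used memory is below a
# small threshold (idle GPUs in the outputs above report ~13-25 MiB, never 0).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.used",
     "--format=csv,noheader,nounits"]).decode()

for line in out.strip().splitlines():
    index, mem_used = (field.strip() for field in line.split(","))
    if int(mem_used) < 100:
        print(index)          # e.g. CUDA_VISIBLE_DEVICES=`python free_gpu.py`
        break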

@cbquillen commented Jan 19, 2017

@yaroslavvb
Looking at the NVIDIA docs more, I think you can get your setup_one_gpu script to work reliably if you set the CUDA_DEVICE_ORDER environment variable to 'PCI_BUS_ID'.

@zheng-xq A cheesy solution that would probably reduce the likelihood of complaints (but create some less likely new problems) would be to force-set CUDA_DEVICE_ORDER='PCI_BUS_ID' within TensorFlow.
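
For example, a user-level sketch of the same idea (setdefault keeps any value the user already exported):

import os

# Sketch: pin CUDA's enumeration to PCI bus order so indices match
# nvidia-smi, then pick a device, both before TensorFlow initializes CUDA.
os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1")   # index as shown by nvidia-smi

import tensorflow as tf
sess = tf.InteractiveSession()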

tarasglek pushed a commit to tarasglek/tensorflow that referenced this issue Jun 20, 2017
lukeiwanski pushed a commit to codeplaysoftware/tensorflow that referenced this issue Oct 26, 2017
@tensorflowbutler (Member)

It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@zheng-xq zheng-xq added the stat:contribution welcome Status - Contributions welcome label Dec 22, 2017
@zheng-xq zheng-xq removed their assignment Dec 22, 2017
@zheng-xq (Contributor)

If someone is interested in adding CUDA_DEVICE_ORDER='PCI_BUS_ID' support in TensorFlow, a contribution is welcome. Note that we can only override it if it is not already set, and it has to happen before any potential CUDA calls are made.

@cliffwoolley (Contributor)

I don't think I've actually seen an application that sets CUDA_DEVICE_ORDER on its own. Usually this would be set in the environment before launching the application, perhaps in combination with CUDA_VISIBLE_DEVICES.

@cbquillen

You can certainly set environment variables within an application before the first call to tf.Session().

@cliffwoolley (Contributor)

Of course it's possible. Just as in C you'd use setenv(). I was just saying that for those particular environment variables, the choice is a bit specific to the combination of application and machine the application is run on. So usually the user (or script) invoking the application would set those env vars rather than having the application set them unilaterally. I'd argue for example that it's unwise for TF itself to set CUDA_DEVICE_ORDER, because it would undo the CUDA feature that this variable sits on top of, which is to, by default, have the "best" device be device 0; this is handy when you don't know much about the system you're running on and are just trying things out. More sophisticated users just pick a GPU or, if really fancy, pick one based on NVML output as we're discussing here. But we shouldn't, IMO, assume all users are so sophisticated.

@imagefun

export CUDA_VISIBLE_DEVICES=0 solves my issue!

@wt-huang

Closing as this is resolved; feel free to reopen if the problem persists.

@saikrishnadas commented Apr 4, 2020

Python 3.6.9
TensorFlow 2.1.0 (GPU)
CUDA 10.0
TensorRT 6.0
GPUs: 1
I still get the error even after export CUDA_VISIBLE_DEVICES=0 or export CUDA_VISIBLE_DEVICES=1.

Error log :

Can't identify the cuda device. Running on device 0
Segmentation fault (core dumped)
