Segmentation fault when GPUs are already used #152
@nouiz, you are correct in your analysis. TensorFlow tries to grab every GPU it sees on the system that passes its criteria. CUDA_VISIBLE_DEVICES is the solution I would suggest. Please let us know if that is not enough.
@zheng-xq +1 on CUDA_VISIBLE_DEVICES; this fixes the issue for me.
Re-open if CUDA_VISIBLE_DEVICES isn't good enough.
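For anyone landing here later, a minimal sketch of the suggested workaround (the index 0 is just an example; the variable must be set before TensorFlow initializes CUDA):

```python
import os

# Restrict this process to a single GPU; this must happen before TensorFlow
# (and hence CUDA) is initialized, i.e. before the import below.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # example: expose only GPU 0

import tensorflow as tf

sess = tf.Session()   # TensorFlow now only sees, and grabs, GPU 0
```

The same thing can be done from the shell, e.g. `CUDA_VISIBLE_DEVICES=0 python train.py`.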
@vrv This is still a problem when there are multiple GPUs (in exclusive mode) on the machine and some of them have processes running on them. Setting CUDA_VISIBLE_DEVICES is not ideal if you do not actually know which GPU is available. nvidia-smi will only show the current state of the GPUs, and there's no way of ensuring that no other process occupies the GPU you chose while you spawn your own. Is there a way of polling the GPUs and assigning the process to the first available, compatible GPU on the machine instead of having to manually specify the GPU id? PS: I cannot reopen this issue.
@zheng-xq: is it possible to query the status of the exclusive-mode bit?
@noisychannel, even if we can query which GPU is available, by the time we actually try to assign the GPU it could be taken by another process. This is not very different from having a script query GPU status through nvidia-smi and set the GPU id accordingly. The only way I can think of that would make your case easier is to actually create the context and, if it fails, ignore that failure and move on to the next GPU. However, this is undesirable for many more common cases, where such context-creation errors are signs of bigger problems.
The established way to do this seems to be:
As for exception handling while creating the context, it seems like this scenario can be handled with the exceptions returned. A solution like this is optimal when you delegate the scheduling of your jobs to something like SGE and cannot trust the script-based querying approach, which I think is common for large clusters.
This would work fine if the framework only wants to use a single GPU. TensorFlow is designed to use multiple GPUs seamlessly at the same time, so at the initialization stage it grabs all the visible devices that are compatible, since the client program might use all of them. The actual devices used are only known at graph construction/execution time (see the sketch below), which is a much later stage. For most of our existing users, the GPUs available to a particular job are known when it starts, so it is okay for TensorFlow to take all of them. A few options to think about:
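For readers unfamiliar with the graph-construction point above, a small sketch of device selection at graph build time, using the TF 1.x-style API (the device string and values are only for illustration):

```python
import tensorflow as tf  # TF 1.x-style API assumed

# The device an op runs on is decided when the graph is built, long after
# process start-up has already grabbed all visible GPUs.
with tf.device('/gpu:1'):          # pin these ops to the second visible GPU
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a + b

# allow_soft_placement lets TF fall back to another device if /gpu:1 is absent.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))
```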
@zheng-xq For 1: AFAIK, the only responsibility of most job schedulers is to ensure that you will have access to the number of GPUs you asked for on the machine assigned to you. It leaves the actual CPU/GPU scheduling of the processes on that machine to the OS/CUDA driver. Also, hypothetically, if the scheduler were to attempt something of this sort, it would not be able to block the GPU without creating something like a context. For 2: I understand that this is probably painful. However, it does add a more general set of features. The case where I want
Drive-by comment: the sysadmin for my local SGE cluster was able to set an environment variable that tells me which GPUs my jobs have been assigned. I then use this variable to set CUDA_VISIBLE_DEVICES. I would guess that most common schedulers have similar capabilities, although they may require root-level tweaking.
This is useful to know. However, something like this would require all jobs on the cluster to respect that environment variable when using GPUs, and it takes away some of the scheduling responsibility that should ideally belong to the CUDA driver, adding another layer of complexity. In fact, most other GPU-capable toolkits do not require anything like this: as long as the GPUs are set in exclusive mode, they work just fine.
Another drive-by comment: when the allocation fails, the app actually segfaults on our machines. That leaves the user no way to recover from it at the Python level.
@noisychannel, please see if you can work around this issue at the moment. Meanwhile, I will investigate whether we can officially support the soft-try-and-device-limit approach in TensorFlow.
Okay, thanks. Keep me posted.
@zheng-xq: Is this still an issue?
I'm not aware of a "soft-try-and-assign-device" implementation in TensorFlow.
Closing this automatically due to lack of recent activity. Please reopen when new information becomes available. @zheng-xq, please re-open if you are still planning to work on this.
@aselle I'm on a shared multi-GPU machine and this doesn't appear to have been fixed.
@AndreasMadsen, what is your exact hardware/software configuration, and what are the steps to reproduce (see the new issue template)?
The steps are the same as @nouiz describes. I have a hunch that it might be related to mixing GPUs with different CUDA compute capabilities, but I have no way to test that.
@cbquillen do I understand it correctly that the problem arises only when you use visible_device_list?
Indeed, visible_device_list deals with what is provided at runtime via the cuGetDevice API. How external drivers and processes remap the physical ID to the 'visible id' is out of TensorFlow's control, and it sounds like you agree that visible_device_list must use f(N), not N, since f is not something TF controls. I believe our documentation is pretty clear on this point: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L48. Let us know if there's any further documentation that might help.
I am running into this in the case where exclusive GPU access is turned on and there are other processes using some GPUs on the machine. If I find a free GPU subset using nvidia-smi (or the NVML library) and try to limit TensorFlow to this subset via CUDA_VISIBLE_DEVICES or gpu_options.visible_device_list in the TensorFlow ConfigProto, the result is a request for busy GPUs unless I use the mapping derived by looking at PCI ids.
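For reference, a sketch of that ConfigProto route in the TF 1.x-style API; the id "1" is just an example and, as discussed in this thread, refers to CUDA's device enumeration, not nvidia-smi's ordering:

```python
import tensorflow as tf  # TF 1.x-style API assumed

# "1" is a CUDA device id, which can differ from the nvidia-smi (PCI) ordering.
gpu_options = tf.GPUOptions(visible_device_list='1')
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant('hello')))
```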
Hm, that's kind of annoying actually. I've also been parsing nvidia-smi output.
The problem is that users will see the ordering provided by nvidia-smi (sorted by PCI id) and try to use it, not realizing it has nothing to do with what CUDA_VISIBLE_DEVICES actually selects. A lot of people now think this is an NVIDIA bug, and it's possible that they agree and fix it in newer drivers. However, it might be worth adding a prominent warning to the documentation that this problem exists; otherwise you will continue to get queries about it.
I agree that this is a problem. I'm not sure whether NVIDIA would prefer people not to use the cuGetDevice API in favor of whatever nvidia-smi is using. Perhaps @benbarsdell or @cliffwoolley might know? In lieu of that, would you like to add the warning to the documentation somewhere? You probably know best where others like you would go looking for it :)
@vrv stracing nvidia-smi reveals that it loads libnvidia-ml.so, so it is using NVML, and so could TensorFlow.
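As a rough sketch of what that could look like from Python (this assumes the third-party pynvml bindings; the check is inherently racy, as discussed above, and NVML indexes devices in PCI bus order, which may differ from CUDA's ordering):

```python
import pynvml  # third-party NVML bindings, assumed to be installed

def free_gpu_indices():
    """Best-effort list of GPUs with no compute processes currently running."""
    pynvml.nvmlInit()
    try:
        free = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            if not pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                free.append(i)
        return free
    finally:
        pynvml.nvmlShutdown()

print(free_gpu_indices())   # e.g. [0, 3] on a busy 4-GPU box
```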
I was thinking documentation rather than source code might be better: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/using_gpu/index.md#using-gpus is a possible place to write a short section on it. cc @zheng-xq: I know he had reservations about adding visible_device_list for this exact reason of confusion; perhaps he can weigh in if he wants.
This may be helpful to some users who are on a cluster. Often you don't have information about which GPU has been allocated to you, or which free one you may occupy, via CUDA_VISIBLE_DEVICES. I use this script (https://gist.github.com/noisychannel/cdf57e2f177e98ae653230323a093d1e) to return the first free GPU, and then use something like the sketch below:
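A hypothetical sketch of that wiring (it assumes the gist is saved locally as free_gpu.py and prints a single GPU index on stdout; adjust to whatever the script actually outputs):

```python
import os
import subprocess

# Hypothetical: free_gpu.py is the gist above and prints one free GPU index.
gpu_id = subprocess.check_output(['python', 'free_gpu.py']).decode().strip()
os.environ['CUDA_VISIBLE_DEVICES'] = gpu_id

import tensorflow as tf  # TensorFlow now only sees the selected GPU
```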
Edit: This is the parsed nvidia-smi output that @yaroslavvb mentioned earlier in this thread.
@yaroslavvb @zheng-xq A cheesy solution that would probably reduce the likelihood of complaints (but create some less likely new problems) would be to force-set CUDA_DEVICE_ORDER='PCI_BUS_ID' within TensorFlow.
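For users who want that behaviour today without any TensorFlow change, a minimal sketch of doing it per process (the device index is just an example):

```python
import os

# Make CUDA enumerate devices in PCI bus order, matching nvidia-smi's listing.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '2'   # example: the third GPU as shown by nvidia-smi

import tensorflow as tf
```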
It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
If someone is interested in adding CUDA_DEVICE_ORDER='PCI_BUS_ID' support in TensorFlow, a contribution is welcome. But note that we can only override it if it is not already set, and it has to be set before any potential CUDA calls are made.
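A minimal sketch of what such a conditional override could look like in Python terms; this is illustrative only, not TensorFlow's actual initialization code:

```python
import os

# Only override if the user or launching script hasn't already chosen an
# ordering, and do it before anything can trigger CUDA initialization.
os.environ.setdefault('CUDA_DEVICE_ORDER', 'PCI_BUS_ID')

import tensorflow as tf  # no CUDA calls have been made yet in this process
```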
I don't think I've actually seen an application that sets CUDA_DEVICE_ORDER on its own. Usually this would be set in the environment before launching the application, perhaps in combination with CUDA_VISIBLE_DEVICES.
You can certainly set environment variables within an application before the first call to tf.Session().
Of course it's possible; just as in C you'd use setenv(). I was just saying that, for those particular environment variables, the choice is a bit specific to the combination of the application and the machine it is run on. So usually the user (or script) invoking the application would set those env vars rather than having the application set them unilaterally. I'd argue, for example, that it's unwise for TF itself to set CUDA_DEVICE_ORDER, because it would undo the CUDA feature that this variable sits on top of, which is to have the "best" device be device 0 by default; this is handy when you don't know much about the system you're running on and are just trying things out. More sophisticated users just pick a GPU or, if really fancy, pick one based on NVML output as we're discussing here. But we shouldn't, IMO, assume all users are so sophisticated.
Closing as this is resolved; feel free to reopen if the problem persists.
Python 3.6.9. Error log:
When I set the NVIDIA driver in exclusive mode and one of the GPUs is already used by another process, I get a segmentation fault:
If I limit the visible GPUs to only those that have nothing running on them, it doesn't segfault:
Here is my output of nvidia-smi:
I suppose that the code that checks the available GPUs doesn't correctly handle the case where one of the GPUs can't be used.