Redis Queue Worker Gets Stuck #2906
Comments
I've observed exactly the same issue on Amazon and logged my input at #2606. In the end I worked around it by listing all Redis connections and killing the stale ones.
@domenkozar is that your solution at the moment? (To kill those connections in a cronjob?)
Yes.
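A rough sketch of what that cron-style cleanup might look like, using redis-py's CLIENT LIST / CLIENT KILL support; the host, port and idle threshold are placeholder values, and this is only an approximation of the workaround described above, not the actual script:

```python
import redis

# Approximation of the "kill stale connections from a cronjob" workaround:
# list every client the broker sees and kill the ones that never issued a
# command and have been idle for a long time.
r = redis.StrictRedis(host="localhost", port=6379)   # placeholder broker

for client in r.client_list():
    idle = int(client.get("idle", 0))
    last_cmd = client.get("cmd", "")
    if last_cmd in ("NULL", "") and idle > 600:      # threshold is arbitrary
        r.client_kill(client["addr"])                # e.g. "10.0.0.5:53918"
```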
I've always suspected the kombu connection handling code, but never really had time to dig into it.
Was it always stuck reading from the socket? Or were there any syscalls in between? The thing is, redis-py is a blocking library, and kombu does not actually put the socket in non-blocking mode. I can imagine a situation where the event loop tells us that the socket is readable, then something else in redis-py reads from the socket, and then kombu starts the blocking read when there's nothing left to read. I don't know how difficult it is to use a non-blocking connection with redis-py, but that could possibly solve some of these issues.
As far as I remember it's stuck there, no other syscalls.
I am actually doing that in rb because of similar worries about things interfering. Look here: https://github.com/getsentry/rb/blob/fb9935fb750427338d90d46ca9af3c740360a31a/rb/clients.py#L31-L63
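To make the suspected race a bit more concrete, here is a minimal, self-contained sketch of the failure mode (illustrative only, not kombu's or rb's actual code): the event loop reports the fd as readable, another reader drains the socket first, and the subsequent blocking read then hangs; a socket timeout at least bounds the hang.

```python
import select
import socket

def read_event(sock, other_reader=None, timeout=30.0):
    # The event loop (select/poll/epoll) says the socket is readable...
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return None
    # ...but if some other component reads the pending bytes first,
    # the blocking recv() below would wait forever on an untimed socket.
    if other_reader is not None:
        other_reader(sock)
    sock.settimeout(timeout)          # bound the worst case instead of hanging
    try:
        return sock.recv(4096)
    except socket.timeout:
        return None                   # nothing left to read; recover gracefully
```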
I think I'm also seeing this. Again it's on AWS, Ubuntu Trusty 14.04 for the worker, using ElastiCache Redis (2.8.19) for the queue/results. When it happens the remote control interface is also taken out. Not noticed any pattern to it yet. When it happens the worker processes will be reading from a pipe (I'm assuming from the master process, and completely normal?):
However the master process gets stuck in a blocking read on its connection to the Redis box. Killing the client connection on the Redis side gets it moving again. Version-wise:
(Pre-fork as well, with 20 worker processes.)
Oh, one more thing to check: the issue started to appear once Redis more or less ran out of RAM. Not sure if that is relevant or not.
For us, the Redis monitoring says we've not dipped below 1.8GB of free memory.
I did not look at memory when playing with it, but I deployed it on a micro machine for testing, so I would not be surprised if it went low on memory.
I'm seeing this and hopefully I can shed some light. I'm running an older version of Sentry that uses Celery 3.0. As well, I'm running a separate worker pool (unrelated to Sentry) under Celery 3.1. The different worker pools each use their own dedicated Redis instance as a broker. This setup has been rock stable for me in two separate datacenters for over a year and I've never seen any hung workers. (The Sentry pool runs 6x6 workers on two hosts, the other pool runs 6x6 workers on 4 hosts. By 6x6 I mean I'm running 6 independent worker processes, each with a concurrency of 6.)

Recently, I've been working on setting up HA-Proxy as an outbound proxy on my worker nodes. So Celery/Kombu connects to localhost, which is HA-Proxy, then HA-Proxy connects on to Redis. (I'm using HA-Proxy to detect a master/slave Redis failover and reconnect to the new master.) And all of a sudden I'm seeing these worker hangs, both with the Sentry Celery 3.0 workers and the other Celery 3.1 workers.

Now here's the interesting part. I've configured HA-Proxy to drop idle TCP connections after 2 hours, and this is exactly the time frame in which I'm seeing the workers hang. This is surprising to me because I've been running Redis with a timeout of 2 hours since the beginning. The only difference I can think of is that, according to the Redis docs, Redis doesn't apply timeouts to any pub/sub connections. So a pub/sub connection without any activity would be timed out by HA-Proxy, but not normally by Redis.

I suspect that HA-Proxy is dropping the pub/sub connection after 2 hours, but Kombu isn't noticing somehow? Either that, or there's something different about how Redis drops an idle connection vs. how HA-Proxy does it. What I'm really confused about is why there are idle connections in the first place and why dropping them would cause the Celery workers to hang. These are all very busy workers. I'm going to try bumping up the HA-Proxy timeout to 24h and add TCP keep-alives to see if that helps.

Versions:
Seeing the same problem with both pools after trying to use HA-Proxy. Some log entries:
And it all looks good again. So the worker pool hung after 2 hours, then spontaneously recovered 2 hours after that. Another worker, same thing:
One more thing I notice... the last task before the pool hangs is completing (as evidenced by the log entries).
I'm almost certain this is an issue with partially closed sockets. I have seen really weird things in the past with those on a variety of systems, and as a result I typically run everything with TCP keepalives enabled. I will do another test to see if that changes anything.
I won't have more time to look at this till next week, but I'll try to repro it with shorter timeouts in HA-Proxy then. I think when I saw this I had keep-alives enabled on HA-Proxy, which may be why it recovered after two hours (I think that's the Linux default before it starts sending KAs).
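For reference, a small sketch of turning on TCP keepalives at the socket level, which is the mitigation being discussed here; the TCP_KEEP* constants are Linux-specific and the timing values are arbitrary examples, not recommendations from this thread:

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=6):
    # Send keepalive probes on an otherwise idle connection so a half-closed
    # or silently dropped peer is detected instead of blocking forever.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):   # the fine-grained knobs are Linux-only
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```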
Saw this today :(
Situation:
Workaround: Celery worker restart
Hello, at first glance this does not look like a Redis problem. To have a connection show up in that state, I can think of two conditions:
Don't remember if there are other cases, but quickly grepping the Redis source code suggests only the above two are possible. Both look suspicious, and I would bet more on condition number 1, that is, connections are created by workers and not used for some reason.
A few more ideas:
EDIT: Important, note that in the past
Thanks so much for the reply @antirez. My hunch is that the reason it shows up with low memory is just a side effect of the workers not processing any more, so you eventually have too much stuff on the queues. In my case I did end up with loads of stuff on them, but that's because pushing always worked, just popping no longer did. Since that was a micro instance on Amazon it does not take very long to exhaust memory there.
@antirez here's an example listing with CLIENT LIST:
@mitsuhiko makes sense indeed. Btw, @domenkozar's output shows that my guess was correct:
The connection is never used, just created. It's as if celery waits for a reply without ever sending a command, or something like that. Maybe it fails to subscribe and goes directly into the code path where it listens for messages?
So I have not had any more chances to investigate this yet, but I'm quite sure that something along these lines happens, based on random epoll errors coming up in stack traces way too often:
Now with a normal socket that should not be possible from my understanding, as it should fail with an error instead.
Please try reproducing with this patch enabled, as I have a hunch this could be the reason this is happening. I installed a handler for redis-py's Connection.disconnect, but it only removes the socket from the event loop, and it does not reset the
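A rough sketch of the idea behind such a patch, not the actual patch: wrap redis-py's Connection.disconnect so the file descriptor is unregistered from the event loop before the socket is closed. It assumes redis-py keeps its socket in `_sock` and that the event-loop object exposes a `remove(fd)` method; both are assumptions here.

```python
from redis.connection import Connection

_original_disconnect = Connection.disconnect

def install_disconnect_handler(hub):
    def disconnect(self, *args, **kwargs):
        sock = getattr(self, "_sock", None)
        if sock is not None:
            try:
                # Stop polling the fd before redis-py closes the socket, so the
                # event loop never fires on a stale or reused descriptor.
                hub.remove(sock.fileno())
            except (OSError, ValueError, KeyError):
                pass
        _original_disconnect(self, *args, **kwargs)
    Connection.disconnect = disconnect
```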
I will do a test with this over the weekend and keep you posted.
I had an issue very similar to what's been described. We just upgraded to the latest Kombu release (which includes celery/kombu@4681b6e) and we haven't had to restart all our workers in over 48 hours :)
That sounds good, @frewsxcv, thanks for reporting back! Are you on AWS?
Yep
It's been a week, so I'm assuming it has been fixed.
@ask I've also seen the same thing happening to our workers. After the workers run for a while, they stop consuming anything from the queues, and the only thing left to do is restart them. The workers are running with
Here is the list of connections that 2 unresponsive workers have with Redis:
@dqminh Please try upgrading to kombu 3.0.33, which makes sure the consumer connection is not used for anything else.
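A sketch of the principle that fix enforces, independent of the kombu internals: keep the connection that does the blocking consume separate from the one used for publishing, so nothing else ever reads from the consumer's socket. Broker URL and queue names below are placeholders.

```python
from kombu import Connection, Exchange, Queue

queue = Queue("tasks", Exchange("tasks"), routing_key="tasks")

consumer_conn = Connection("redis://localhost:6379/0")   # blocking reads only
producer_conn = Connection("redis://localhost:6379/0")   # publishes, everything else

def handle(body, message):
    print("got:", body)
    message.ack()

with consumer_conn.Consumer(queue, callbacks=[handle]):
    producer_conn.Producer().publish(
        {"hello": "world"},
        exchange=queue.exchange,
        routing_key="tasks",
        declare=[queue],
    )
    consumer_conn.drain_events(timeout=5)   # only this connection's socket is read
```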
@ask I tried this with gevent workers + a hard time limit. It looks like a worker established a lot of connections and/or the connections weren't collected properly? This is for a single worker with gevent, CHILD=100, PREFETCH_MULTIPLIER=1
99 is a lot more connections than I expected (and probably the same applies to 47 too :( )
@dqminh: You didn't mention you are using gevent; that must be a different issue. The connection count may make sense if you have
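For illustration, this is the kind of pool-capping configuration that keeps a gevent worker's connection count bounded; the setting names are Celery 3.x style, the values are examples, and this is not necessarily the setting @ask was referring to in the truncated sentence above:

```python
# celeryconfig.py (Celery 3.x style settings; values are examples only)
BROKER_URL = "redis://localhost:6379/0"

# Cap the shared broker connection pool so 100 greenlets cannot each open
# their own broker connection.
BROKER_POOL_LIMIT = 10

# Matches the PREFETCH_MULTIPLIER=1 setup mentioned above.
CELERYD_PREFETCH_MULTIPLIER = 1
```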
Updating to kombu 3.0.33 did not help for me. But... I didn't have
Having the same issue, and it seems to be resolved by upgrading to kombu 3.0.35. Worth noting that another machine of ours has 3.0.29 on it and doesn't seem to have this issue at all. So I would suggest that you upgrade to the latest kombu 3.0.x and see if it works for you.
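A quick way to confirm which versions a given worker host is actually running (purely illustrative):

```python
# Print the broker-related package versions on a worker host, to check which
# machines are still on an older kombu (e.g. 3.0.29 vs 3.0.35).
import celery
import kombu
import redis

print("celery :", celery.__version__)
print("kombu  :", kombu.__version__)
print("redis  :", redis.__version__)
```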
Having the same issue. I've tried to reproduce this in our staging environment without any luck. We are on AWS (using ElastiCache) and the same setup has been running for literally years, until quite recently when we started seeing tasks piling up. Worth saying that we have been running the same set of "quite" old versions for a good while, so the only thing which might have changed could be on the AWS side (a small Redis bump during a maintenance window perhaps...)
Has anybody experienced this same issue but with only the workers of a certain queue getting blocked? In order to do priorities we run several workers on each server, each one consuming tasks from different queues... and when we had this issue, only tasks of a certain queue piled up, and not the others. On subsequent incidents, different queues got stuck, btw.
@jorgebastida How about your volume of tasks? We've seen the problem scale with increased utilization.
Yes, the hang used to happen just after (or during) a sustained spike in usage in the 150/s range.
@jorgebastida Since the hang happens when using AWS & ElastiCache, I think it may relate to this situation:
So before the network problem, if some of the queues have no jobs while others do, the connection for the empty queue may hang there forever, even if jobs arrive in that queue later, because it connects to the read replica (the SLAVE), which removes all of its blocked clients. If you want to make sure whether this is the case for your situation, you can connect to the read replica and fire a
SOLUTION:
The Redis replication code is as follows:
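A sketch of the usual mitigation for that scenario: give the broker connection a socket timeout so a BRPOP left blocking against a demoted or unreachable endpoint eventually raises and the worker reconnects. The setting names are Celery 3.x style, the values are examples, and whether each option is honoured depends on the kombu version in use:

```python
# celeryconfig.py (illustrative values)
BROKER_URL = "redis://my-elasticache-endpoint:6379/0"   # placeholder endpoint

BROKER_TRANSPORT_OPTIONS = {
    # A read that sits blocked with no reply for this long raises an error
    # instead of hanging forever, letting the consumer re-establish itself.
    "socket_timeout": 30,
    # Also bound how long establishing a new connection may take.
    "socket_connect_timeout": 10,
}
```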
Sorry for reviving such an old thread, but this has been plaguing me for a few months. I have random nodes (in a 48-node cluster) that will at times go into the same state mentioned in this thread. The workers will scale up (max 10), and the parent process gets stuck in an endless loop
I am also running on AWS / ElastiCache, but am currently only using 340MB / 28G, so memory is not an issue.
Currently the only method to get this unstuck is a complete restart. This happens on average once a day, maybe a little less, to every node in the cluster. We average around 15 largish-bodied tasks / second.
Disregard. I just found #4185. I am testing the patch.
Maybe it helps to diagnose the issue with the following link: http://memos.ml/
So I have been battling this issue for a few days now, but it's really hard to figure out exactly what causes it. I had a repro for a day on an Amazon machine, but unfortunately it was a spot instance and I no longer have access to it. I tried to make another repro over the weekend and failed. Here is what I'm observing:
After some period of time everything will lock up. Ways I have seen this recover:
CLIENT KILL from the redis server (this will trigger some epoll errors).
We have a customer who also seems to have a related issue, but it recovers after 30 minutes. I have not had a way to reproduce this issue, but it might be related to the client timeout on Redis.
Things I know other than that:
I have a few theories on what this could be, but I'm honestly at the end of my understanding of the issue. What I see is that kombu internally catches random epoll errors, and when I add some prints there I can see that I get epoll errors even though the socket seems alive and happy. I'm wondering if it could be possible that for some reason the wrong fds are being registered with the poller.
Versions used:
I have since tried to reproduce this again on another Amazon machine and it does not appear to happen. I'm not entirely sure what triggers it, but it seems like the socket py-redis tries to read from is not what it thinks it is. The socket was, from what I can tell, in ESTABLISHED state.
I'm completely out of ideas on how to continue debugging this, especially because it's impossible for me to reproduce on anything other than this particular Amazon machine. It does not appear on OS X, or on two Linux VMs running in VirtualBox; it's not happening from one of my Linux servers at Hetzner to another one there, and I could also not reproduce it from Hetzner to Amazon, though in that particular case I saw some other artifacts that might have prevented debugging this (in particular the performance was quite bad, and it continued on for a day without incidents).
I first thought this might be #1845, but from what I can tell there is no high CPU usage involved. It just blocks on a read. If I can help in any way with investigating this further, please let me know.