Redis Queue Worker Gets Stuck #2906

Closed
mitsuhiko opened this issue Nov 2, 2015 · 40 comments

Comments

@mitsuhiko
Contributor

So I have been battling this issue for a few days now, but it's really hard to figure out exactly what causes it. I had a repro for a day on an Amazon machine, but unfortunately it was a spot instance and I no longer have access to it. I tried to make another repro over the weekend and failed. Here is what I'm observing:

  • two amazon machines (does not matter if in vpc or not)
  • install redis on one, celery on the other
  • run celery with four prefork workers
  • spam the queue with lots of tasks

After some period of time everything will lock up. Ways I have seen this recover:

  • restart celery
  • CLIENT KILL from the redis server (this will trigger some epoll errors).

We have a customer who also seems to have a related issue, but theirs recovers after 30 minutes. I have not had a way to reproduce that issue, but it might be related to the client timeout on Redis.

Things I know other than that:

  • strace: when it hangs, it's trying to read from a redis socket.
  • the read is _read_from_socket in SocketBuffer in py-redis
  • changing timeouts on the redis side changes nothing
  • enabling keepalives changes nothing
  • a dummy Python script that does BLPOP in a loop against the Redis server, paired with a second script that puts items into the queue, does not get stuck (a sketch of the two scripts is below).
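
A minimal sketch of such a pair of scripts (host and queue name are placeholders):

# consumer.py: BLPOP in a loop against the Redis server
import redis

r = redis.StrictRedis(host='redis-host', port=6379)
while True:
    item = r.blpop('test-queue')  # blocks until something is pushed
    print(item)

# producer.py: puts items into the queue as fast as possible
import redis

r = redis.StrictRedis(host='redis-host', port=6379)
i = 0
while True:
    r.lpush('test-queue', 'item-%d' % i)
    i += 1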

I have a few theories on what this could be, but I'm honestly at the end of my understanding of the issue. What I see is that kombu internally swallows random epoll errors, and when I add some prints there I can see that I get epoll errors even though the socket seems alive and happy. I'm wondering whether, for some reason, the wrong fds end up being registered with the event loop.

Versions used:

  • kombu==3.0.26
  • celery==3.1.18
  • redis==2.10.3
  • Redis server: 3.0.4
  • Happens with at least ubuntu 14.04

I have since tried to reproduce this again on another Amazon machine and it does not appear to happen. I'm not entirely sure what triggers it, but it seems like the socket py-redis tries to read from is not what it thinks it is. The socket was, from what I can tell, in ESTABLISHED.

I'm completely out of ideas on how to continue debugging this, especially because it's impossible for me to reproduce on anything other than that particular Amazon machine. It does not appear on OS X or on two Linux VMs running in VirtualBox, it's not happening from one of my Linux servers on Hetzner to another one there, and I could also not reproduce it from Hetzner to Amazon, though in that particular case I saw some other artifacts that might have prevented debugging this (in particular, the performance was quite bad and it continued on for a day without incident).

I first thought this might be #1845, but from what I can tell there is no high CPU usage involved. It just blocks on a read. If I can help in any way with investigating this further, please let me know.

@domenkozar

I've observed exactly the same issue on Amazon and logged my findings at #2606. In the end I figured out that listing all Redis connections with CLIENT LIST and killing the one with cmd=NULL unblocks all clients, but I never dug into why that connection exists or why it is blocking.
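
A rough sketch of that workaround with redis-py (connection parameters are placeholders):

# kill any connection that shows up as cmd=NULL in CLIENT LIST
import redis

r = redis.StrictRedis(host='redis-host', port=6379)
for client in r.client_list():
    if client.get('cmd') == 'NULL':
        r.client_kill(client['addr'])  # unblocks the stuck workers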

@mitsuhiko
Contributor Author

@domenkozar is that your solution at the moment? (To kill those connections in a cronjob?)

@domenkozar

Yes.

@domenkozar

I've always suspected kombu connection handling code, but never really had time to dig into it.

@ask
Contributor

ask commented Nov 2, 2015

Was it always stuck reading from the socket? Or were there any syscalls in between?

The fact that CLIENT KILL forces the worker to recover means that the socket is in the event loop and that the worker is reading from it. So I'm not sure how it could be the wrong fd, unless they are both in the event loop.

Redis-py is a blocking library, and kombu does not actually put the socket in nonblocking mode. I can imagine a situation where the event loop tells us that the socket is readable, then something else in redis-py reads from the socket, and then kombu starts the blocking read when there's nothing to actually read.

I don't know how difficult it is to use a non-blocking connection with redis-py, but that could possibly solve some of these issues.
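
To illustrate the failure mode (a sketch only, not kombu's code): if the readiness check and the read are separated, anything that drains the socket in between turns the later blocking read into an indefinite hang, whereas a non-blocking read would surface it as EAGAIN:

import errno
import socket

def read_if_ready(sock, n=65536):
    """Read without ever blocking; return None if the readiness info was stale."""
    sock.setblocking(False)
    try:
        return sock.recv(n)
    except socket.error as exc:
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None  # nothing there after all -- do not hang
        raise
    finally:
        sock.setblocking(True)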

@mitsuhiko
Contributor Author

Was it always stuck reading from the socket? Or were there any syscalls in between?

As far as I remember it's stuck there, no other syscalls.

I don't know how difficult it is to use a non-blocking connection with redis-py, but that could possibly solve some of these issues.

I am actually doing that in rb because of similar worries about things interfering. Look here: https://github.com/getsentry/rb/blob/fb9935fb750427338d90d46ca9af3c740360a31a/rb/clients.py#L31-L63

@Jc2k

Jc2k commented Nov 5, 2015

I think I'm seeing this too.

Again it's on AWS, Ubuntu trusty 14.04 for the worker, using ElastiCache Redis (2.8.19) for the queue/results.

When it happens the remote control interface is also taken out, so celery.control.inspect().ping() can't see the server (I'm going to try to make use of that to have autoscaling remove instances that hang).
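
A sketch of such a health check (the app import is a placeholder for your own Celery app):

# ping_workers.py: exit non-zero if no worker answers, so autoscaling can react
from myapp.celeryapp import app  # placeholder import

replies = app.control.inspect(timeout=5).ping()
if not replies:
    raise SystemExit(1)  # nothing answered within 5s: treat the node as hung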

Not noticed any pattern to it yet.

When it happens the worker processes will be reading from a pipe (from the master process, I'm assuming, which is completely normal?):

$ sudo strace -p 1484
Process 1484 attached
read(27, 
$ sudo lsof +p 1484
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
celery  1484 someuser   27r  FIFO    0,8      0t0   9937 pipe

However the master process gets stuck in a recvfrom:

$ sudo strace -p 1438
Process 1438 attached
recvfrom(114, 
$ sudo lsof +p 1438
COMMAND  PID   USER   FD   TYPE  DEVICE SIZE/OFF   NODE NAME
<snip>
celery  1438 user  114u  IPv4 1056394      0t0    TCP 10.1.1.81:33058->10.1.2.195:6379 (ESTABLISHED)

(Which is the redis box).

Killing the client for 10.1.2.195:6379 (which does show up in CLIENT LIST with cmd=NULL) allowed it to start processing tasks again. There was one other client in that state, which I killed as well, but the remote control API is still not able to see this celery worker.

Version wise:

kombu==3.0.26
celery==3.1.16
redis==2.10.3

(Pre-fork as well, with 20 worker processes).

@domenkozar

Oh, one more thing to check: the issue started to appear once Redis more or less ran out of RAM. Not sure whether that was relevant or not.

@Jc2k

Jc2k commented Nov 6, 2015

For us, the redis monitoring says we've not dipped below 1.8GB of free memory.

@mitsuhiko
Contributor Author

I did not look at memory when playing with it, but I deployed it on a micro machine for testing, so I would not be surprised if it ran low on memory.

@jaysoffian

I'm seeing this and hopefully I can shed some light. I'm running an older version of Sentry that uses Celery 3.0. As well, I'm running a separate worker pool (unrelated to Sentry) under Celery 3.1. The different worker pools each use their own dedicated Redis instance as a broker. This setup has been rock stable for me in two separate datacenters for over a year and I've never seen any hung workers. (The Sentry pool runs 6x6 workers on two hosts, the other pool runs 6x6 workers on 4 hosts. By 6x6 I mean I'm running 6 independent worker processes each with --concurrency=6 on each host.)

Recently, I've been working on setting up HA-Proxy as an outbound proxy on my worker nodes. So Celery/Kombu connects to localhost, which is HA-Proxy, then HA-Proxy connects on to Redis. (I'm using HA-Proxy to detect a master/slave Redis failover and reconnect to the new master.)

And all of a sudden I'm seeing these worker hangs, both with the Sentry Celery 3.0 workers and the other Celery 3.1 workers. Now here's the interesting part. I've configured HA-Proxy to drop idle TCP connections after 2 hours. And this is exactly the time frame in which I'm seeing the workers hang. This is surprising to me because I've been running Redis with a timeout of 2 hours since the beginning. The only difference I can think of is that, according to the Redis docs, Redis doesn't apply timeouts to pub/sub connections. So a pub/sub connection without any activity would be timed out by HA-Proxy, but not normally by Redis.

I suspect that HA-Proxy is dropping the pub/sub connection after 2 hours, but Kombu isn't noticing somehow? Either that or there's something different about how Redis drops an idle connection vs how HA-Proxy deals with it. What I'm really confused about is why there are idle connections in the first place and why dropping them would cause the Celery workers to hang. These are all very busy workers.

I'm going to try bumping up the HA-Proxy timeout to 24h and add a socket_timeout to the BROKER_TRANSPORT_OPTIONS.
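
Something along these lines, in Celery config terms (the timeout value here is illustrative):

# celeryconfig.py (sketch)
BROKER_TRANSPORT_OPTIONS = {
    'socket_timeout': 30,  # seconds; let hung reads fail instead of blocking forever
}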

Versions:

  • celery==3.0.25, kombu==2.5.16, redis==2.10.5 (Sentry workers)
  • celery==3.1.19, kombu==3.0.29, redis==2.10.5 (Other workers)

Seeing the same problem with both pools after trying to use HA-Proxy. Some log entries:

[2015-11-11 23:39:51,220: INFO/MainProcess] Connected to redis://localhost:6380//
[2015-11-11 23:39:51,231: INFO/MainProcess] mingle: searching for neighbors
[2015-11-11 23:39:52,248: INFO/MainProcess] mingle: sync with 18 nodes
[2015-11-11 23:39:52,249: INFO/MainProcess] mingle: sync complete
[2015-11-11 23:39:52,264: INFO/MainProcess] celery@processor-00.proc1 ready.
[2015-11-11 23:40:12,254: INFO/MainProcess] missed heartbeat from celery@processor-05.proc4
... a bunch of missed heartbeat messages as other workers are restarting ...
[2015-11-11 23:41:34,004: INFO/MainProcess] sync with celery@processor-05.proc2
[2015-11-11 23:41:34,020: INFO/MainProcess] sync with celery@processor-01.proc2
[2015-11-11 23:41:34,051: INFO/MainProcess] sync with celery@processor-04.proc2
[2015-11-11 23:41:34,054: INFO/MainProcess] sync with celery@processor-00.proc2
[2015-11-11 23:41:34,193: INFO/MainProcess] sync with celery@processor-03.proc2
[2015-11-11 23:41:34,196: INFO/MainProcess] sync with celery@processor-02.proc2
[2015-11-11 23:41:34,485: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[1f9444e7-e974-4901-ac3b-59bcee331ab4]

... lots and lots of tasks...

[2015-11-12 01:41:33,604: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[2f2a2b3c-e410-47a6-b9ad-e282e04ae80b]

... no more tasks received.... :-(

And then...

[2015-11-12 01:41:34,334: INFO/Worker-4] [2f2a2b3c-e410-47a6-b9ad-e282e04ae80b] processed crash_id=99d88eb1a1be4a4691ea292cb1857020 in 0.728821s (sentry_rt=26.4ms)
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-02.proc1
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-03.proc1
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-01.proc1
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-04.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-05.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-05.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-02.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-03.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-01.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-04.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-00.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-03.proc3
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-00.proc3
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-05.proc3
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-04.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-01.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-02.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-00.proc2
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-05.proc2
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-01.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-04.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-02.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-03.proc2
[2015-11-12 03:41:34,209: INFO/MainProcess] Task crash_processor.tasks.process_crash[2f2a2b3c-e410-47a6-b9ad-e282e04ae80b] succeeded in 7200.60400889s: None
[2015-11-12 03:41:34,210: WARNING/MainProcess] Substantial drift from celery@processor-05.proc2 may mean clocks are out of sync.  Current drift is
7200 seconds.  [orig: 2015-11-12 03:41:34.210229 recv: 2015-11-12 01:41:34.394871]
[2015-11-12 03:41:34,211: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[c2dd5b07-554b-48dc-951c-499e7b312710]

And it all looks good again. So the worker pool hung after 2 hours, then spontaneously recovered 2 hours after that. Another worker, same thing:

[2015-11-11 23:39:51,206: INFO/MainProcess] Connected to redis://localhost:6380//
[2015-11-11 23:39:51,221: INFO/MainProcess] mingle: searching for neighbors
[2015-11-11 23:39:52,239: INFO/MainProcess] mingle: sync with 18 nodes
[2015-11-11 23:39:52,239: INFO/MainProcess] mingle: sync complete
[2015-11-11 23:39:52,251: INFO/MainProcess] celery@processor-01.proc1 ready.
[2015-11-12 01:41:32,746: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[7e67c9db-e962-48f2-9feb-2d53d05ab8e4]
[2015-11-12 01:41:34,824: INFO/Worker-2] [7e67c9db-e962-48f2-9feb-2d53d05ab8e4] processed crash_id=bb6f805558bb47e3b05938aa9804cdec in 2.076387s (sentry_rt=932.4ms)
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-00.proc1
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-02.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-03.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-04.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-05.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-05.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-02.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-03.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-01.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-04.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-00.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-03.proc3
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-00.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-05.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-04.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-01.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-02.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-00.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-05.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-01.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-04.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-02.proc2
[2015-11-12 03:41:34,208: INFO/MainProcess] missed heartbeat from celery@processor-03.proc2
[2015-11-12 03:41:34,210: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[0681245d-c418-4c42-b8a5-acc2970bfd5c]
[2015-11-12 03:41:34,211: WARNING/MainProcess] Substantial drift from celery@processor-05.proc2 may mean clocks are out of sync.  Current drift is
7200 seconds.  [orig: 2015-11-12 03:41:34.211312 recv: 2015-11-12 01:41:34.394871]

[2015-11-12 03:41:34,212: INFO/MainProcess] Task crash_processor.tasks.process_crash[7e67c9db-e962-48f2-9feb-2d53d05ab8e4] succeeded in 7201.4646579s: None

One more thing I notice... the last task before the pool hangs is completing (as evidenced by the processed crash_id=bb6f805558bb47e3b05938aa9804cdec in 2.076387s message). But Celery doesn't indicate the task is complete until after the 2-hour hang, and says the task took two hours to complete.

@mitsuhiko
Contributor Author

I'm almost certain this is an issue with partially closed sockets. I have seen really weird things in the past with those on a variety of systems and as a result I typically run everything with TCP keepalives enabled. I will do another test to see if that changes anything.
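
For reference, enabling keepalives on a raw socket looks roughly like this (the idle/interval/count values are illustrative, and the TCP_KEEP* options are Linux-specific):

import socket

def enable_keepalive(sock, idle=60, interval=10, count=6):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # idle time before probing
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # interval between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # probes before giving up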

@jaysoffian

I won't have more time to look at this until next week, but I'll try to repro it with shorter timeouts in HA-Proxy then. I think when I saw this I had keep-alives enabled on HA-Proxy, which may be why it recovered after two hours (I think that's the Linux default idle time before it starts sending keepalives).

@mrkschan

See this today :(

Situation:
Redis on Amazon ElastiCache: around 60-120MB freeable memory, but using 13MB of swap.
Redis CLIENT LIST: shows a cmd=NULL client, which is the Celery worker process.
Celery worker: stuck in recv() according to strace.
Redis SLOWLOG: quite a number of 10+ second celeryev PUBLISH commands.
Celery version: celery[redis]==3.1.19

Workaround: Celery worker restart

@antirez

antirez commented Nov 28, 2015

Hello, at first glance this does not look like a Redis problem. Having the CLIENT LIST output when this happens would be useful anyway. Btw, note that cmd=NULL happens in two cases:

  1. The connection never received commands so far.
  2. The latest command received was not recognized.

Don't remember if there are other cases, but quickly grepping the Redis source code suggests only the above two are possible. Both look suspicious, and I would bet more on condition number 1, that is, connections are created by workers and not used for some reason.

@antirez

antirez commented Nov 28, 2015

A few more ideas:

  • Please tell me if you noticed this happening after a specific upgrade from an older version of Redis to a newer one.
  • Check if the Redis server is otherwise fine. But I guess so given you can use the CLIENT command.
  • It's not a solution, but you may want to set a client idle connections timeout in Redis as a quick&dirty way to fix this. See the timeout option in redis.conf.

EDIT: Important, note that in the past timeout 0 (which means no timeout) was not the default, but now it is. So it is normal if, after upgrading from old Redis versions, you now see this. In the past the idle-connection timeout, which was enabled by default, could allow clients to recover automatically even if this was a Celery bug.
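
The equivalent of that redis.conf change, applied at runtime via redis-py (the 300-second value is illustrative):

import redis

r = redis.StrictRedis(host='redis-host', port=6379)
r.config_set('timeout', 300)  # same as "timeout 300" in redis.conf: drop idle clients after 300s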

@mitsuhiko
Contributor Author

Thanks so much for the reply @antirez. My hunch is that the correlation with low memory is just a side effect of the workers no longer processing, so you eventually have too much stuff on the queues. In my case I did end up with lots of stuff on the queue, but that's because pushing always worked, only popping no longer did. Since that was a micro instance on Amazon, it does not take very long to exhaust memory there.

@domenkozar

@antirez an example listing with cmd=NULL can be found here #2606 (comment)

@antirez

antirez commented Nov 28, 2015

@mitsuhiko makes sense indeed. Btw, @domenkozar's output shows that my guess was correct:

age=1142 idle=1142

The connection is never used, just created. It's as if Celery waits for a reply without having sent a command at all. Maybe it fails to subscribe and goes directly into the code path where it listens for messages?

@mitsuhiko
Contributor Author

So I have not had a chance to investigate this further yet, but based on random epoll errors coming up in stack traces way too often, I'm quite sure that something along these lines happens:

  • unlike normal redis python usage, kombu splits command + read from socket into two independent code paths
  • in between those two steps the state of the socket changes fundamentally; in particular, the socket actually closes and a new one opens with exactly the same fd
  • when kombu comes back to read from it, it's reading from a socket that never got a command.

Now, with a normal socket that should not be possible from my understanding, as it should fail with EBADF, but that might not be the case if it goes through epoll, where only the raw fd number is registered. Maybe the actual interaction is that an event for the earlier fd is now picked up by the code handling the newer fd?
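
The fd-reuse part is easy to demonstrate (a sketch; the kernel typically hands out the lowest free fd number again):

import socket

s1 = socket.socket()
old_fd = s1.fileno()
s1.close()
s2 = socket.socket()
print(old_fd, s2.fileno())  # usually the same number: any bookkeeping keyed on the
                            # raw fd now silently refers to the new socket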

@ask
Contributor

ask commented Dec 4, 2015

Please try reproducing with this patch enabled, as I have a hunch this could be the reason this is happening.

I installed a handler for redis-py's Connection.disconnect, but it only removes the socket from the event loop; it does not reset the _in_poll and _in_listen flags that would cause the channel to resend the Redis commands.
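
The shape of the fix, as a sketch (this is not the actual kombu patch; it only illustrates the flag reset described above):

class ChannelSketch(object):
    _in_poll = False
    _in_listen = False

    def _on_connection_disconnect(self, connection):
        # Resetting these makes the channel re-issue BRPOP / SUBSCRIBE the next
        # time the connection is used, instead of waiting on a socket that
        # never received a command.
        self._in_poll = False
        self._in_listen = False
        # ...plus the existing cleanup that removes the socket from the event loop.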

@ask ask closed this as completed Dec 4, 2015
@ask ask reopened this Dec 4, 2015
@mitsuhiko
Contributor Author

I will do a test with this over the weekend and keep you posted.

ask added a commit to celery/kombu that referenced this issue Dec 5, 2015

@frewsxcv
Contributor

frewsxcv commented Dec 9, 2015

I had an issue very similar to what's been described. We just upgraded to the latest Kombu release (which includes celery/kombu@4681b6e) and we haven't had to restart all our workers in over 48 hours :)

@ask
Contributor

ask commented Dec 9, 2015

That sounds good, @frewsxcv, thanks for reporting back! Are you on AWS?

@frewsxcv
Contributor

frewsxcv commented Dec 9, 2015

Are you on AWS?

Yep

@ask
Contributor

ask commented Dec 12, 2015

It's been a week, so I'm assuming it has been fixed.

@ask ask closed this as completed Dec 12, 2015
@dqminh

dqminh commented Jan 12, 2016

@ask I've also seen the same thing happening to our workers. After the workers run for a while, they will not consume anything from the queues and the only thing left to do is restart them.

The workers are running with

  • celery-3.1.19
  • kombu-3.0.32
  • redis-2.10.3

Here is the list of connections that two unresponsive workers have with Redis:

addr=1.2.3.4:35303 fd=7538 name= age=66388 idle=62524 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:35304 fd=7539 name= age=66388 idle=66388 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=info
addr=1.2.3.4:35307 fd=7565 name= age=66388 idle=66388 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sadd
addr=1.2.3.4:35309 fd=7571 name= age=66388 idle=66388 flags=N db=0 sub=0 psub=1 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=psubscribe
addr=1.2.3.4:36088 fd=1369 name= age=66383 idle=63005 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:36117 fd=1500 name= age=66382 idle=62561 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:36192 fd=1684 name= age=66382 idle=62540 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:36195 fd=2200 name= age=66382 idle=62820 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:36824 fd=2064 name= age=66378 idle=63194 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:44870 fd=9202 name= age=66203 idle=62820 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:45316 fd=12737 name= age=65388 idle=63194 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:47360 fd=12805 name= age=65088 idle=62561 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:49109 fd=7242 name= age=66308 idle=62013 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ttl
addr=1.2.3.4:49239 fd=7399 name= age=66305 idle=62820 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:50999 fd=10433 name= age=64595 idle=63194 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:51000 fd=12835 name= age=64595 idle=63046 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:51001 fd=12834 name= age=64595 idle=64560 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:51002 fd=12836 name= age=64595 idle=62595 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:51003 fd=12837 name= age=64595 idle=63012 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:51145 fd=12839 name= age=64595 idle=64561 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:51146 fd=12838 name= age=64595 idle=64561 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:56122 fd=12753 name= age=65365 idle=62824 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:56905 fd=12676 name= age=65632 idle=62824 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:57909 fd=8240 name= age=66006 idle=62287 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:59254 fd=12859 name= age=64569 idle=62561 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:59313 fd=12860 name= age=64568 idle=62561 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:59314 fd=12861 name= age=64568 idle=62561 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:59353 fd=12862 name= age=64567 idle=62338 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:59551 fd=8801 name= age=66259 idle=62561 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:59552 fd=8802 name= age=66259 idle=5 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ttl
addr=1.2.3.4:60026 fd=11594 name= age=65904 idle=64560 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:53461 fd=7197 name= age=67311 idle=3 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ttl
addr=1.2.3.4:46126 fd=11580 name= age=66907 idle=63822 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:39587 fd=4681 name= age=67374 idle=63822 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:44605 fd=9351 name= age=67203 idle=63822 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:63839 fd=12390 name= age=66691 idle=63822 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:37827 fd=12726 name= age=66397 idle=63822 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:37825 fd=12727 name= age=66397 idle=63822 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:37826 fd=12725 name= age=66397 idle=63826 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:39398 fd=4516 name= age=67375 idle=64372 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:40066 fd=4745 name= age=67373 idle=63564 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:40542 fd=5038 name= age=67370 idle=66089 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:36825 fd=12721 name= age=66401 idle=63825 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:36813 fd=12720 name= age=66401 idle=63826 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:39384 fd=4378 name= age=67375 idle=66078 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:42583 fd=8888 name= age=67212 idle=64006 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:38305 fd=12708 name= age=66396 idle=63822 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:37471 fd=12723 name= age=66398 idle=63823 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:39303 fd=4376 name= age=67376 idle=63822 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lpush
addr=1.2.3.4:57278 fd=9687 name= age=67149 idle=63823 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:61670 fd=606 name= age=66840 idle=63523 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:36352 fd=4152 name= age=67389 idle=67389 flags=N db=0 sub=0 psub=1 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=psubscribe
addr=1.2.3.4:36305 fd=3484 name= age=67389 idle=64006 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=brpop
addr=1.2.3.4:36306 fd=3985 name= age=67389 idle=67389 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=sadd
addr=1.2.3.4:36303 fd=1855 name= age=67389 idle=67389 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=info
addr=1.2.3.4:36298 fd=29 name= age=67389 idle=63527 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:38478 fd=3709 name= age=67379 idle=66089 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:57368 fd=12688 name= age=66589 idle=64372 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:38025 fd=12303 name= age=66397 idle=66089 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:38023 fd=12304 name= age=66397 idle=64022 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:38024 fd=12330 name= age=66397 idle=66089 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:38027 fd=12331 name= age=66397 idle=64009 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec
addr=1.2.3.4:38026 fd=12333 name= age=66397 idle=66089 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=exec

@ask
Contributor

ask commented Jan 12, 2016

@dqminh Please try upgrading to kombu 3.0.33, which makes sure the consumer connection is not used for anything else.

@dqminh

dqminh commented Jan 13, 2016

@ask I tried this with gevent workers + a hard time limit. It looks like a worker established a lot of connections and/or the connections weren't collected properly?

This is for a single worker with gevent, CHILD=100, PREFETCH_MULTIPLIER=1

# new worker on kombu 3.0.33
> cat connection.list  | grep 6379 | wc -l
99

# old worker on kombu 3.0.32
> cat connection.list | grep 6379 | wc -l
47

99 is a lot more connections than I expected (and the same probably applies to 47 too :( )

@ask
Contributor

ask commented Jan 13, 2016

@dqminh: You didn't mention you were using gevent; that must be a different issue.

The connection count may make sense if you have -c 100 and store results in redis, but the changes in 3.0.33 could definitely have contributed to growing the number of connections.

@michi88

michi88 commented Mar 3, 2016

Updating to kombu 3.0.33 did not help for me. But... I didn't have BROKER_TRANSPORT_OPTIONS = {'socket_timeout': 5} and -Ofair; trying with those now. Could Flower have anything to do with this as well? This started when we added Flower and Redis as the (result) backend.

@meetwudi

meetwudi commented Apr 22, 2016

Having the same issue, and it seems to be resolved by upgrading to kombu 3.0.35. Worth noting that another machine of ours has 3.0.29 on it and doesn't seem to have this issue at all. So I would suggest upgrading to the latest kombu 3.0.x and seeing if it works for you.

@jorgebastida

Having the same issue. I've tried to reproduce this in our staging environment without any luck.

We are on AWS (using ElastiCache) and the same setup has been running for literally years until quite recently, when we started seeing tasks piling up.

Worth saying that we have been running the same set of "quite" old versions for a good while, so the only thing which might have changed could be on the AWS side (a small Redis bump during a maintenance window, perhaps...).

celery==3.1.12
kombu==3.0.26
redis==2.9.1
redis-server 2.8.19 (elasticache)

Has anybody experienced this same issue with only workers of a certain queue getting blocked? In order to do priorities we run several workers on each server, each one consuming tasks from different queues... and when we had this issue, only tasks of a certain queue piled up, and not the others. On subsequent incidents, different queues got stuck, btw.

@aarcro
Contributor

aarcro commented Jul 5, 2016

@jorgebastida How about your volume of tasks? We've seen the problem scale with increased utilization.

@jorgebastida

Yes, the hang used to happen just after (or during) a sustained spike in usage in the 150 tasks/s range.

@yancl

yancl commented Oct 10, 2016

@jorgebastida Since the hang happens when using AWS & ElastiCache, I think it may relate to this situation:

  • A cache cluster with the primary node in zone-a and a read replica in zone-b
  • Access the cluster through the primary endpoint (which resolves to the primary node through DNS)
  • Shoot a BLPOP at the cache cluster and keep a long-lived connection to the server
  • There is a network problem (jitter?) in AWS
  • The cache monitor sees that and promotes the replica to be master
  • The cache monitor makes a DNS update (pointing the primary endpoint at the read replica)

So before the network problem, if some of the queues have no jobs while others do, the connection for the empty queue may hang there forever, even if jobs arrive in that queue later, because it is now connected to the read replica (the SLAVE), which removes all the blocking clients from the blocked keys.

If you want to check whether this is the case in your situation, you can connect to the read replica and fire a CLIENT LIST to see if there is a BLPOP command when the hang appears :)

SOLUTION:
Add a timeout to BLPOP, or change the timeout 0 setting in the Redis conf, if upgrading the kombu version does not help.
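
A sketch of the bounded BLPOP (queue name, host, and the 30-second timeout are illustrative):

import redis

r = redis.StrictRedis(host='primary-endpoint', port=6379)
while True:
    item = r.blpop('myqueue', timeout=30)  # returns None after 30s instead of blocking forever
    if item is None:
        continue  # gives the client a chance to notice a failover / dead connection
    queue, payload = item
    # ...process payload...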

The relevant Redis replication code is as follows:

//replication.c
/* Mass-unblock clients because something changed in the instance that makes
 * blocking no longer safe. For example clients blocked in list operations
 * in an instance which turns from master to slave is unsafe, so this function
 * is called when a master turns into a slave.
 *
 * The semantics is to send an -UNBLOCKED error to the client, disconnecting
 * it at the same time. */
void disconnectAllBlockedClients(void) {
    listNode *ln;
    listIter li;

    listRewind(server.clients,&li);
    while((ln = listNext(&li))) {
        redisClient *c = listNodeValue(ln);

        if (c->flags & REDIS_BLOCKED) {
            addReplySds(c,sdsnew(
                "-UNBLOCKED force unblock from blocking operation, "
                "instance state changed (master -> slave?)\r\n"));
            unblockClientWaitingData(c);
            c->flags |= REDIS_CLOSE_AFTER_REPLY;
        }
    }
}

//t_list.c
/* Unblock a client that's waiting in a blocking operation such as BLPOP */
void unblockClientWaitingData(redisClient *c) {
    dictEntry *de;
    dictIterator *di;
    list *l;

    redisAssertWithInfo(c,NULL,dictSize(c->bpop.keys) != 0);
    di = dictGetIterator(c->bpop.keys);
    /* The client may wait for multiple keys, so unblock it for every key. */
    while((de = dictNext(di)) != NULL) {
        robj *key = dictGetKey(de);

        /* Remove this client from the list of clients waiting for this key. */
        l = dictFetchValue(c->db->blocking_keys,key);
        redisAssertWithInfo(c,key,l != NULL);
        listDelNode(l,listSearchKey(l,c));
        /* If the list is empty we need to remove it to avoid wasting memory */
        if (listLength(l) == 0)
            dictDelete(c->db->blocking_keys,key);
    }
    dictReleaseIterator(di);

    /* Cleanup the client structure */
    dictEmpty(c->bpop.keys,NULL);
    if (c->bpop.target) {
        decrRefCount(c->bpop.target);
        c->bpop.target = NULL;
    }
    c->flags &= ~REDIS_BLOCKED;
    c->flags |= REDIS_UNBLOCKED;
    server.bpop_blocked_clients--;
    listAddNodeTail(server.unblocked_clients,c);
}

@schleppy

schleppy commented Apr 26, 2018

Sorry for reviving such an old thread, but this has been plaguing me for a few months. I have random nodes (in a 48-node cluster) that will at times go into the same state mentioned in this thread. The workers will scale up (max 10), and the parent process gets stuck in an endless loop:

clock_gettime(CLOCK_MONOTONIC, {1935475, 380193849}) = 0
epoll_wait(15, {{EPOLLIN, {u32=33, u64=40959556913725473}}}, 1023, 589) = 1
recvfrom(33, "*4\r\n$8\r\npmessage\r\n$20\r\n/0.celery"..., 65536, 0, NULL, NULL) = 814
gettimeofday({1524727274, 889516}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
gettimeofday({1524727274, 889658}, NULL) = 0
select(34, [33], [], [], {0, 0})        = 0 (Timeout)
epoll_ctl(15, EPOLL_CTL_DEL, 37, {EPOLLRDNORM|EPOLLRDBAND|EPOLLWRNORM|EPOLLWRBAND|EPOLLMSG|EPOLLERR|EPOLLONESHOT|0xaa80800, {u32=32734, u64=22544854567387102}}) = -1 ENOENT (No such file or directory)
epoll_ctl(15, EPOLL_CTL_ADD, 33, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=33, u64=40959556913725473}}) = -1 EEXIST (File exists)
epoll_ctl(15, EPOLL_CTL_ADD, 30, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=30, u64=40959556913725470}}) = -1 EEXIST (File exists)
epoll_ctl(15, EPOLL_CTL_ADD, 23, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=23, u64=40959556913725463}}) = -1 EEXIST (File exists)

I am also running on AWS / ElastiCache but am currently using only 340MB of 28G, so memory is not an issue.

celery 4.1.0
kombu 4.1.0
prefork worker

Currently the only way to get this unstuck is a complete restart. This happens on average once a day, maybe a little less, to every node in the cluster. We average around 15 tasks/second with largish bodies.

@schleppy

Disregard. I just found #4185. I am testing the patch.

@wilderchen

Maybe it helps to diagnose the issue with the following link: http://memos.ml/
