
Redis Queue Worker Gets Stuck #2906

Closed

Description

mitsuhiko (Contributor)

So I have been battling this issue for a few days now but it's really hard to figure out exactly what causes it. I had a repro for a day on an amazon machine but unfortunately it was a spot instance and I no longer have access to it. I tried to make another repro on the weekend and failed. Here is what I'm observing:

  • two amazon machines (does not matter if in a VPC or not)
  • install redis on one, celery on the other
  • run celery with four prefork workers
  • spam the queue with lots of tasks

After some period of time everything will lock up. Ways I have seen this recover:

  • restart celery
  • CLIENT KILL from the redis server (this will trigger some epoll errors).

We have a customer who also seems to have a related issue, but it recovers after 30 minutes. I have not found a way to reproduce this issue, but it might be related to the client timeout on redis.

Things I know other than that:

  • strace: when it hangs, it's trying to read from a redis socket.
  • the read is _read_from_socket in SocketBuffer in py-redis
  • changing timeouts on the redis side changes nothing
  • enabling keepalives changes nothing
  • a dummy python script that does BLPOP in a loop against the redis server with a second script that puts items into the queue does not get stuck.
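
For reference, the kind of dummy BLPOP test meant in the last point is roughly this (a sketch only; host, port and queue name are placeholders):

# consumer.py -- loop on BLPOP against the redis server
import redis

r = redis.StrictRedis(host='10.0.0.1', port=6379)  # placeholder broker address
while True:
    item = r.blpop('testqueue', timeout=30)  # blocks up to 30s waiting for an item
    if item:
        print('got %s' % item[1])

# producer.py -- run on a second machine, keeps the queue filled
import redis

r = redis.StrictRedis(host='10.0.0.1', port=6379)
for i in range(1000000):
    r.rpush('testqueue', 'item-%d' % i)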

I have a few theories on what this could be, but I'm honestly at the end of my understanding of the issue. What I see is that kombu internally catches random epoll errors, and when I add some prints there I can see that I get epoll errors even though the socket seems alive and happy. I'm wondering if it could be possible that for some reason the wrong fds are being registered with the poller.

Versions used:

  • kombu==3.0.26
  • celery==3.1.18
  • redis==2.10.3
  • Redis server: 3.0.4
  • Happens on at least Ubuntu 14.04

I have since tried to reproduce this again on another amazon machine and it does not appear to happen there. I'm not entirely sure what triggers it, but it seems like the socket py-redis tries to read from is not what it thinks it is. The socket was, from what I can tell, in ESTABLISHED state.

I'm completely out of ideas how to continue debugging this, especially because it's impossible for me to reproduce on anything other than this particular amazon machine. It does not appear on OS X or on two linux VMs running in VirtualBox, it's not happening from one of my linux servers on Hetzner to another one there, and I could also not reproduce it from Hetzner to amazon, though in that particular case I saw some other artifacts that might have prevented debugging this (in particular the performance was quite bad and it continued for a day without incidents).

I first thought this might be #1845, but from what I can tell there is no high CPU usage involved. It just blocks on a read. If I can help in any way with investigating this further, please let me know.

Activity

domenkozar commented on Nov 2, 2015

I've observed exactly the same issue on Amazon and logged my input at #2606. In the end I figured out that listing all redis connections with CLIENT LIST and killing the one with cmd=NULL unblocks all clients, but I never dug into why that connection exists or why it is blocking.
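
For anyone who wants to script that workaround instead of running it by hand, a minimal sketch with redis-py could look like this (host and port are placeholders):

# sketch of the CLIENT LIST / CLIENT KILL workaround
import redis

r = redis.StrictRedis(host='localhost', port=6379)  # placeholder address
for client in r.client_list():
    # the stuck connections show up with cmd=NULL in CLIENT LIST
    if client.get('cmd') == 'NULL':
        print('killing %s' % client['addr'])
        r.client_kill(client['addr'])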

mitsuhiko (Contributor, Author) commented on Nov 2, 2015

@domenkozar is that your solution at the moment? (To kill those connections in a cronjob?)

domenkozar commented on Nov 2, 2015

Yes.

domenkozar commented on Nov 2, 2015

I've always suspected kombu connection handling code, but never really had time to dig into it.

ask (Contributor) commented on Nov 2, 2015

Was it always stuck reading from the socket? Or were there any syscalls in between?

That doing CLIENT KILL forces the worker to recover means that the socket is in the event loop and that the worker is reading from it. So I'm not sure how it could be the wrong fd, unless they are both in the event loop.

Redis-py is a blocking library, and kombu does not actually put the socket in nonblocking mode. I can imagine a situation where the event loop tells us that the socket is readable, then something else in redis-py reads from the socket, and then kombu starts the blocking read when there's nothing to actually read.

I don't know how difficult it is to use a non-blocking connection with redis-py, but that could possibly solve some of these issues.
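
To illustrate the race described above: a blocking read is only safe if nothing else consumes the data between the readability notification and the read. A rough sketch of guarding the read (not what kombu currently does):

import select

def read_if_ready(sock, bufsize=4096):
    # poll the socket with a zero timeout instead of trusting an earlier
    # readability notification; if another reader already drained the data,
    # this returns None instead of blocking forever in recv()
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return None
    return sock.recv(bufsize)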

mitsuhiko (Contributor, Author) commented on Nov 2, 2015

Was it always stuck reading from the socket? Or were there any syscalls in between?

As far as I remember it's stuck there, no other syscalls.

I don't know how difficult it is to use a non-blocking connection with redis-py, but that could possibly solve some of these issues.

I am actually doing that in rb because of similar worries about things interfering. Look here: https://github.com/getsentry/rb/blob/fb9935fb750427338d90d46ca9af3c740360a31a/rb/clients.py#L31-L63

Jc2k commented on Nov 5, 2015

I think I'm seeing this too.

Again it's on AWS, Ubuntu Trusty 14.04 for the worker, and ElastiCache Redis (2.8.19) for the queue/results.

When it happens the remote control interface is also taken out, so celery.control.inspect().ping() can't see the worker (I'm going to try to make use of that to have autoscaling remove instances that hang).

Not noticed any pattern to it yet.
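
For reference, the ping-based liveness check mentioned above is roughly this (broker URL and timeout are placeholders):

from celery import Celery

app = Celery(broker='redis://10.1.2.195:6379/0')  # placeholder broker URL
replies = app.control.inspect(timeout=5).ping()
if not replies:
    # no worker answered in time -- treat the instance as hung and recycle it
    print('no pong received')
else:
    print('alive: %s' % sorted(replies))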

When it happens, the worker processes will be receiving from a pipe (I'm assuming from the master process, and completely normal?):

$ sudo strace -p 1484
Process 1484 attached
read(27,

$ sudo lsof +p 1484
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
celery  1484 someuser   27r  FIFO    0,8      0t0   9937 pipe

However the master process gets stuck in a recvfrom:

$ sudo strace -p 1438
Process 1438 attached
recvfrom(114, 
$ sudo lsof +p 1438
COMMAND  PID   USER   FD   TYPE  DEVICE SIZE/OFF   NODE NAME
<snip>
celery  1438 user  114u  IPv4 1056394      0t0    TCP 10.1.1.81:33058->10.1.2.195:6379 (ESTABLISHED)

(Which is the redis box).

Killing client 10.1.2.195:6379 (which does show up in CLIENT LIST as 'cmd': 'NULL') allowed it to start processing tasks again. There was one other client in that state, which I killed as well, but the remote control API is still not able to see this celery worker.

Version wise:

kombu==3.0.26
celery==3.1.16
redis==2.10.3

(Pre-fork as well, with 20 worker processes).

domenkozar commented on Nov 5, 2015

Oh, one more thing to check: the issue started to appear once redis more or less ran out of RAM. Not sure if that was relevant or not.

Jc2k commented on Nov 6, 2015

For us, the redis monitoring says we've not dipped below 1.8GB of free memory.

mitsuhiko (Contributor, Author) commented on Nov 6, 2015

I did not look at memory when playing with it, but I deployed it on a micro machine for testing, so I would not be surprised if it went low on memory.

jaysoffian commented on Nov 12, 2015

I'm seeing this and hopefully I can shed some light. I'm running an older version of Sentry that uses Celery 3.0. As well, I'm running a separate worker pool (unrelated to Sentry) under Celery 3.1. The different worker pools each use their own dedicated Redis instance as a broker. This setup has been rock stable for me in two separate datacenters for over a year and I've never seen any hung workers. (The Sentry pool runs 6x6 workers on two hosts, the other pool runs 6x6 workers on 4 hosts. By 6x6 I mean I'm running 6 independent worker processes each with --concurrency=6 on each host.)

Recently, I've been working on setting up HA-Proxy as an outbound proxy on my worker nodes. So Celery/Kombu connects to localhost, which is HA-Proxy, then HA-Proxy connects on to Redis. (I'm using HA-Proxy to detect a master/slave Redis failover and reconnect to the new master.)

And all of a sudden I'm seeing these worker hangs, both with the Sentry Celery 3.0 workers and the other Celery 3.1 workers. Now here's the interesting part. I've configured HA-Proxy to drop idle TCP connections after 2 hours, and this is exactly the time frame in which I'm seeing the workers hang. This is surprising to me because I've been running Redis with a timeout of 2 hours since the beginning. The only difference I can think of is that, according to the Redis docs, Redis doesn't apply timeouts to pub/sub connections. So a pub/sub connection without any activity would be timed out by HA-Proxy, but not normally by Redis.

I suspect that HA-Proxy is dropping the pub/sub connection after 2 hours, but Kombu isn't noticing somehow? Either that or there's something different about how Redis drops an idle connection vs how HA-Proxy deals with it. What I'm really confused about is why there are idle connections in the first place and why dropping them would cause the Celery workers to hang. These are all very busy workers.

I'm going to try bumping the HA-Proxy timeout up to 24h and adding a socket_timeout to the BROKER_TRANSPORT_OPTIONS.
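
For reference, the socket_timeout change is just a transport option like the following (the value here is only an example):

# celeryconfig.py
BROKER_URL = 'redis://localhost:6380/0'  # HA-Proxy in front of Redis
BROKER_TRANSPORT_OPTIONS = {
    'socket_timeout': 30,  # seconds; a blocked read raises after this instead of hanging forever
}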

Versions:

  • celery==3.0.25, kombu==2.5.16, redis==2.10.5 (Sentry workers)
  • celery==3.1.19, kombu==3.0.29, redis==2.10.5 (Other workers)

Seeing the same problem with both pools after trying to use HA-Proxy. Some log entries:

[2015-11-11 23:39:51,220: INFO/MainProcess] Connected to redis://localhost:6380//
[2015-11-11 23:39:51,231: INFO/MainProcess] mingle: searching for neighbors
[2015-11-11 23:39:52,248: INFO/MainProcess] mingle: sync with 18 nodes
[2015-11-11 23:39:52,249: INFO/MainProcess] mingle: sync complete
[2015-11-11 23:39:52,264: INFO/MainProcess] celery@processor-00.proc1 ready.
[2015-11-11 23:40:12,254: INFO/MainProcess] missed heartbeat from celery@processor-05.proc4
... a bunch of missed heartbeat messages as other workers are restarting ...
[2015-11-11 23:41:34,004: INFO/MainProcess] sync with celery@processor-05.proc2
[2015-11-11 23:41:34,020: INFO/MainProcess] sync with celery@processor-01.proc2
[2015-11-11 23:41:34,051: INFO/MainProcess] sync with celery@processor-04.proc2
[2015-11-11 23:41:34,054: INFO/MainProcess] sync with celery@processor-00.proc2
[2015-11-11 23:41:34,193: INFO/MainProcess] sync with celery@processor-03.proc2
[2015-11-11 23:41:34,196: INFO/MainProcess] sync with celery@processor-02.proc2
[2015-11-11 23:41:34,485: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[1f9444e7-e974-4901-ac3b-59bcee331ab4]

... lots and lots of tasks...

[2015-11-12 01:41:33,604: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[2f2a2b3c-e410-47a6-b9ad-e282e04ae80b]

... no more tasks received.... :-(

And then...

[2015-11-12 01:41:34,334: INFO/Worker-4] [2f2a2b3c-e410-47a6-b9ad-e282e04ae80b] processed crash_id=99d88eb1a1be4a4691ea292cb1857020 in 0.728821s (sentry_rt=26.4ms)
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-02.proc1
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-03.proc1
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-01.proc1
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-04.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-05.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-05.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-02.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-03.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-01.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-04.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-00.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-03.proc3
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-00.proc3
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-05.proc3
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-04.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-01.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-02.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-00.proc2
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-05.proc2
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-01.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-04.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-02.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-03.proc2
[2015-11-12 03:41:34,209: INFO/MainProcess] Task crash_processor.tasks.process_crash[2f2a2b3c-e410-47a6-b9ad-e282e04ae80b] succeeded in 7200.60400889s: None
[2015-11-12 03:41:34,210: WARNING/MainProcess] Substantial drift from celery@processor-05.proc2 may mean clocks are out of sync.  Current drift is
7200 seconds.  [orig: 2015-11-12 03:41:34.210229 recv: 2015-11-12 01:41:34.394871]
[2015-11-12 03:41:34,211: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[c2dd5b07-554b-48dc-951c-499e7b312710]

And it all looks good again. So the worker pool hung after 2 hours, then spontaneously recovered 2 hours after that. Another worker, same thing:

[2015-11-11 23:39:51,206: INFO/MainProcess] Connected to redis://localhost:6380//
[2015-11-11 23:39:51,221: INFO/MainProcess] mingle: searching for neighbors
[2015-11-11 23:39:52,239: INFO/MainProcess] mingle: sync with 18 nodes
[2015-11-11 23:39:52,239: INFO/MainProcess] mingle: sync complete
[2015-11-11 23:39:52,251: INFO/MainProcess] celery@processor-01.proc1 ready.
[2015-11-12 01:41:32,746: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[7e67c9db-e962-48f2-9feb-2d53d05ab8e4]
[2015-11-12 01:41:34,824: INFO/Worker-2] [7e67c9db-e962-48f2-9feb-2d53d05ab8e4] processed crash_id=bb6f805558bb47e3b05938aa9804cdec in 2.076387s (sentry_rt=932.4ms)
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-00.proc1
[2015-11-12 03:41:34,203: INFO/MainProcess] missed heartbeat from celery@processor-02.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-03.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-04.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-05.proc1
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-05.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-02.proc4
[2015-11-12 03:41:34,204: INFO/MainProcess] missed heartbeat from celery@processor-03.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-01.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-04.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-00.proc4
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-03.proc3
[2015-11-12 03:41:34,205: INFO/MainProcess] missed heartbeat from celery@processor-00.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-05.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-04.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-01.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-02.proc3
[2015-11-12 03:41:34,206: INFO/MainProcess] missed heartbeat from celery@processor-00.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-05.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-01.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-04.proc2
[2015-11-12 03:41:34,207: INFO/MainProcess] missed heartbeat from celery@processor-02.proc2
[2015-11-12 03:41:34,208: INFO/MainProcess] missed heartbeat from celery@processor-03.proc2
[2015-11-12 03:41:34,210: INFO/MainProcess] Received task: crash_processor.tasks.process_crash[0681245d-c418-4c42-b8a5-acc2970bfd5c]
[2015-11-12 03:41:34,211: WARNING/MainProcess] Substantial drift from celery@processor-05.proc2 may mean clocks are out of sync.  Current drift is
7200 seconds.  [orig: 2015-11-12 03:41:34.211312 recv: 2015-11-12 01:41:34.394871]

[2015-11-12 03:41:34,212: INFO/MainProcess] Task crash_processor.tasks.process_crash[7e67c9db-e962-48f2-9feb-2d53d05ab8e4] succeeded in 7201.4646579s: None

One more thing I notice... the last task before the pool hangs does complete (as evidenced by the processed crash_id=bb6f805558bb47e3b05938aa9804cdec in 2.076387s message). But celery doesn't indicate the task is complete until after the 2 hour hang, and it says the task took two hours to complete.

mitsuhiko (Contributor, Author) commented on Nov 12, 2015

I'm almost certain this is an issue with partially closed sockets. I have seen really weird things in the past with those on a variety of systems and as a result I typically run everything with TCP keepalives enabled. I will do another test to see if that changes anything.
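
For reference, enabling keepalives on a raw socket looks roughly like this (Linux option names; the timing values are examples only):

import socket

def enable_keepalive(sock, idle=60, interval=10, count=6):
    # turn on keepalive probes for an otherwise idle TCP connection
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # start probing after `idle` seconds of silence, probe every `interval`
    # seconds, and give up (killing the connection) after `count` failed probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)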

35 remaining items
