Description
So I have been battling this issue for a few days now but it's really hard to figure out exactly what causes it. I had a repro for a day on an amazon machine but unfortunately it was a spot instance and I no longer have access to it. I tried to make another repro on the weekend and failed. Here is what I'm observing:
- two amazon machines (does not matter if in vpc or not)
- install redis on one, celery on the other
- run celery with four prefork workers
- spam the queue with lots of tasks
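For illustration, a minimal sketch of that setup (module name, broker URL and task body are placeholders, not the exact repro):

```python
# tasks.py -- illustrative sketch of the setup above; module name, broker
# URL and task body are placeholders.  Start the worker with:
#   celery -A tasks worker --concurrency=4
from celery import Celery

app = Celery('tasks', broker='redis://redis-host:6379/0')

@app.task
def noop(i):
    # trivial task; the hang does not seem to depend on what the task does
    return i

if __name__ == '__main__':
    # "spam the queue with lots of tasks"
    for i in range(100000):
        noop.delay(i)
```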
After some period of time everything will lock up. Ways I have seen this recover:
- restart celery
- CLIENT KILL from the redis server (this will trigger some epoll errors).
We have a customer who also seems to have a related issue, but it recovers after 30 minutes. I have not found a way to reproduce this issue, but it might be related to the client timeout on redis.
Things I know other than that:
- strace: when it hangs, it's trying to read from a redis socket.
- the read is _read_from_socket in SocketBuffer in py-redis
- changing timeouts on the redis side changes nothing
- enabling keepalives changes nothing
- a dummy python script that does BLPOP in a loop against the redis server, paired with a second script that puts items into the queue, does not get stuck.
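A sketch of what such a pair of scripts might look like (host and queue name are placeholders):

```python
# blpop_check.py -- illustrative only; host and queue name are placeholders.
# Run "python blpop_check.py push" in one shell and "python blpop_check.py"
# in another.
import sys
import time

import redis

r = redis.StrictRedis(host='redis-host', port=6379)

if len(sys.argv) > 1 and sys.argv[1] == 'push':
    while True:
        r.rpush('testqueue', 'x')   # producer: keep feeding the queue
        time.sleep(0.01)
else:
    while True:
        item = r.blpop('testqueue', timeout=0)  # consumer: block until data
        print(item)
```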
I have a few theories on what this could be, but I'm honestly at the end of my understanding of the issue. What I see is that kombu internally swallows random epoll errors, and when I add some prints there I can see that I get epoll errors even though the socket seems alive and healthy. I'm wondering whether it could be that, for some reason, the wrong fds are being registered with the event loop.
Versions used:
- kombu==3.0.26
- celery==3.1.18
- redis==2.10.3
- Redis server: 3.0.4
- Happens with at least ubuntu 14.04
I have since tried to reproduce this again on another amazon machine and it does not appear to happen there. I'm not entirely sure what triggers it, but it seems like the socket py-redis tries to read from is not what it thinks it is. From what I can tell, the socket was in ESTABLISHED.
I'm completely out of ideas for how to continue debugging this, especially because it's impossible for me to reproduce on anything other than that particular amazon machine. It does not appear on OS X or on two linux VMs running in virtualbox, it does not happen from one of my linux servers on hetzner to another one there, and I could also not reproduce it from hetzner to amazon, though in that case I saw some other artifacts that might have prevented debugging this (in particular the performance was quite bad and it ran for a day without incident).
I first thought this might be #1845, but from what I can tell there is no high CPU usage involved. It just blocks on a read. If I can help in any way with investigating this further, please let me know.
Activity
domenkozar commented on Nov 2, 2015
I've observed exactly the same issue on Amazon and logged my findings at #2606. In the end I figured out that listing all redis connections with CLIENT LIST and killing the one with cmd=NULL unblocks all clients, but I never dug into why that connection exists or why it is blocking.
mitsuhiko commented on Nov 2, 2015
@domenkozar is that your solution at the moment? (To kill those connections in a cronjob?)
domenkozar commented on Nov 2, 2015
Yes.
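For illustration, such a cronjob could be roughly like the sketch below; it uses redis-py's client_list()/client_kill() wrappers, the host is a placeholder, and it is not necessarily the exact script in use:

```python
# kill_stuck_clients.py -- illustrative watchdog; the redis host is a
# placeholder and this is not necessarily the exact script being run.
import redis

r = redis.StrictRedis(host='redis-host', port=6379)

for client in r.client_list():
    # the stuck connection shows up in CLIENT LIST with cmd=NULL
    if client.get('cmd') == 'NULL':
        print('killing %s' % client['addr'])
        r.client_kill(client['addr'])
```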
domenkozar commented on Nov 2, 2015
I've always suspected kombu connection handling code, but never really had time to dig into it.
ask commented on Nov 2, 2015
Was it always stuck reading from the socket? Or were there any syscalls in between?
That doing CLIENT KILL forces the worker to recover means that the socket is in the event loop and that the worker is reading from it. So I'm not sure how it could be the wrong fd, unless they are both in the event loop.
Redis-py is a blocking library, and kombu does not actually put the socket in nonblocking mode. I can imagine a situation where the event loop tells us that the socket is readable, then something else in redis-py reads from the socket, and then kombu starts a blocking read when there's nothing left to read.
I don't know how difficult it is to use a non-blocking connection with redis-py, but that could possibly solve some of these issues.
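For illustration, a socket timeout on the redis-py connection would at least turn a stuck blocking read into an exception instead of a silent hang (host and values are placeholders):

```python
# Illustrative only: with a socket timeout, a read that stalls raises
# redis.TimeoutError instead of blocking forever.  Host and values are
# placeholders.
import redis

r = redis.StrictRedis(
    host='redis-host',
    port=6379,
    socket_timeout=30,          # seconds before a stalled read fails
    socket_connect_timeout=5,   # seconds before a stalled connect fails
)
```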
mitsuhiko commented on Nov 2, 2015
As far as I remember it's stuck there, no other syscalls.
I am actually doing that in rb because of similar worries about things interfering. Look here: https://github.com/getsentry/rb/blob/fb9935fb750427338d90d46ca9af3c740360a31a/rb/clients.py#L31-L63
Jc2k commented on Nov 5, 2015
Think I'm seeing this too.
Again it's on AWS: Ubuntu trusty 14.04 for the worker, and ElastiCache Redis (2.8.19) for the queue/results.
When it happens the remote control interface is also taken out, so celery.control.inspect().ping() can't see the server (going to try and make use of that to have autoscaling remove instances that hang).
Not noticed any pattern to it yet.
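For illustration, such a liveness probe could look roughly like this (the app import, node name and timeout are placeholders):

```python
# health_check.py -- illustrative liveness probe; the app import, node name
# and timeout are placeholders.
from proj.celery import app

def worker_is_alive(nodename, timeout=10):
    # inspect().ping() returns a dict of replies keyed by node name
    # (e.g. 'celery@worker1'), or None if nobody answered in time.
    replies = app.control.inspect([nodename], timeout=timeout).ping()
    return bool(replies) and nodename in replies
```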
When it happens the worker processes will be receiving from a pipe (I'm assuming from the master process, and completely normal?). However the master process gets stuck in a recvfrom on the socket to the redis box.
Killing client 10.1.2.195:6379 (which does show up in CLIENT LIST as 'cmd': 'NULL') allowed it to start processing tasks again. There was one other client in that state, which I killed as well, but the remote control API is still not able to see this celery worker.
Version wise:
(Pre-fork as well, with 20 worker processes).
domenkozar commented on Nov 5, 2015
Oh, one more thing to check: the issue started to appear once redis more or less ran out of ram. Not sure if that was relevant or not.
Jc2k commented on Nov 6, 2015
For us, the redis monitoring says we've not dipped below 1.8GB of free memory.
mitsuhiko commented on Nov 6, 2015
I did not look at memory when playing with it, but I deployed it on a micro machine for testing, so I would not be surprised if it went low on memory.
jaysoffian commented on Nov 12, 2015
I'm seeing this and hopefully I can shed some light. I'm running an older version of Sentry that uses Celery 3.0. As well, I'm running a separate worker pool (unrelated to Sentry) under Celery 3.1. The different worker pools each use their own dedicated Redis instance as a broker. This setup has been rock stable for me in two separate datacenters for over a year and I've never seen any hung workers. (The Sentry pool runs 6x6 workers on two hosts, the other pool runs 6x6 workers on 4 hosts. By 6x6 I mean I'm running 6 independent worker processes, each with --concurrency=6, on each host.)
Recently, I've been working on setting up HA-Proxy as an outbound proxy on my worker nodes. So Celery/Kombu connects to localhost, which is HA-Proxy, then HA-Proxy connects on to Redis. (I'm using HA-Proxy to detect a master/slave Redis failover and reconnect to the new master.)
And all of a sudden I'm seeing these worker hangs, both with the Sentry Celery 3.0 workers and the other Celery 3.1 workers. Now here's the interesting part. I've configured HA-Proxy to drop idle TCP connections after 2 hours, and this is exactly the time frame in which I'm seeing the workers hang. This is surprising to me because I've been running Redis with a timeout of 2 hours since the beginning. The only difference I can think of is that, according to the Redis docs, it doesn't apply timeouts to pub/sub connections. So a pub/sub connection without any activity would be timed out by HA-Proxy, but not normally by Redis.
I suspect that HA-Proxy is dropping the pub/sub connection after 2 hours, but Kombu isn't noticing somehow? Either that or there's something different about how Redis drops an idle connection vs how HA-Proxy deals with it. What I'm really confused about is why there are idle connections in the first place and why dropping them would cause the Celery workers to hang. These are all very busy workers.
I'm going to try bumping the HA-Proxy timeout up to 24h and adding a socket_timeout to the BROKER_TRANSPORT_OPTIONS.
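i.e. something along these lines in the Celery config (the value is a guess, not a recommendation):

```python
# celeryconfig.py -- the timeout value is a guess, not a recommendation
BROKER_TRANSPORT_OPTIONS = {
    'socket_timeout': 30,  # seconds; let broker reads fail instead of hanging forever
}
```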
Versions:
Seeing the same problem with both pools after trying to use HA-Proxy. Some log entries:
And it all looks good again. So the worker pool hung after 2 hours, then spontaneously recovered 2 hours after that. Another worker, same thing:
One more thing I notice... the last task before the pool hangs is completing (as evidenced by the processed crash_id=bb6f805558bb47e3b05938aa9804cdec in 2.076387s message). But celery doesn't indicate the task is complete until after the 2 hour hang, and it says the task took two hours to complete.
mitsuhiko commented on Nov 12, 2015
I'm almost certain this is an issue with partially closed sockets. I have seen really weird things in the past with those on a variety of systems and as a result I typically run everything with TCP keepalives enabled. I will do another test to see if that changes anything.
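For illustration, keepalives can be enabled directly on the redis-py connection (values are placeholders, and this is not necessarily how my setup enables them):

```python
# Illustrative only: enable TCP keepalives on the redis-py connection so
# half-closed sockets are eventually detected.  Values are placeholders
# and the TCP_KEEP* options are Linux-specific.
import socket

import redis

r = redis.StrictRedis(
    host='redis-host',
    port=6379,
    socket_keepalive=True,
    socket_keepalive_options={
        socket.TCP_KEEPIDLE: 60,   # start probing after 60s of idle
        socket.TCP_KEEPINTVL: 10,  # probe every 10s after that
        socket.TCP_KEEPCNT: 6,     # give up after 6 failed probes
    },
)
```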