Runaway search thread #21611
Comments
This looks bad. When this happens, does the cluster eventually heal, or do you need to restart it to get back to normal? Given that we do not seem to leak search contexts (the number remains reasonable), it looks as if, once some threshold of search load is reached, things get exponentially worse. I'd be very curious what the search threads are busy doing when the problem occurs; would you be able to get the node hot threads while your queue is filling up? In #21612 you suspect the fact that we use the …
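For reference, a minimal sketch of pulling hot threads while the queue is filling up; the host, port and parameter values below are assumptions, not something prescribed in this thread.

```python
import requests

ES = "http://localhost:9200"  # hypothetical address

# Ask every node for its hottest threads, sampled over one second.
resp = requests.get(
    f"{ES}/_nodes/hot_threads",
    params={"threads": 3, "interval": "1s", "type": "cpu"},
    timeout=60,
)
print(resp.text)  # plain-text stack samples, one section per node
```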
@jpountz When this happens, I am not able to tell whether it would heal by itself, as this is a production cluster and there are constant queries from users. I had to restart the nodes in trouble to bring service back to normal ASAP. I also tried to get hot threads when this happened, but the API call did not seem to return anything. Some other API calls, such as node stats, take a very long time to respond. I'll try to get hot threads again when it recurs. I'll try …
The fact that the stats APIs take a long time to return is concerning too; these APIs are supposed to be cheap. Out of curiosity, is garbage collection activity still normal when your cluster is having trouble?
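One way to answer the GC question is to poll the node stats; a small sketch under the same host assumption as above, reading the standard JVM GC counters from the node stats response.

```python
import requests

ES = "http://localhost:9200"  # hypothetical address

stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
for node in stats["nodes"].values():
    gc = node["jvm"]["gc"]["collectors"]
    print(
        node["name"],
        "old gen collections:", gc["old"]["collection_count"],
        "old gen time (ms):", gc["old"]["collection_time_in_millis"],
    )
```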
Can you use jstack to capture a snapshot of the running threads?
@nik9000 I've already updated the indices to use niofs and tried some very expensive queries with high concurrency, and the result seems positive. The performance looks better, but I'll wait to see whether this solves the issue tomorrow during business hours. If the issue happens again, I'll capture a snapshot of the running threads with jstack as you suggested.
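For anyone reproducing this, a sketch of switching an existing index from mmapfs to niofs: `index.store.type` is a static setting, so the index has to be closed first. The index name and host below are placeholders.

```python
import requests

ES = "http://localhost:9200"  # hypothetical address
INDEX = "logs-2016.11.17"     # hypothetical index name

requests.post(f"{ES}/{INDEX}/_close")
requests.put(f"{ES}/{INDEX}/_settings", json={"index.store.type": "niofs"})
requests.post(f"{ES}/{INDEX}/_open")
```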
It'd probably be useful to have a before/after as well. I'd take a snapshot when you next have a chance, just to have a picture of it, and also so you are practiced at whatever hoops you have to go through to do it. I know that when things are going sideways I don't want to be trying new things.
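A rough sketch of taking periodic jstack snapshots so that before/after dumps are ready when trouble starts. Discovering the PID via `jps`, the snapshot count, and the output directory are all assumptions.

```python
import datetime
import pathlib
import subprocess
import time

OUT = pathlib.Path("/tmp/es-thread-dumps")  # hypothetical output directory
OUT.mkdir(parents=True, exist_ok=True)

# Find the Elasticsearch JVM pid (assumes a single ES process and `jps` on PATH).
jps = subprocess.run(["jps"], capture_output=True, text=True).stdout
pid = next(line.split()[0] for line in jps.splitlines() if "Elasticsearch" in line)

for _ in range(6):  # six snapshots, one per minute
    stamp = datetime.datetime.now().strftime("%H-%M")
    dump = subprocess.run(["jstack", pid], capture_output=True, text=True).stdout
    (OUT / f"jstack-{stamp}.txt").write_text(dump)
    time.sleep(60)
```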
I'll closely monitor the cluster status today during business peak time to see whether the change makes a big difference. If the issue doesn't recur, I'll arrange some time to revert the index store back to mmapfs and stress test to trigger the problem, so I can capture thread info before/after.
Since I changed the index store to niofs last Thursday night, the cluster survived the Friday business peak. I managed to find some time today (Sunday) to carry out some tests.

First I changed the index store of one type of big index back to the default, i.e. mmapfs, and sent some queries against 3 days of indices using Kibana. All the queries had a timeout of 35 seconds set. Not long after I sent these queries, a few nodes appeared to be struggling with search again, and it took quite a long time for the active threads to drop. I then changed the store to niofs and ran more expensive queries to compare the thread stats. The screenshot below shows the result (stats from all nodes were stacked). The first batch of queries was sent at 13:00 and it took a long time for active threads to drop. Several search threads had still not terminated after an hour, so I had to restart some nodes. The second batch was sent at about 14:40, with more days of indices included in the request. This time search active peaked only for a short time, and search completed was much higher than in the previous test.

For the first test, I captured more details on a particular node in trouble. Here's what its thread pool stats looked like: initially, open contexts stayed at the same number as active threads. When I reverted the index settings by closing the index at 14:08, I noticed that open contexts dropped to 0 but active threads remained at 2 (the time difference may not be that obvious in the screenshot I attached). I am not sure whether it would eventually have dropped to 0, but I was too impatient to wait, so I restarted the node at 14:18, after which the last 2 active threads were gone.

Following @nik9000's advice, I captured thread dumps with jstack at different points in time (suffixed with HH:MM) on the node in trouble. Please refer to the attached files below. At 14:13, one search thread appears to be building global ordinals.
@xgwu Can you clarify which time range is using the default and which time range is using niofs?
Also, how many nodes, how much data (in GB) and how many shards do you have in total? And how much RAM do you give to the JVM and to the filesystem cache on each node?
It was mmapfs between 13:00 and 14:15 and niofs since 14:30. We have 120 data nodes, 17,000 shards, 800 billion docs and 900TB of data in total, including one replica. 60 of the 120 nodes are backup nodes that take no writes. They are hosted on physical servers with 32 cores, 128GB RAM and 12 × 4TB spinning disks in RAID 0. Each physical host runs 3 ES instances with 31GB heap each, i.e. 35GB left for the filesystem cache. The other nodes are mainly for writes and run only 1 instance per host. They only hold …
Sorry, I closed the issue by mistake. Reopening...
Write nodes only hold one day's worth of data. The problem tends to happen only on some of the read-only backup nodes.
To clarify: when you say "node", is that an Elasticsearch process or a physical machine?

The most immediate issue here is the sizing of the filesystem cache. Out of 128GB per physical server, 93GB go to the JVM and only 35GB go to the filesystem cache. There should be at least 64GB for the filesystem cache, probably even more. Does your JVM make use of all the available memory? If yes, do you know why? For instance, are you using the completion suggester? You might want to give less memory to your JVMs and/or start fewer nodes per physical machine in order to give more memory to the filesystem cache. Note that decreasing the amount of memory per JVM also has the side benefit of making garbage collection more efficient, since smaller heaps can be collected more quickly.

The jstack traces show that an …

Since your nodes spend a lot of time building global ordinals, you might also want to look into increasing the refresh_interval …
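To make the trade-off concrete, a back-of-the-envelope sketch using only the numbers already mentioned in this thread (128GB of RAM per physical server, 31GB heap per instance):

```python
TOTAL_RAM_GB = 128
HEAP_PER_INSTANCE_GB = 31

for instances_per_host in (3, 2, 1):
    heap = instances_per_host * HEAP_PER_INSTANCE_GB
    fs_cache = TOTAL_RAM_GB - heap
    print(f"{instances_per_host} instance(s): {heap}GB heap, ~{fs_cache}GB left for the filesystem cache")

# 3 instance(s): 93GB heap, ~35GB left for the filesystem cache
# 2 instance(s): 62GB heap, ~66GB left for the filesystem cache
# 1 instance(s): 31GB heap, ~97GB left for the filesystem cache
```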
@jpountz Yes, a node here means an Elasticsearch process. One more point to clarify: this issue also happened on a write node that runs only one Elasticsearch process with a 31GB JVM on a physical machine; in that case, 97GB of RAM is available for the filesystem cache. But it only happened once, compared to several times a day on those cold data nodes.

This cluster is for application/business log search and analysis. The ops/business teams require a fairly long history of data, for example the last 3 months, to be immediately available for search whenever they need it. Keeping such huge cold indices open in Elasticsearch requires a lot of JVM space for segment memory. In the 2.4 era, I found that putting 3 instances on 1 physical box was a good balance for this kind of use case, as it allows us to keep about 30TB of indices open on a single physical machine, and the search performance was acceptable even with only 35GB of memory left for the filesystem cache. And you are right that we force merge each shard to 1 segment every night on these cold data nodes to reduce the segment memory footprint.

The refresh_interval is already 60s globally on our cluster and has worked well for a long time. I am not sure whether setting …

We don't use the completion suggester on this cluster. The most used features are queries, filters, plus terms, date histogram and percentiles aggregations. For some terms aggregations the cardinality can be very high, such as the client IP field or the URL stem field.

Per your comment, it feels like when reading every part of an index via …

For now I'll stick to niofs, as it looks performant enough in our situation and I am not seeing long-running search threads anymore.
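For reference, the nightly force merge described above could look roughly like this; the index name and host are placeholders, and the scheduling itself is outside this sketch.

```python
import requests

ES = "http://localhost:9200"  # hypothetical address
INDEX = "logs-2016.11.16"     # hypothetical cold (read-only) index

# Merge each shard down to a single segment to shrink segment memory on cold nodes.
requests.post(
    f"{ES}/{INDEX}/_forcemerge",
    params={"max_num_segments": 1},
    timeout=None,  # a force merge can run for a long time
)
```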
What made me think of it is that the jstack traces show that some search threads are building global ordinals, which would never happen if global ordinals were built at refresh time.
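Building global ordinals at refresh time corresponds to the eager_global_ordinals mapping option; a sketch of enabling it for a high-cardinality keyword field (the index, mapping type and field names are hypothetical):

```python
import requests

ES = "http://localhost:9200"  # hypothetical address

requests.put(
    f"{ES}/logs-2016.11.17/_mapping/doc",   # hypothetical index and mapping type
    json={
        "properties": {
            "client_ip": {                   # hypothetical high-cardinality field
                "type": "keyword",
                "eager_global_ordinals": True,
            }
        }
    },
)
```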
The way the timeout is implemented, Elasticsearch has checkpoints during the execution of a search request at which it checks the timeout. Unfortunately, some parts of query execution do not check the timeout while they are running, such as rewriting multi-term queries (e.g. range, wildcard or fuzzy queries) or building global ordinals.
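To illustrate, a sketch of the kind of time-boxed search in question; the index, field names and query are made up, and the point is only that the 35s timeout is enforced at those checkpoints, so phases like query rewriting or global-ordinal building can still overshoot it.

```python
import requests

ES = "http://localhost:9200"  # hypothetical address

resp = requests.post(
    f"{ES}/logs-*/_search",
    json={
        "timeout": "35s",  # checked at checkpoints, not continuously
        "query": {"wildcard": {"url_stem": "*checkout*"}},                   # hypothetical expensive query
        "aggs": {"top_ips": {"terms": {"field": "client_ip", "size": 100}}},
    },
)
print(resp.json().get("timed_out"))  # True if a checkpoint saw the timeout expire
```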
This sounds like a good idea indeed, since niofs seems to perform better in your case.
Since it looks like using …
Elasticsearch version:
5.0.0
Plugins installed: []
None
JVM version:
1.8.0_77
OS version:
CentOS release 6.4 (Final)
Linux 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
On some data nodes under expensive queries, search threads can stay active for hours even with the search timeout set to 35 seconds. During that period, disk IO peaks on reads, and search requests pile up in the queue, struggling to get served by the active threads.
Steps to reproduce:
This happens rather frequently, on a daily basis, when expensive search loads hit the cluster. Back on v2.4.0, this never happened.
Provide logs (if relevant):
[Screenshots from the monitoring system attached]
While the issue is occurring, I can see from our monitoring system that the number of active search threads peaks at the max limit even though the number of open search contexts has dropped to just a few, and the search completed count barely moves forward. The task management API shows active queries running for a very long time.
Compared with similar stats from weeks ago, before upgrading to 5.0.0, the number of active search threads was at all times roughly in line with the number of open search contexts.
It looks to me like search threads may not be freed even after the search context has been terminated.
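A sketch of how the active-threads vs open-contexts comparison described above could be polled from the APIs; the host and the output formatting are assumptions.

```python
import requests

ES = "http://localhost:9200"  # hypothetical address

# Per-node search thread pool counters.
threads = requests.get(
    f"{ES}/_cat/thread_pool/search",
    params={"format": "json", "h": "node_name,active,queue,completed"},
).json()

# Per-node open search contexts from the node stats.
stats = requests.get(f"{ES}/_nodes/stats/indices/search").json()
open_contexts = {
    n["name"]: n["indices"]["search"]["open_contexts"]
    for n in stats["nodes"].values()
}

for row in threads:
    print(row["node_name"], "active:", row["active"], "queue:", row["queue"],
          "open_contexts:", open_contexts.get(row["node_name"]))
```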