Description
Elasticsearch version: 5.0.0
Plugins installed: none
JVM version: 1.8.0_77-b03
OS version: CentOS release 6.4 (Final)
Description of the problem including expected versus actual behavior:
One of our data nodes suffered from high heap usage last night, and old GC was not able to reclaim any heap space. At the time, bulk and query load was light and all thread pools were fairly idle. The node is one of 120 data nodes in a cluster used for log analysis. Every night a maintenance job deletes/force-merges cold data and creates indices/aliases for the new day.
The node is configured with 31GB of heap and holds about 450 shards with 2k-2.5k segments. Over the past week the segment count and segment memory remained constant or even dropped, yet heap usage kept crawling up until I restarted the node last night. I looked at all memory-related stats in our monitoring systems and could not find the culprit for the increasing heap usage.
Before restarting the node, I took a heap dump and analyzed it with MAT. The huge number of org.elasticsearch.cluster.metadata.AliasOrIndex$Alias objects looks suspicious: they retained nearly 7GB of memory.
We use aliases intensively; there are about 40k aliases in total across the whole cluster. After the node recovered, another heap dump was taken. This time the number of org.elasticsearch.cluster.metadata.AliasOrIndex$Alias objects had dropped to 673,427 instances, retaining only 16MB of memory.
Does this suggest a memory leak in the alias metadata?
ywelsch commented on Dec 7, 2016
xgwu commented on Dec 7, 2016
@ywelsch
There are currently about 40,000 aliases created in the cluster.
Below is a screenshot of the expanded QueryShardContext.
I am willing to share the heap dump but it's 30GB in size. It would be hard for me to upload it to S3 considering I'm located in China. :(
ywelsch commented on Dec 7, 2016
If I see this correctly, the indices request cache (IndicesRequestCache) is holding onto a search context, which in turn holds onto the cluster state from when the request was started. This has been fixed in 5.0.1, see here: #21284
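To make that retention chain concrete, here is a minimal, self-contained Java sketch of the pattern being described. The class names (ClusterStateSnapshot, RequestContext, RequestCacheRetentionSketch) are hypothetical stand-ins for the real cluster state, QueryShardContext, and IndicesRequestCache, not Elasticsearch code; the sketch only illustrates how long-lived cache entries that capture a per-request context can pin every old cluster-state snapshot, and with it all of its alias metadata, until the entries are evicted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch (hypothetical names) of the retention pattern described above:
 * a long-lived request cache whose entries capture a per-request context, which
 * in turn references the full cluster-state snapshot that was current when the
 * request ran. As long as a cache entry lives, the snapshot it references
 * (including all alias metadata) stays reachable and cannot be collected.
 */
public class RequestCacheRetentionSketch {

    // Stand-in for a cluster-state snapshot: with ~40k aliases this object
    // graph is large, and a fresh copy is published on every cluster change.
    static class ClusterStateSnapshot {
        final byte[] aliasMetadata = new byte[1024 * 1024]; // 1 MB toy payload
    }

    // Stand-in for a per-request context (e.g. a shard/query context) that
    // keeps a reference to the snapshot it was built from.
    static class RequestContext {
        final ClusterStateSnapshot state;
        RequestContext(ClusterStateSnapshot state) {
            this.state = state;
        }
    }

    // Long-lived cache: the problem is that the cached value transitively
    // holds the RequestContext and therefore the old snapshot.
    static final Map<String, RequestContext> REQUEST_CACHE = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            // Each "cluster state update" produces a fresh snapshot ...
            ClusterStateSnapshot snapshot = new ClusterStateSnapshot();
            // ... and caching a context built from it pins that snapshot.
            REQUEST_CACHE.put("request-" + i, new RequestContext(snapshot));
        }
        // The heap now retains 100 independent snapshots (~100 MB in this toy
        // model), mirroring how many duplicated alias graphs can accumulate.
        System.out.println("cached entries: " + REQUEST_CACHE.size());
    }
}
```

Under this reading, the number of retained AliasOrIndex$Alias instances tracks how many old cluster states are pinned by cached entries, which would explain the sharp drop observed after the node was restarted.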