Closed
Description
Elasticsearch version: 5.2.1
Plugins installed:
- discovery-ec2
- repository-s3
- search-guard-5
- x-pack (xpack.security.enabled: false, xpack.monitoring.enabled: false, xpack.graph.enabled: false, xpack.watcher.enabled: false)
JVM version (`java -version`): 1.8.0_121
OS version (`uname -a` if on a Unix-like system): Linux ip-192-168-153-58 3.19.0-79-generic #87~14.04.1-Ubuntu SMP Wed Dec 21 18:12:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
I have no idea which query produced the error, but suddenly all the data nodes of the cluster (13 of them) got the same error:
```
[2017-05-08T07:27:45,329][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [192.168.153.58] fatal error in thread [elasticsearch[192.168.153.58][search][T#8]], exiting
java.lang.StackOverflowError: null
```
and then a seemingly infinite number of repetitions of the same line:
```
at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
... (the same frame repeated)
```
More details:
- running on AWS
- 3 dedicated master nodes
- Big docs (70 KB avg)
- Usage patterns: date histograms, terms and bool filters mostly
- All text fields indexed with `keyword` and `lowercase_normalizer` (see the mapping sketch below)
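A minimal sketch of what such a mapping might look like on 5.2; the index, type, and field names are placeholders, and the normalizer definition is an assumption based on its name, not taken from the report:

```sh
# Hypothetical mapping: keyword fields normalized with a custom lowercase normalizer.
curl -s -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": { "type": "custom", "filter": ["lowercase"] }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_field": { "type": "keyword", "normalizer": "lowercase_normalizer" }
      }
    }
  }
}'
```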
nik9000 commented on May 8, 2017
I'm assigning this to myself because I recognize the elements in the stack trace, but I'm not going to have a look until at least tomorrow morning.
@moshe could you post a gist of the entire stack overflow? It is usually useful to have the root.
moshe commented on May 9, 2017
Hi @nik9000, thanks for your quick response.
I didn't get a stack trace in the log file, just the error and then lots of the `isFinite` method calls.
nik9000 commented on May 9, 2017
So reproduced....
jimczi commented on May 11, 2017
@moshe this stack overflow can happen when searching with a very big regular expression or when using regexp syntax that can lead to an explosion of states. Are you using `query_string` or `regexp` queries? Are you searching regular expressions explicitly? For instance, a regexp with a very large bounded repetition fails with a stack overflow error (see the sketch below). We have a protection against the explosion of the number of states in a non-deterministic automaton, but for a deterministic one we don't check the size.
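For illustration only, a sketch of the kind of request that can hit this; the index and field names are placeholders, and the `t{1,9500}` pattern is borrowed from the discussion further down in this thread:

```sh
# Hypothetical example: a regexp with a very large bounded repetition builds a huge
# automaton, and the recursive isFinite check can blow the stack on the data node.
curl -s -XGET 'localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d '
{
  "query": {
    "regexp": {
      "my_field": "t{1,9500}"
    }
  }
}'
```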
This is a known issue in Lucene:
https://issues.apache.org/jira/browse/LUCENE-5659
... but the fix is controversial. Let's check first why you're hitting this.
xgwu commented on Jun 15, 2017
One of our production clusters experienced the same issue recently due to the abuse of regex/fuzzy queries by developers, but could it be better for Elasticsearch (or maybe Lucene) to set a limit like that on the non-deterministic automaton, such that the cluster won't be entirely brought down?
Thanks!
xgwu commented on Jun 15, 2017
After more in-depth investigation, our particular issue was found to be caused by a prefix query on a very long query string. Looking at the source code, I see PrefixQuery does not put a limit on maxDeterminizedStates when instantiating an AutomatonQuery.
Is it sensible to limit the prefix length and maxDeterminizedStates here?
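A sketch of the failure mode described above, with placeholder index and field names and an arbitrary 10,000-character prefix chosen only for illustration:

```sh
# Build a very long prefix (10000 'a' characters) and send it as a prefix query.
LONG_PREFIX=$(printf 'a%.0s' {1..10000})
curl -s -XGET 'localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d "
{
  \"query\": {
    \"prefix\": { \"my_field\": \"$LONG_PREFIX\" }
  }
}"
```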
jimczi commented on Jun 15, 2017
Thanks for the investigation @xgwu. The automaton is already determinized, so maxDeterminizedStates would not catch a very long string here. I think the fix is to limit the input on the ES side for queries that use an automaton. This would not catch a regex query like `t{1,9500}`, but I have the feeling that most of the problems described in this issue are just about very long strings not being filtered.
nik9000 commented on Jun 15, 2017
I still think the fix is to put protections in Lucene to prevent the stack overflow, or to rewrite the method so it isn't recursive. I don't think we can allow any query to shoot the node like this. We could catch the StackOverflowError and try to recover, but I believe those errors are like OOMs: it's hard to be sure that you've fully recovered from them. If we investigate that to the point where we are sure we have recovered, then we can just catch it and Lucene doesn't have to change.
hbrxa commented on Jun 30, 2017
We filter long strings and do not use regex queries like the ones mentioned above. We have no idea which query produces this error, but it kills Elasticsearch several times a day. Any clue what kind of query shoots the node? It would be great if this bug could be fixed as soon as possible.
clintongormley commented on Jun 30, 2017
@hbrxa it sounds like you are suffering from a different bug. Please open a new issue with all of the relevant details.
segalziv commented on Jul 23, 2017
+1
We had a similar issue (on v5.3.0), coming from Kibana users running a simple regex. The regex doesn't even have to actually run as a search to cause the StackOverflowError; just running `_validate` on such a query causes the same issue.
To reproduce it, just type this in Kibana's Discover search bar:
`/.{10000,}/`
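For completeness, a sketch of the same reproduction through the `_validate` API; the index name is a placeholder, and the body is roughly the kind of `query_string` query Kibana's search bar produces:

```sh
# Hypothetical reproduction via _validate: per the report above, the regex never
# needs to execute; parsing the query is enough to build the offending automaton.
curl -s -XGET 'localhost:9200/my_index/_validate/query' -H 'Content-Type: application/json' -d '
{
  "query": {
    "query_string": {
      "query": "/.{10000,}/"
    }
  }
}'
```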
jimczi commented on Aug 21, 2017
This issue has been fixed in Lucene:
https://issues.apache.org/jira/browse/LUCENE-7914
... and the fix will be available in ES 6.0.
The workaround for ES 5.x is to check the size of the prefix on the client side in order to prevent the stack overflow (something like a few hundred chars should be a good limit).
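A minimal sketch of that client-side guard in shell, assuming a placeholder index and field and an illustrative 200-character cap (not a value from this thread):

```sh
#!/usr/bin/env bash
# Refuse to send over-long prefix input to ES 5.x; the 200-char cap is illustrative.
MAX_PREFIX_LEN=200
PREFIX="$1"
if [ "${#PREFIX}" -gt "$MAX_PREFIX_LEN" ]; then
  echo "prefix too long (${#PREFIX} chars), refusing to query" >&2
  exit 1
fi
curl -s -XGET 'localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d "
{
  \"query\": { \"prefix\": { \"my_field\": \"$PREFIX\" } }
}"
```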