java.lang.StackOverflowError for the entire cluster #24553

Closed
moshe opened this issue May 8, 2017 · 13 comments
Labels
>bug :Search/Search Search-related issues that do not fall into other categories

Comments

@moshe

moshe commented May 8, 2017

Elasticsearch version:
5.2.1

Plugins installed:
discovery-ec2
repository-s3
search-guard-5
x-pack (xpack.security.enabled: false xpack.monitoring.enabled: false xpack.graph.enabled: false xpack.watcher.enabled: false)

JVM version (java -version):
1.8.0_121

OS version (uname -a if on a Unix-like system):
Linux ip-192-168-153-58 3.19.0-79-generic #87~14.04.1-Ubuntu SMP Wed Dec 21 18:12:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I have no idea which query produced the error, but suddenly all 13 data nodes of the cluster hit the same error:

[2017-05-08T07:27:45,329][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [192.168.153.58] fatal error in thread [elasticsearch[192.168.153.58][search][T#8]], exiting
java.lang.StackOverflowError: null

followed by a seemingly endless run of the same line:

        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]

More details:

  • running on AWS
  • 3 dedicated master nodes
  • Big docs (70 KB avg)
  • Usage patterns: Date histograms, Terms and bool filters mostly
  • All text fields indexed with keyword and lowercase_normalizer
@nik9000 nik9000 self-assigned this May 8, 2017
@nik9000
Member

nik9000 commented May 8, 2017

I'm assigning this to myself because I recognize the elements in the stack trace, but I'm not going to have a look until at least tomorrow morning.

@moshe could you post a gist of the entire stack overflow? It is usually useful to have the root.

@moshe
Author

moshe commented May 9, 2017

Hi @nik9000, thanks for your quick response.
I didn't get a stack trace in the log file, just the error followed by lots of isFinite method call lines.

@nik9000
Member

nik9000 commented May 9, 2017

  // TODO: not great that this is recursive... in theory a
  // large automata could exceed java's stack

@nik9000
Member

nik9000 commented May 9, 2017

So reproduced....

@nik9000 nik9000 assigned jimczi and unassigned nik9000 May 9, 2017
@cbuescher cbuescher added the :Search/Search Search-related issues that do not fall into other categories label May 10, 2017
@jimczi
Contributor

jimczi commented May 11, 2017

@moshe this stack overflow can happen when searching with a very big regular expression or when using a regexp syntax that can lead to an explosion of states. Are you using query_string or regexp queries? Are you searching regular expressions explicitly?
For instance the following query:

POST /test/_search
{
  "query": {
    "regexp": {
      "test": "t{1,9500}"
    }
  }
}

fails with a stack overflow error. We have a protection against the explosion of the number of states in a non-deterministic automaton, but for a deterministic one we don't check the size.
This is a known issue in Lucene:
https://issues.apache.org/jira/browse/LUCENE-5659
... but the fix is controversial. Let's check first why you're hitting this.

@xgwu

xgwu commented Jun 15, 2017

One of our production clusters experienced the same issue recently due to developers abusing regex/fuzzy queries. Could Elasticsearch (or maybe Lucene) set a limit, like the one on non-deterministic automata, so that the whole cluster isn't brought down?

Thanks!

@xgwu

xgwu commented Jun 15, 2017

After more in-depth investigation, our particular issue turned out to be caused by a prefix query on a very long query string. Looking at the source code, I see that PrefixQuery does not put a limit on maxDeterminizedStates when instantiating the AutomatonQuery.

public class PrefixQuery extends AutomatonQuery {

  /** Constructs a query for terms starting with <code>prefix</code>. */
  public PrefixQuery(Term prefix) {
    // It's OK to pass unlimited maxDeterminizedStates: the automaton is born small and determinized:
    super(prefix, toAutomaton(prefix.bytes()), Integer.MAX_VALUE, true);
    if (prefix == null) {
      throw new NullPointerException("prefix must not be null");
    }
  }

Is it sensible to limit prefix length & maxDeterminizedStates here?

@jimczi
Contributor

jimczi commented Jun 15, 2017

Thanks for the investigation @xgwu. The automaton is already determinized, so maxDeterminizedStates would not catch a very long string here. I think the fix is to limit the input on the ES side for queries that use an automaton. This would not catch a regex query like t{1,9500}, but I have the feeling that most of the problems described in this issue are just about very long strings that are not filtered.

@nik9000
Member

nik9000 commented Jun 15, 2017

I still think the fix is to put protections in Lucene to prevent the stack overflow or to rewrite the method so it isn't recursive. I don't think we can allow any query to shoot the node like this. We could catch the StackOverflowError and try to recover, but I believe those errors are like OOM: it is hard to be sure that you've fully recovered from them. If we investigate that to the point where we are sure we have recovered from it, then we can just catch it and Lucene doesn't have to change.
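
For illustration, the "rewrite the method so it isn't recursive" option could look roughly like the sketch below. This is not Lucene's implementation. It is a toy cycle check over an automaton modelled as a plain adjacency list, and the names `IterativeFiniteCheck`, `isFinite(int[][])` and `chain` are hypothetical. It also assumes the automaton has no dead states, so any cycle reachable from state 0 means the language is infinite. The point is only that the traversal depth lives on the heap, so a very long chain of states cannot exhaust the thread stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class IterativeFiniteCheck {

    /**
     * Returns true if no cycle is reachable from state 0.
     * next[s] lists the target states of s. Assumes no dead states.
     */
    static boolean isFinite(int[][] next) {
        int n = next.length;
        if (n == 0) {
            return true;
        }
        byte[] color = new byte[n]; // 0 = unvisited, 1 = on current path, 2 = finished
        int[] cursor = new int[n];  // index of the next outgoing edge to explore per state
        Deque<Integer> stack = new ArrayDeque<>(); // explicit stack replaces recursion
        stack.push(0);
        color[0] = 1;
        while (!stack.isEmpty()) {
            int state = stack.peek();
            if (cursor[state] < next[state].length) {
                int target = next[state][cursor[state]++];
                if (color[target] == 1) {
                    return false; // edge back into the current path: a cycle
                }
                if (color[target] == 0) {
                    color[target] = 1;
                    stack.push(target);
                }
            } else {
                color[state] = 2; // all edges explored, state leaves the current path
                stack.pop();
            }
        }
        return true;
    }

    /** Builds a linear chain 0 -> 1 -> ... -> n-1, similar in shape to a long literal prefix. */
    static int[][] chain(int n) {
        int[][] next = new int[n][];
        for (int i = 0; i < n; i++) {
            next[i] = (i + 1 < n) ? new int[] {i + 1} : new int[0];
        }
        return next;
    }
}
```

With this shape, `IterativeFiniteCheck.isFinite(IterativeFiniteCheck.chain(200_000))` walks a 200,000-state chain using heap memory only, where a naive recursive walk would need one stack frame per state.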

@hbrxa

hbrxa commented Jun 30, 2017

We filter long strings and do not use regex queries like mentioned above. We have no idea which query produces this error but it kills elastic several times a day. Any clue what kind of query shoots the node? Would be great if this bug could be fixed as soon as possible.

@clintongormley

@hbrxa it sounds like you are suffering from a different bug. Please open a new issue with all of the relevant details.

@segalziv

segalziv commented Jul 23, 2017

+1

We had a similar issue (on v5.3.0), coming from Kibana users running a simple regex. The regex doesn't even have to actually run as a search to cause the StackOverflowError: just running _validate on such a query causes the same issue.

To reproduce it, just type in Kibana's Discover:

/.{10000,}/
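
Patterns such as `t{1,9500}` and `.{10000,}` are short strings, so a length limit alone would not stop them. A rough client-side heuristic, sketched below, is to reject regexps whose bounded repetition counts are very large before they reach the cluster. Everything here is an assumption for illustration: the class name `RegexpRepetitionGuard`, the `isSafe` method, and the `MAX_REPEAT` value of 1000 are hypothetical and not part of Elasticsearch or Lucene. The scan is deliberately naive: it does not understand escaped braces, character classes, or quoted sections, and it does not account for nested repetitions that multiply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexpRepetitionGuard {

    // Hypothetical limit; pick a value that suits your own cluster.
    static final int MAX_REPEAT = 1000;

    // Matches bounded repetitions of the form {n}, {n,} and {n,m}.
    private static final Pattern REPEAT = Pattern.compile("\\{(\\d+)(?:,(\\d*))?\\}");

    /** Returns false if any repetition bound in the regexp exceeds MAX_REPEAT. */
    static boolean isSafe(String regexp) {
        Matcher m = REPEAT.matcher(regexp);
        while (m.find()) {
            String min = m.group(1);
            String max = m.group(2); // null for {n}, empty for {n,}
            if (exceeds(min)) {
                return false;
            }
            if (max != null && !max.isEmpty() && exceeds(max)) {
                return false;
            }
        }
        return true;
    }

    private static boolean exceeds(String digits) {
        // Runs longer than 9 digits could overflow Integer.parseInt, so treat them as too large.
        return digits.length() > 9 || Integer.parseInt(digits) > MAX_REPEAT;
    }
}
```

A caller would check `RegexpRepetitionGuard.isSafe(userInput)` before building a regexp or query_string query and return a validation error to the user otherwise.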

@jimczi
Contributor

jimczi commented Aug 21, 2017

This issue has been fixed in Lucene:
https://issues.apache.org/jira/browse/LUCENE-7914
... and will be available in ES 6.0.
The workaround for ES 5.x is to check the size of the prefix on the client side in order to prevent the stack overflow (a limit of a few hundred characters should be good).
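
A minimal sketch of that client-side check follows. The class name `PrefixLengthGuard`, the method `checkedPrefix`, and the limit of 256 are hypothetical and chosen only to match the "few hundred" guidance above. Length is measured in UTF-8 bytes rather than characters, on the assumption that this tracks the automaton size more closely, since the PrefixQuery constructor quoted earlier builds its automaton from `prefix.bytes()`.

```java
import java.nio.charset.StandardCharsets;

class PrefixLengthGuard {

    // Hypothetical limit, in UTF-8 bytes.
    static final int MAX_PREFIX_BYTES = 256;

    /** Returns the prefix unchanged, or throws if it is null or too long. */
    static String checkedPrefix(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        int length = prefix.getBytes(StandardCharsets.UTF_8).length;
        if (length > MAX_PREFIX_BYTES) {
            throw new IllegalArgumentException(
                "prefix is " + length + " bytes, limit is " + MAX_PREFIX_BYTES);
        }
        return prefix;
    }
}
```

The application would call `PrefixLengthGuard.checkedPrefix(userInput)` before building a prefix query and turn the exception into a validation error for the user.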
