java.lang.StackOverflowError for the entire cluster #24553

Closed
@moshe

Description

Elasticsearch version:
5.2.1

Plugins installed:
discovery-ec2
repository-s3
search-guard-5
x-pack (xpack.security.enabled: false xpack.monitoring.enabled: false xpack.graph.enabled: false xpack.watcher.enabled: false)

JVM version (java -version):
1.8.0_121

OS version (uname -a if on a Unix-like system):
Linux ip-192-168-153-58 3.19.0-79-generic #87~14.04.1-Ubuntu SMP Wed Dec 21 18:12:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I have no idea which query produced the error, but suddenly all 13 data nodes of the cluster hit the same error:

[2017-05-08T07:27:45,329][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [192.168.153.58] fatal error in thread [elasticsearch[192.168.153.58][search][T#8]], exiting
java.lang.StackOverflowError: null

followed by a seemingly endless run of the same line:

        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1051) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]

More details:

  • running on AWS
  • 3 dedicated master nodes
  • Big docs (70Kb avg)
  • Usage patterns: Date histograms, Terms and bool filters mostly
  • All text fields indexed with keyword and lowercase_normalizer

Activity

nik9000 self-assigned this on May 8, 2017

nik9000 commented on May 8, 2017
Member

I'm assigning this to myself because I recognize the elements in the stacktrace, but I'm not going to have a look until at least tomorrow morning.

@moshe could you post a gist of the entire stack overflow? It is usually useful to have the root.

moshe commented on May 9, 2017
Author

Hi @nik9000, thanks for your quick response.
I didn't get a full stacktrace in the log file, just the error followed by a long run of isFinite frames.

nik9000 commented on May 9, 2017
Member

  // TODO: not great that this is recursive... in theory a
  // large automata could exceed java's stack

nik9000 commented on May 9, 2017
Member

So reproduced....

assigned and unassigned on May 9, 2017
added the :Search/Search label (Search-related issues that do not fall into other categories) on May 10, 2017

jimczi commented on May 11, 2017
Contributor

@moshe this stack overflow can happen when searching with a very large regular expression, or with a regexp syntax that leads to an explosion of states. Are you using query_string or regexp queries? Are you searching with regular expressions explicitly?
For instance the following query:

POST /test/_search
{
  "query": {
    "regexp": {
      "test": "t{1,9500}"
    }
  }
}

fails with a stack overflow error. We have a protection against an explosion in the number of states of a non-deterministic automaton, but we don't check the size of a deterministic one.
This is a known issue in Lucene:
https://issues.apache.org/jira/browse/LUCENE-5659
... but the fix is controversial. Let's check first why you're hitting this.
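
To illustrate why a deterministic automaton can still be a problem, here is a toy model (not Lucene code). It assumes that the DFA for t{1,N} is essentially a chain with one state per repetition, and it uses a hypothetical recursive walk standing in for the recursive isFinite seen in the stack trace. The point is only that the call depth grows linearly with the number of states:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code): a chain-shaped DFA and a recursive walk over it. */
final class ChainDepthDemo {

    /**
     * Builds the transition lists of a chain DFA for t{1,max}:
     * state i has a single transition to state i + 1, and the last state has none.
     */
    static List<List<Integer>> chain(int max) {
        List<List<Integer>> transitions = new ArrayList<>();
        for (int state = 0; state <= max; state++) {
            List<Integer> out = new ArrayList<>();
            if (state < max) {
                out.add(state + 1);
            }
            transitions.add(out);
        }
        return transitions;
    }

    /** Recursive depth-first walk; returns the deepest call depth reached. */
    static int maxRecursionDepth(List<List<Integer>> transitions, int state, int depth) {
        int deepest = depth;
        for (int next : transitions.get(state)) {
            deepest = Math.max(deepest, maxRecursionDepth(transitions, next, depth + 1));
        }
        return deepest;
    }

    public static void main(String[] args) {
        int max = 1000; // kept small on purpose so the demo itself cannot overflow
        int depth = maxRecursionDepth(chain(max), 0, 1);
        System.out.println("states=" + (max + 1) + " recursion depth=" + depth);
    }
}
```

In this model one stack frame is needed per state, so a repetition bound of several thousand translates directly into several thousand nested calls on the search thread.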

xgwu commented on Jun 15, 2017

One of our production clusters hit the same issue recently because developers abused regex/fuzzy queries. Could Elasticsearch (or maybe Lucene) enforce a limit like the one on non-deterministic automata, so that a single query can't bring down the entire cluster?

Thanks!

xgwu commented on Jun 15, 2017

After a more in-depth investigation, we found that our particular issue was caused by a prefix query on a very long query string. Looking at the source code, I see that PrefixQuery does not put a limit on maxDeterminizedStates when it instantiates an AutomatonQuery.

public class PrefixQuery extends AutomatonQuery {

  /** Constructs a query for terms starting with <code>prefix</code>. */
  public PrefixQuery(Term prefix) {
    // It's OK to pass unlimited maxDeterminizedStates: the automaton is born small and determinized:
    super(prefix, toAutomaton(prefix.bytes()), Integer.MAX_VALUE, true);
    if (prefix == null) {
      throw new NullPointerException("prefix must not be null");
    }
  }

Is it sensible to limit prefix length & maxDeterminizedStates here?

jimczi commented on Jun 15, 2017
Contributor

Thanks for the investigation @xgwu. The automaton is already determinized, so maxDeterminizedStates would not catch very long strings here. I think the fix is to limit the input on the ES side for queries that use an automaton. This would not catch a regex query like t{1,9500}, but I have the feeling that most of the problems described in this issue are just about very long strings that are not filtered.

nik9000 commented on Jun 15, 2017
Member

I still think the fix is to put protections in Lucene to prevent the stackoverflow or to rewrite the method so it isn't recursive. I don't think we can allow any queries to shoot the node like this. We could catch the StackOverflow and try to recover but I believe those errors are like OOM, hard to be sure that you've fully recovered from it. If we investigate that to the point where we are sure we have recovered from it then we can just catch it and Lucene doesn't have to change.
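
As a rough illustration of the "rewrite the method so it isn't recursive" option, the sketch below checks a transition graph for cycles with an explicit stack instead of the call stack. It is not the Lucene implementation or the eventual fix. The class and method names are hypothetical, and it assumes the automaton has no dead states, so that "no cycle reachable from state 0" stands in for "the language is finite":

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch only: iterative cycle detection over a plain transition table. */
final class IterativeFiniteCheck {

    /**
     * Returns true if no cycle is reachable from state 0.
     * transitions[s] lists the target states of state s.
     */
    static boolean isAcyclic(int[][] transitions) {
        int n = transitions.length;
        if (n == 0) {
            return true;
        }
        byte[] color = new byte[n];  // 0 = unvisited, 1 = on current path, 2 = finished
        int[] nextEdge = new int[n]; // index of the next outgoing edge to explore per state
        Deque<Integer> path = new ArrayDeque<>();
        path.push(0);
        color[0] = 1;
        while (!path.isEmpty()) {
            int state = path.peek();
            if (nextEdge[state] < transitions[state].length) {
                int target = transitions[state][nextEdge[state]++];
                if (color[target] == 1) {
                    return false; // edge back into the current path: a cycle
                }
                if (color[target] == 0) {
                    color[target] = 1;
                    path.push(target);
                }
            } else {
                color[state] = 2; // all edges explored, leave the current path
                path.pop();
            }
        }
        return true;
    }

    /** Builds a chain of the given number of states: 0 -> 1 -> ... -> states - 1. */
    static int[][] chain(int states) {
        int[][] t = new int[states][];
        for (int s = 0; s < states; s++) {
            t[s] = (s + 1 < states) ? new int[] {s + 1} : new int[0];
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(isAcyclic(chain(200_000))); // long chain, no cycle
        int[][] loop = { {1}, {0} };
        System.out.println(isAcyclic(loop));           // two states pointing at each other
    }
}
```

The depth of the traversal is now bounded by heap memory rather than by the thread's stack size, so a very long chain costs memory but does not raise a StackOverflowError.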

hbrxa commented on Jun 30, 2017

We filter long strings and do not use regex queries like mentioned above. We have no idea which query produces this error but it kills elastic several times a day. Any clue what kind of query shoots the node? Would be great if this bug could be fixed as soon as possible.

clintongormley commented on Jun 30, 2017
Contributor

@hbrxa it sounds like you are suffering from a different bug. Please open a new issue with all of the relevant details.

segalziv commented on Jul 23, 2017

+1

We had a similar issue (on v5.3.0), coming from Kibana users running a simple regex. The regex doesn't even have to run as a search to cause the StackOverflowError: just running _validate on such a query causes the same issue.

To reproduce it, just type the following in Kibana's Discover:

/.{10000,}/

jimczi commented on Aug 21, 2017
Contributor

This issue has been fixed in Lucene:
https://issues.apache.org/jira/browse/LUCENE-7914
... and will be available in es 6.0
The workaround for es 5.x is to check the size of the prefix on the client side in order to prevent the stack overflow (a limit of a few hundred characters should be a good choice).
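
A minimal sketch of such a client-side check, under stated assumptions: the class name, the method name and the limit of 256 are hypothetical (the thread only suggests "a few hundred"), and the length is measured in UTF-8 bytes on the assumption that the prefix automaton is built from the term's bytes:

```java
import java.nio.charset.StandardCharsets;

/** Sketch only: reject overly long prefixes before they reach a prefix query. */
final class PrefixGuard {

    /** Hypothetical limit; tune it for your own data. */
    static final int MAX_PREFIX_BYTES = 256;

    /** Returns the prefix unchanged if it is short enough, otherwise throws. */
    static String requireShortPrefix(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        int length = prefix.getBytes(StandardCharsets.UTF_8).length;
        if (length > MAX_PREFIX_BYTES) {
            throw new IllegalArgumentException(
                "prefix is " + length + " bytes, limit is " + MAX_PREFIX_BYTES);
        }
        return prefix;
    }

    public static void main(String[] args) {
        System.out.println(requireShortPrefix("elastic"));

        StringBuilder tooLong = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            tooLong.append('a');
        }
        try {
            requireShortPrefix(tooLong.toString());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Calling the guard before building the query body keeps the oversized input from ever reaching the cluster. It does not cover regexp queries such as t{1,9500}, which need a separate check.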

Labels: :Search/Search (Search-related issues that do not fall into other categories), >bug

Participants: @clintongormley, @nik9000, @colings86, @moshe, @segalziv