Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange IllegalReferenceCountException #4435

Closed
doom369 opened this issue Nov 3, 2015 · 30 comments
Closed

Strange IllegalReferenceCountException #4435

doom369 opened this issue Nov 3, 2015 · 30 comments
Assignees
Labels
Milestone

Comments

@doom369
Copy link
Contributor

doom369 commented Nov 3, 2015

Hello. 4.0.32.Final, Ubuntu 14.04. I have a production system running for few weeks. Was reviewing logs and found this:

io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
    at io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:101)
    at io.netty.buffer.AbstractDerivedByteBuf.release(AbstractDerivedByteBuf.java:50)
    at io.netty.buffer.CompositeByteBuf$Component.freeIfNecessary(CompositeByteBuf.java:1370)
    at io.netty.buffer.CompositeByteBuf.deallocate(CompositeByteBuf.java:1625)
    at io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:106)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:252)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:152)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:805)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
18:13:29.114 ERROR - Handler class : class cc.blynk.common.handlers.common.decoders.MessageDecoder. Name : MessageDecoder#0

Not sure is that netty bug or my own. Decoder code - https://github.com/blynkkk/blynk-server/blob/master/common/src/main/java/cc/blynk/common/handlers/common/decoders/MessageDecoder.java

@normanmaurer
Copy link
Member

@doom369 can you show me all the handlers you have in your pipeline ?

@ninja-
Copy link

ninja- commented Nov 4, 2015

@doom369 I am pretty sure you broke the refcnt somewhere in your handlers, somewhere near composite buffers...

@doom369
Copy link
Contributor Author

doom369 commented Nov 4, 2015

@ninja- how could that be if in pipeline where error happen there is no my code that manages refcnt at any way. Netty code itself manages this. Also server was running for a weeks with load ~700 req/sec and this is only such error I found.

@doom369
Copy link
Contributor Author

doom369 commented Nov 5, 2015

One more, but look like in another pipeline (with SSL Handler)

io.netty.handler.codec.DecoderException: io.netty.util.IllegalReferenceCountException: refCnt: 0
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:380)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:152)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:805)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:346)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.util.IllegalReferenceCountException: refCnt: 0
    at io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1177)
    at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1128)
    at io.netty.buffer.UnpooledUnsafeDirectByteBuf.internalNioBuffer(UnpooledUnsafeDirectByteBuf.java:480)
    at io.netty.handler.ssl.SslHandler.toByteBuffer(SslHandler.java:1146)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1138)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1028)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:968)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:349)
    ... 13 more

@Scottmitch
Copy link
Member

@doom369 - It seems that buffer which is owned by ByteToMessageDecoder is being "double" released. However the decoder you linked to (MessageDecoder) copies the input buffer into a string, and does not release the input buffer, so that shouldn't create problems.

Is there any place where you don't copy the contents of the buffer and instead use the buffer (or a slice of the buffer) directly? For example writing a buffer you decode to a channel.

Can you reproduce the issue? If so it would be good to know what the reference count cumulation.refCnt() is on line ByteToMessageDecoder:244 before and after the call to callDecode.

I don't see any SSL initialization in HardwareServer, was this just for testing, or is there other code that we should be aware of?

@doom369
Copy link
Contributor Author

doom369 commented Nov 5, 2015

@Scottmitch

Is there any place where you don't copy the contents of the buffer and instead use the buffer (or a slice of the buffer) directly?

No. I deal with netty buffers only in MessageDecoder and MessageEncoder. In all other handlers I work with MessageBase that those classes produce.

Can you reproduce the issue?

No. It is just appearing in production system.

I don't see any SSL initialization in HardwareServer, was this just for testing, or is there other code that we should be aware of?

I have 4 different pipelines. 3 of them using SSL handlers. 1 of them (HardwareServer) doesn't. First stack I posted in ticket is without SSLHandler. So that means that exception came from pipeline of HardwareServer.
But second stack I posted is came from another pipeline (not HardwareServer) which one of those 3 I don't know.

All pipelines using MessageEncoder and MessageDecoder.

@normanmaurer
Copy link
Member

very strange... Currently I have no idea what's wrong :/

@Scottmitch
Copy link
Member

It is interesting that the original leak report you posted stems from the condition if your application code consumes all data in the culmination buffer then ByteToMessageDecoder will release the buffer ... that may explain why you only see it rarely. @doom369 - Can you verify how often you enter the below code block in ByteToMessageDecoder?

if (cumulation != null && !cumulation.isReadable()) {
  numReads = 0;
  cumulation.release();
  cumulation = null;
}

@doom369
Copy link
Contributor Author

doom369 commented Nov 7, 2015

@Scottmitch I'll do new deployment (that is planned) of 4.0.33 version. And will monitor it. If no error I'll close issue, otherwise will add more debug.

@doom369
Copy link
Contributor Author

doom369 commented Nov 11, 2015

4.0.33 didn't help. But I was able to reproduce issue on QA server. Will try to prepare some simple setup for you.

@normanmaurer
Copy link
Member

@doom369 thanks!

@doom369
Copy link
Contributor Author

doom369 commented Dec 20, 2015

@normanmaurer sorry for huge delay. So finally I had a time to play with that.
I'm currently using epoll (4.0.33.Final) + native open SSL via netty-tcnative (1.1.33.Fork8). I made a test that reproduces issue locally. Changing open SSL to JDK makes issue not reproducible, but I saw another error with JDK ssl (may be related) :

Reason : javax.net.ssl.SSLException: Tag mismatch!

So regarding reproducing. Here is a server jar. To run please use :

java -jar server-0.12.2-SNAPSHOT.jar -dataFolder /home/xxx/test &

Also please checkout this project. Find test SimplePerformanceTest.java remove @ignore annotation from method testConnectAppAndHardware and run it.

Now in directory from where you where running jar file you'll find logs folder with blynk.log file with above exception. If log didn't appear you may need to run test few times (issue is not 100% reproducible).

I have this problem on 3 envs. All Ubuntu, 12.04, 14.04. Java version "1.8.0_40", "1.8.0_45"

Please let me know if you need more info.

@normanmaurer normanmaurer self-assigned this Dec 20, 2015
@normanmaurer
Copy link
Member

@doom369 thanks will check!

@normanmaurer
Copy link
Member

@doom369 can you provide me with a version as well where I could include a different netty version to test ? Like not an "all in one jar".

@doom369
Copy link
Contributor Author

doom369 commented Dec 20, 2015

@normanmaurer sure, server.jar I provided is build from those project with SimplePerformanceTest.java test. You may build a project by yourself. Just change ``blynk-server/server/tcp-server/src/main/resources/server.properties` to

enable.native.epoll.transport=true
enable.native.openssl=true

And you may run

mvn clean install -Dmaven.test.skip=true

All netty versions are in main pom file.

You can find output in blynk-server/server/tcp-server/target folder

@normanmaurer
Copy link
Member

@doom369 this only happens with epoll ?

@doom369
Copy link
Contributor Author

doom369 commented Dec 20, 2015

epoll + open ssl. epoll + JDK SSL - not reproducible.

@normanmaurer
Copy link
Member

ok gotcha!

@normanmaurer
Copy link
Member

@doom369 ok can reproduce with your stuff... will come back to you :)

@doom369
Copy link
Contributor Author

doom369 commented Dec 20, 2015

nice =)

@normanmaurer
Copy link
Member

@doom369 so good thing is I know what happens... the bad, I just not understand why 👎

@doom369
Copy link
Contributor Author

doom369 commented Dec 21, 2015

=) Let me know if I can help somehow.

@normanmaurer normanmaurer added this to the 4.0.34.Final milestone Dec 21, 2015
@normanmaurer
Copy link
Member

@doom369 I now know what happens and also why (yay). Working on a fix... to make it short it has to do with the register/deregister that you do in your handler.

@doom369
Copy link
Contributor Author

doom369 commented Dec 21, 2015

@normanmaurer Hm... I had a feeling regarding that. Should tell you before. Glad you found problem.

@normanmaurer
Copy link
Member

@doom369 could you please verify this as well: #4621

@doom369
Copy link
Contributor Author

doom369 commented Dec 23, 2015

@normanmaurer I was not able to build epoll module, maybe you have ready build somewhere?

@normanmaurer
Copy link
Member

Nope :( what is the error?

Am 23.12.2015 um 17:57 schrieb Dmitriy Dumanskiy notifications@github.com:

@normanmaurer I was not able to build epoll module, maybe you have ready build somewhere?


Reply to this email directly or view it on GitHub.

@doom369
Copy link
Contributor Author

doom369 commented Dec 23, 2015

@normanmaurer It was issue with JDK. Fixed, made new build, tested with it, seems not reproducible any more. Thanks!

normanmaurer added a commit that referenced this issue Dec 24, 2015
Motivation:

As a user may call deregister() from within any method while doing processing in the ChannelPipeline,  we need to ensure we do the actual deregister operation later. This is needed as for example,  we may be in the ByteToMessageDecoder.callDecode(...) method and so still try to do processing in the old EventLoop while the user already registered the Channel to a new EventLoop. Without delay, the deregister operation this could lead to have a handler invoked by different EventLoop and so threads.

Modifications:

Ensure the actual deregister will be done later on and not directly when invoked.

Result:

Calling deregister() within ByteToMessageDecoder.decode(..) is safe.
normanmaurer added a commit that referenced this issue Dec 24, 2015
Motivation:

As a user may call deregister() from within any method while doing processing in the ChannelPipeline,  we need to ensure we do the actual deregister operation later. This is needed as for example,  we may be in the ByteToMessageDecoder.callDecode(...) method and so still try to do processing in the old EventLoop while the user already registered the Channel to a new EventLoop. Without delay, the deregister operation this could lead to have a handler invoked by different EventLoop and so threads.

Modifications:

Ensure the actual deregister will be done later on and not directly when invoked.

Result:

Calling deregister() within ByteToMessageDecoder.decode(..) is safe.
normanmaurer added a commit that referenced this issue Dec 24, 2015
Motivation:

As a user may call deregister() from within any method while doing processing in the ChannelPipeline,  we need to ensure we do the actual deregister operation later. This is needed as for example,  we may be in the ByteToMessageDecoder.callDecode(...) method and so still try to do processing in the old EventLoop while the user already registered the Channel to a new EventLoop. Without delay, the deregister operation this could lead to have a handler invoked by different EventLoop and so threads.

Modifications:

Ensure the actual deregister will be done later on and not directly when invoked.

Result:

Calling deregister() within ByteToMessageDecoder.decode(..) is safe.
@normanmaurer
Copy link
Member

Fixed via #4621. @doom369 thanks again for all the help to reproduce it and track down 👍

pulllock pushed a commit to pulllock/netty that referenced this issue Oct 19, 2023
…entLoop.

Motivation:

As a user may call deregister() from within any method while doing processing in the ChannelPipeline,  we need to ensure we do the actual deregister operation later. This is needed as for example,  we may be in the ByteToMessageDecoder.callDecode(...) method and so still try to do processing in the old EventLoop while the user already registered the Channel to a new EventLoop. Without delay, the deregister operation this could lead to have a handler invoked by different EventLoop and so threads.

Modifications:

Ensure the actual deregister will be done later on and not directly when invoked.

Result:

Calling deregister() within ByteToMessageDecoder.decode(..) is safe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants