1. Symptoms
On the service client we saw a handful of exceptions every day when connecting to the server. On quiet days there were none; on bad days there were several, and weekends were noticeably worse. The first suspicion was a network problem, but even after putting real-time network monitoring in place the exception still occurred occasionally while the network was perfectly healthy, so the network was ruled out.
The reported error is:
Caused by: java.net.SocketTimeoutException: Read timed out
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1890)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1885)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1884)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1457)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:122)
... 21 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1536)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at com.caucho.hessian.client.HessianURLConnection.sendRequest(HessianURLConnection.java:109)
... 21 more
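For context, the client talks to the server through a Hessian dynamic proxy. The sketch below shows a minimal client of this kind; the interface name and URL are placeholders, and the setReadTimeout call is assumed to be available as in Hessian 4.x. It is this read timeout that expires and produces the Read timed out above when the server is slow to respond.

import com.caucho.hessian.client.HessianProxyFactory;

public class BehaviorLogClientDemo {
    // Placeholder for the real service interface.
    public interface BehaviorLogService {
        void log(String entry);
    }

    public static void main(String[] args) throws Exception {
        HessianProxyFactory factory = new HessianProxyFactory();
        // Assumed Hessian 4.x API: if the server does not answer within this
        // window, the call fails with the SocketTimeoutException shown above.
        factory.setReadTimeout(10000L);
        BehaviorLogService service = (BehaviorLogService) factory.create(
                BehaviorLogService.class, "http://server:7000/behaviorLog");
        service.log("...");
    }
}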
2. Locating the Problem
Continuing the investigation, we added jstack monitoring on the server. Whenever the problem occurred, the jstack dump showed large numbers of threads blocked on log4j logging calls:
"http-nio-7000-exec-102" #188 daemon prio=5 os_prio=0 tid=0x00007fb15005d000 nid=0x4ae7 waiting for monitor entry [0x00007fb21c34a000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.log4j.Category.callAppenders(Category.java:204)
- waiting to lock <0x000000008002def8> (a org.apache.log4j.Logger)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:305)
.........
.........
at com.sun.proxy.$Proxy9.getAobi(Unknown Source)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.caucho.hessian.server.HessianSkeleton.invoke(HessianSkeleton.java:306)
at com.caucho.hessian.server.HessianSkeleton.invoke(HessianSkeleton.java:202)
at com.caucho.hessian.server.HessianServlet.invoke(HessianServlet.java:389)
at com.caucho.hessian.server.HessianServlet.service(HessianServlet.java:369)
at com.baitian.pay.behaviorlog.BehaviorLogHessianServlet.service(BehaviorLogHessianServlet.java:47)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:291)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:212)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:502)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:141)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:521)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1096)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:674)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1500)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1456)
- locked <0x0000000082822d40> (a org.apache.tomcat.util.net.NioChannel)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x0000000080bbf3d0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
Clearly a large number of threads are blocked. We print the behavior log in a proxy wrapped around the service interface, so every request logs exactly once; when requests arrive in dense bursts, threads pile up on log4j's callAppenders. The relevant log4j source is:
/**
   Call the appenders in the hierrachy starting at
   <code>this</code>. If no appenders could be found, emit a
   warning.

   <p>This method calls all the appenders inherited from the
   hierarchy circumventing any evaluation of whether to log or not
   to log the particular log request.

   @param event the event to log. */
public void callAppenders(LoggingEvent event) {
    int writes = 0;

    for (Category c = this; c != null; c = c.parent) {
        // Protected against simultaneous call to addAppender, removeAppender,...
        synchronized (c) {
            if (c.aai != null) {
                writes += c.aai.appendLoopOnAppenders(event);
            }
            if (!c.additive) {
                break;
            }
        }
    }

    if (writes == 0) {
        repository.emitNoAppenderWarning(this);
    }
}
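The "logging in a proxy" arrangement described above looks roughly like the following sketch (a hypothetical reconstruction; names are illustrative). Every invocation of the service interface logs through one shared Logger, so all Tomcat worker threads funnel into the synchronized block in callAppenders:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

import org.apache.log4j.Logger;

// Hypothetical reconstruction of the behavior-log proxy.
public class BehaviorLogProxy implements InvocationHandler {
    // One shared Logger, i.e. one log4j Category, for all behavior logs.
    private static final Logger LOG = Logger.getLogger("behaviorLog");

    private final Object target;

    private BehaviorLogProxy(Object target) {
        this.target = target;
    }

    @SuppressWarnings("unchecked")
    public static <T> T wrap(T target, Class<T> iface) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                new Class<?>[] { iface }, new BehaviorLogProxy(target));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // Runs once per request; every worker thread ends up inside
        // Category.callAppenders competing for the same monitor.
        LOG.info(method.getName() + " invoked");
        return method.invoke(target, args);
    }
}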
3. Root Cause
The source shows that log4j 1.x guards every log call with an old-fashioned synchronized(c) on the Category. Because all behavior logs go through the same Category (the interface is proxied, and the proxy prints the behavior log), every server thread handling a request competes for this single lock.
That means each client request must acquire this lock before it can be answered. Under high concurrency the contention becomes severe: worker threads queue up in callAppenders, responses are delayed past the client's read timeout, and the client surfaces the SocketTimeoutException shown in section 1.
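The contention is easy to reproduce in isolation. The following minimal sketch (log4j 1.x on the classpath; the thread count is arbitrary) hammers one shared Logger from many threads; a jstack taken while it runs shows most threads BLOCKED in Category.callAppenders, just like the dump in section 2:

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Logger;

public class Log4jContentionDemo {
    private static final Logger LOG = Logger.getLogger(Log4jContentionDemo.class);

    public static void main(String[] args) {
        // Install a console appender so appendLoopOnAppenders does real work.
        BasicConfigurator.configure();
        // Simulate many worker threads that each log on every "request".
        for (int i = 0; i < 200; i++) {
            new Thread(() -> {
                while (true) {
                    // Each call enters Category.callAppenders, which
                    // synchronizes on the shared Category instance.
                    LOG.info("behavior log entry");
                }
            }, "worker-" + i).start();
        }
    }
}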
4. Solutions
Two solutions come to mind:
- Buffer log events in a queue and have a few dedicated threads drain it to log4j. Logging is then decoupled from the business path, so slow logging can no longer stall request handling (a sketch follows this list).
- Upgrade from log4j 1.x to log4j 2.x. The official documentation states that in multi-threaded scenarios Log4j 2's asynchronous loggers deliver 10 times the throughput of log4j 1.x. The original text:
Log4j 2 contains next-generation Asynchronous Loggers based on the LMAX Disruptor library. In multi-threaded scenarios Asynchronous Loggers have 10 times higher throughput and orders of magnitude lower latency than Log4j 1.x and Logback.
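A minimal sketch of the first option follows (class name and queue capacity are illustrative; the bullet above suggests a few writer threads, one is enough to show the idea). Request threads only enqueue, and the single dedicated writer thread is the only one that touches log4j, so business threads never block on the Category monitor. Log4j 1.x's built-in AsyncAppender follows the same pattern.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.log4j.Logger;

public class QueuedBehaviorLogger {
    private static final Logger LOG = Logger.getLogger(QueuedBehaviorLogger.class);

    // Bounded, so a slow disk cannot exhaust memory.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10000);

    public QueuedBehaviorLogger() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    // Only this thread ever calls into log4j.
                    LOG.info(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "behavior-log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Called from request threads; offer() returns immediately instead of blocking.
    public void log(String message) {
        if (!queue.offer(message)) {
            // Queue full: drop the entry rather than stall the request.
        }
    }
}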
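As for the second option, Log4j 2's asynchronous loggers are built on the LMAX Disruptor, so the com.lmax:disruptor jar must be on the classpath. The documented way to make all loggers asynchronous is a JVM system property:

-DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector

A log4j2.xml can also mix synchronous and asynchronous loggers via AsyncLogger/AsyncRoot elements. Either way, application threads hand events to the Disruptor ring buffer and return immediately, which removes exactly the lock contention diagnosed above.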