OOM root cause analyse

1 提示如下错误:Connection pool shut down

现象,操作Dynanodb时,提示错误,最开始怀疑是Dynamodb本身问题,导致OOM。实际这个情况应该是OOM导致了这个错误,分析如下:

具体错误为:java.lang.IllegalStateException: Connection pool shut down

查看错误抛出的地方:

//org.apache.http.impl.conn.PoolingHttpClientConnectionManager#requestConnection
    @Override
    public ConnectionRequest requestConnection(
            final HttpRoute route,
            final Object state) {
        Args.notNull(route, "HTTP route");
        if (this.log.isDebugEnabled()) {
            this.log.debug("Connection request: " + format(route, state) + formatStats(route));
        }
        Asserts.check(!this.isShutDown.get(), "Connection pool shut down");
        // ......
        }

归因于从“连接池”中请求连接时,发现连接池被关闭了,所以查看何时会关闭:只有显示关闭shutdown才会。

//org.apache.http.impl.conn.PoolingHttpClientConnectionManager#shutdown
 @Override
    public void shutdown() {
        if (this.isShutDown.compareAndSet(false, true)) {
           //
        }
    }

何时会显示关闭?参考以下代码,遇到ERROR时,会显示关闭。

org.apache.http.impl.execchain.MainClientExec#execute
  @Override
    public CloseableHttpResponse execute(
            final HttpRoute route,
            final HttpRequestWrapper request,
            final HttpClientContext context,
            final HttpExecutionAware execAware) throws IOException, HttpException {
 
        //省略非关键代码:请求连接,执行请求。
        } catch (final Error error) {
            connManager.shutdown();
            throw error;
        }
    }

而java.lang.OutOfMemoryError继承于Error,正是其中一种,所以遇到OOM时,连接池会关闭,后续不可自动恢复,必须重启Application.

BTW: 这种错误属于Fast Fail的错误,反而不会导致OOM。(不考虑一些落内存的重试)

2 Redis Timeout相关?

错误如下:

org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 2 second(s)
    at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:70)
    at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
    at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
    at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
    at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:278)
    at org.springframework.data.redis.connection.lettuce.LettuceConnection.await(LettuceConnection.java:1086)
    at org.springframework.data.redis.connection.lettuce.LettuceConnection.lambda$doInvoke$4(LettuceConnection.java:939)
    at org.springframework.data.redis.connection.lettuce.LettuceInvoker$Synchronizer.invoke(LettuceInvoker.java:665)

排除,发生OOM之前的错误数量(<100)不至于导致OOM.

3 Bug or 正常使用

如果是一般性bug,OOM后,内存不会自动下降,但是参考如下日志可以看出还是会下降的:内存从8322465096下降至6235528672

{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:16:42.222","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":3703050240,"nonHeapUsed":345615112,"ygcCount":4,"ygcCost":244,"fgcCount":0,"fgcCost":0}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:17:42.223","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":6674228736,"nonHeapUsed":346253752,"ygcCount":5,"ygcCost":291,"fgcCount":0,"fgcCost":0}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:18:42.221","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":6962728408,"nonHeapUsed":346507992,"ygcCount":59,"ygcCost":2126,"fgcCount":0,"fgcCost":0}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:20:14.337","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":8322465096,"nonHeapUsed":345714256,"ygcCount":300,"ygcCost":3192,"fgcCount":35,"fgcCost":13357}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:21:24.884","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":8248065216,"nonHeapUsed":346660992,"ygcCount":253,"ygcCost":1954,"fgcCount":161,"fgcCost":62121}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:21:42.235","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":7522034616,"nonHeapUsed":356767008,"ygcCount":212,"ygcCost":2613,"fgcCount":9,"fgcCost":4216}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:22:42.221","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":8340010192,"nonHeapUsed":359838736,"ygcCount":327,"ygcCost":3584,"fgcCount":34,"fgcCost":13066}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:23:42.219","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":6501521464,"nonHeapUsed":358762232,"ygcCount":110,"ygcCost":1575,"fgcCount":1,"fgcCost":432}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:24:42.227","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":6235528672,"nonHeapUsed":359783160,"ygcCount":24,"ygcCost":583,"fgcCount":0,"fgcCost":0}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:25:42.219","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":7306915680,"nonHeapUsed":360388120,"ygcCount":34,"ygcCost":732,"fgcCount":0,"fgcCost":0}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:26:42.219","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":6650218176,"nonHeapUsed":361077272,"ygcCount":40,"ygcCost":885,"fgcCount":0,"fgcCost":0}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:27:45.410","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":8421057232,"nonHeapUsed":361559696,"ygcCount":148,"ygcCost":1594,"fgcCount":39,"fgcCost":14195}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:29:21.789","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":8300889856,"nonHeapUsed":361789168,"ygcCount":187,"ygcCost":1008,"fgcCount":263,"fgcCost":94826}}
{"monitorType":"async_jvm_monitor","logTime":"2022-11-18 09:30:14.178","metricsDetail":{"host":"ip-10-0-1-146.ec2.internal","heapUsed":7992134312,"nonHeapUsed":361961280,"ygcCount":99,"ygcCost":527,"fgcCount":118,"fgcCost":42734}}