
Troubleshooting Frequent CMS GC Cycles


Symptom:
An application was running frequent, back-to-back CMS GCs every day, starting a little after 3 a.m. and usually not stopping until after 9:30 a.m. Part of the GC log looks like this:

2013-02-26T03:48:12.799+0800: 35183.494: [GC [1 CMS-initial-mark: 945042K(2146304K)] 1551093K(4023680K), 2.1342540 secs] [Times: user=2.13 sys=0.00, real=2.13 secs]
2013-02-26T03:48:14.934+0800: 35185.629: [CMS-concurrent-mark-start]
2013-02-26T03:48:20.074+0800: 35190.769: [CMS-concurrent-mark: 5.140/5.140 secs] [Times: user=5.13 sys=0.00, real=5.14 secs]
2013-02-26T03:48:20.074+0800: 35190.769: [CMS-concurrent-preclean-start]
2013-02-26T03:48:20.082+0800: 35190.777: [CMS-concurrent-preclean: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2013-02-26T03:48:20.082+0800: 35190.777: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 2013-02-26T03:48:25.561+0800: 35196.256: [CMS-concurrent-abortable-preclean: 5.080/5.478 secs] [Times: user=5.06 sys=0.00, real=5.48 secs]
2013-02-26T03:48:25.563+0800: 35196.258: [GC[YG occupancy: 606103 K (1877376 K)]35196.258: [Rescan (parallel) , 1.8514280 secs]35198.110: [weak refs processing, 0.0000370 secs]35198.110: [class unloading, 0.0198590 secs]35198.130: [scrub symbol & string tables, 0.0162050 secs] [1 CMS-remark: 945042K(2146304K)] 1551145K(4023680K), 1.8940180 secs] [Times: user=2.89 sys=0.01, real=1.90 secs]
2013-02-26T03:48:27.458+0800: 35198.153: [CMS-concurrent-sweep-start]
2013-02-26T03:48:32.122+0800: 35202.817: [CMS-concurrent-sweep: 4.665/4.665 secs] [Times: user=4.66 sys=0.07, real=4.66 secs]
2013-02-26T03:48:32.122+0800: 35202.817: [CMS-concurrent-reset-start]
2013-02-26T03:48:32.136+0800: 35202.831: [CMS-concurrent-reset: 0.014/0.014 secs] [Times: user=0.01 sys=0.00, real=0.02 secs]
2013-02-26T03:48:34.139+0800: 35204.834: [GC [1 CMS-initial-mark: 945037K(2146304K)] 1551585K(4023680K), 2.0722950 secs] [Times: user=2.07 sys=0.00, real=2.07 secs]
2013-02-26T03:48:36.213+0800: 35206.908: [CMS-concurrent-mark-start]
2013-02-26T03:48:41.739+0800: 35212.434: [CMS-concurrent-mark: 5.526/5.526 secs] [Times: user=5.52 sys=0.00, real=5.53 secs]
2013-02-26T03:48:41.739+0800: 35212.434: [CMS-concurrent-preclean-start]
2013-02-26T03:48:41.747+0800: 35212.442: [CMS-concurrent-preclean: 0.008/0.008 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2013-02-26T03:48:41.747+0800: 35212.442: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 2013-02-26T03:48:46.856+0800: 35217.551: [CMS-concurrent-abortable-preclean: 4.704/5.109 secs] [Times: user=4.69 sys=0.00, real=5.12 secs]
2013-02-26T03:48:46.857+0800: 35217.552: [GC[YG occupancy: 606547 K (1877376 K)]35217.552: [Rescan (parallel) , 1.6143580 secs]35219.166: [weak refs processing, 0.0000270 secs]35219.166: [class unloading, 0.0194720 secs]35219.186: [scrub symbol & string tables, 0.0147580 secs] [1 CMS-remark: 945037K(2146304K)] 1551585K(4023680K), 1.6548450 secs] [Times: user=2.38 sys=0.00, real=1.65 secs]
2013-02-26T03:48:48.512+0800: 35219.207: [CMS-concurrent-sweep-start]
2013-02-26T03:48:51.573+0800: 35222.268: [CMS-concurrent-sweep: 3.062/3.062 secs] [Times: user=3.05 sys=0.00, real=3.06 secs]
2013-02-26T03:48:51.574+0800: 35222.269: [CMS-concurrent-reset-start]
2013-02-26T03:48:51.588+0800: 35222.283: [CMS-concurrent-reset: 0.014/0.014 secs] [Times: user=0.01 sys=0.00, real=0.02 secs]
2013-02-26T03:48:53.588+0800: 35224.283: [GC [1 CMS-initial-mark: 945037K(2146304K)] 1552588K(4023680K), 2.2012850 secs] [Times: user=2.20 sys=0.00, real=2.20 secs]

The log looks very strange: the old generation was nowhere near the configured 80% occupancy threshold for CMS GC at this point (about 945042K used out of 2146304K, roughly 44%), so I started digging into the cause.

Investigation:
1. Checked the startup parameters: -XX:+UseCMSInitiatingOccupancyOnly was set, so CMS could not have been started by the JVM's own heuristics, only by the configured occupancy fraction;
2. Checked the startup parameters: -XX:+DisableExplicitGC was also set, so an explicit System.gc() call could not have triggered it either;
3. Checked perm gen usage while the frequent CMS GCs were running: it had not reached its configured trigger ratio either, so perm occupancy was not the cause;
4. Eden was not full at that point either, so it could not have been set off by a young GC's pessimistic strategy (promotion guarantee) check;
5. That left pstack to find out what was driving CMS GC to run over and over. The stacks mostly showed ConcurrentMarkSweepThread::run executing CMSCollector::collect_in_background. Looking at the code of ConcurrentMarkSweepThread::run, the part that reaches it is essentially:

while (!_should_terminate) {
    sleepBeforeNextCycle();
    if (_should_terminate) break;
    _collector->collect_in_background(false);  // !clear_all_soft_refs
}

Once control reaches collect_in_background, a CMS GC cycle is already underway…
6. So the key question became why sleepBeforeNextCycle() kept letting execution fall through (returning) instead of continuing to wait. On to its code:

void ConcurrentMarkSweepThread::sleepBeforeNextCycle() {
  while (!_should_terminate) {
    if (CMSIncrementalMode) {
      icms_wait();
      return;
    } else {
      // Wait until the next synchronous GC, a concurrent full gc
      // request or a timeout, whichever is earlier.
      wait_on_cms_lock(CMSWaitDuration);
    }
    // Check if we should start a CMS collection cycle
    if (_collector->shouldConcurrentCollect()) {
      return;
    }
    // .. collection criterion not yet met, let's go back
    // and wait some more
  }
}

CMSIncrementalMode is not enabled here, so the code goes straight to wait_on_cms_lock(CMSWaitDuration). CMSWaitDuration defaults to 2000, meaning that once a CMS GC cycle finishes, the check runs again every 2 seconds (this matches the roughly 2-second gap between each CMS-concurrent-reset and the next initial-mark in the GC log above). The key is therefore _collector->shouldConcurrentCollect(). That method decides whether a CMS GC cycle should be started; setting aside the three trigger conditions 1/2/3 already ruled out above, it contains one more suspicious piece:

if (gch->incremental_collection_will_fail(true /* consult_young */)) {
    if (Verbose && PrintGCDetails) {
      gclog_or_tty->print("CMSCollector: collect because incremental collection will fail ");
    }
    return true;
}

Following that code further, the decision logic eventually flows into:

return _next_gen->promotion_attempt_is_safe(used());

And that is the well-known pessimistic strategy (promotion guarantee) code:

bool ConcurrentMarkSweepGeneration::promotion_attempt_is_safe(size_t max_promotion_in_bytes) const {
  size_t available = max_available();
  size_t av_promo  = (size_t)gc_stats()->avg_promoted()->padded_average();
  bool   res = (available >= av_promo) || (available >= max_promotion_in_bytes);
  if (Verbose && PrintGCDetails) {
    gclog_or_tty->print_cr(
      "CMS: promo attempt is%s safe: available("SIZE_FORMAT") %s av_promo("SIZE_FORMAT"),"
      "max_promo("SIZE_FORMAT")",
      res? "":" not", available, res? ">=":"<",
      av_promo, max_promotion_in_bytes);
  }
  return res;
}

So the suspicion was that this check is what kept triggering CMS GC. Going back through the GC log and working out the average amount promoted into the old generation before all this started, the numbers do indeed get large enough to trip the pessimistic strategy.
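To make the check concrete, here is a minimal, self-contained sketch (this is not HotSpot source) of the same expression, using the old-generation free space implied by the log and purely hypothetical values for the padded average promotion and for the young-generation usage passed in as max_promotion_in_bytes:

#include <cstdio>

int main() {
  // Free old-gen space from the log: 2146304K capacity - ~945042K used,
  // assuming the old gen is fully committed (max_available() would also
  // count any room left to expand).
  unsigned long long available = (2146304ULL - 945042ULL) * 1024;      // ~1.15 GB

  // Hypothetical padded average promotion after the overnight spike.
  unsigned long long av_promo = 1400ULL * 1024 * 1024;                 // ~1.4 GB

  // Hypothetical young-gen usage passed in as max_promotion_in_bytes.
  unsigned long long max_promotion_in_bytes = 1500ULL * 1024 * 1024;   // ~1.5 GB

  // Same expression as promotion_attempt_is_safe() above.
  bool res = (available >= av_promo) || (available >= max_promotion_in_bytes);
  printf("promo attempt is%s safe\n", res ? "" : " not");   // prints "is not safe"
  return 0;
}

Note that the result is "not safe" only when the free old-generation space is below both values; once a promotion spike pushes the padded average that high on an application that rarely runs young GC, the condition can stay true for hours, which is exactly the repeating pattern in the log.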

Fix:
Since the trigger was the pessimistic strategy, the fix chosen was to shrink the young generation so that the old generation's remaining free space stays larger than the average amount promoted from the young generation.
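The post does not give the actual before-and-after sizes, so the following is only an illustration of the kind of adjustment meant here; the heap layout is roughly what the log implies, and the smaller young-generation size is a hypothetical value, not the real change:

# roughly the original layout implied by the log: ~4G heap, ~2G young gen
java -Xms4g -Xmx4g -Xmn2g -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+DisableExplicitGC ...

# hypothetical adjustment: a smaller young gen caps how much one young GC can
# promote and enlarges the old gen, so its free space stays above the padded
# average promotion even right after a spike
java -Xms4g -Xmx4g -Xmn1g -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+DisableExplicitGC ...

The trade-off is slightly more frequent young GCs in exchange for a bigger old-generation cushion; the goal is simply to keep the free old-generation space above avg_promoted()->padded_average() so the check above keeps passing.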

A few more notes:
1. The application ran into this because, right around that time of night, something memory-hungry kicks in, so for a while a large volume of objects gets promoted from the young generation to the old generation. On top of that, this application normally does very few young GCs, so that burst pulled the average promotion size up sharply;

2. In this situation the repeated CMS GCs are actually pointless: the old generation's live set is what it is, and running CMS GC over and over does not make the old generation's free space any bigger. I think this is something the JVM itself could consider optimizing; otherwise there is really no way out of this scenario, because a young GC would trip the same pessimistic strategy when it fires, and the average promotion size never gets a chance to come back down.
