A customer reported that one of their systems hit a severe performance problem late at night last week: the entire system briefly hung and became unusable. Based on the AWR repository data the customer provided, here is a quick first-pass analysis:
```sql
select a.instance_number,
       to_char(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') sample_time,
       b.event_name,
       count(1)
  from wrh$_active_session_history a, wrh$_event_name b
 where a.dbid = 3866720943
   and a.event_id = b.event_id
   and to_char(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') > '2022-11-18 23:23:00'
   and to_char(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') < '2022-11-18 23:33:00'
   -- and b.wait_class <> 'Idle'
   -- and b.event_name not like '%SQL*Net%'
   -- and b.event_name not like 'gc%'
 group by a.instance_number,
          to_char(a.sample_time, 'yyyy-mm-dd hh24:mi:ss'),
          b.event_name
 order by 1, 2;
```

| INSTANCE_NUMBER | SAMPLE_TIME | EVENT_NAME | COUNT(1) |
|---|---|---|---|
| 1 | 2022-11-18 23:27:59 | CGS wait for IPC msg | 2 |
| 1 | 2022-11-18 23:27:59 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:27:59 | db file scattered read | 2 |
| 1 | 2022-11-18 23:27:59 | db file sequential read | 2 |
| 1 | 2022-11-18 23:27:59 | enq: SQ - contention | 308 |
| 1 | 2022-11-18 23:27:59 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 400 |
| 1 | 2022-11-18 23:27:59 | gc current block busy | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 378 |
| 1 | 2022-11-18 23:27:59 | row cache lock | 78 |
| 1 | 2022-11-18 23:28:09 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:28:09 | enq: SQ - contention | 308 |
| 1 | 2022-11-18 23:28:09 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:28:09 | gc buffer busy acquire | 402 |
| 1 | 2022-11-18 23:28:09 | gc cr multi block request | 2 |
| 1 | 2022-11-18 23:28:09 | gc current block busy | 4 |
| 1 | 2022-11-18 23:28:09 | gc current request | 378 |
| 1 | 2022-11-18 23:28:09 | row cache lock | 78 |
| 1 | 2022-11-18 23:28:19 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:28:19 | db file scattered read | 2 |
| 1 | 2022-11-18 23:28:19 | enq: SQ - contention | 308 |
| 1 | 2022-11-18 23:28:19 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:28:19 | gc buffer busy acquire | 404 |
| 1 | 2022-11-18 23:28:19 | gc current block busy | 6 |
| 1 | 2022-11-18 23:28:19 | gc current request | 378 |
| 1 | 2022-11-18 23:28:19 | row cache lock | 78 |
| 1 | 2022-11-18 23:28:30 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:28:30 | db file scattered read | 2 |
| 1 | 2022-11-18 23:28:30 | enq: SQ - contention | 308 |
| 1 | 2022-11-18 23:28:30 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:28:30 | gc buffer busy acquire | 416 |
| 1 | 2022-11-18 23:28:30 | gc current block busy | 16 |
| 1 | 2022-11-18 23:28:30 | gc current request | 378 |
| 1 | 2022-11-18 23:28:30 | row cache lock | 78 |
| 1 | 2022-11-18 23:28:40 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:28:40 | db file scattered read | 2 |
| 1 | 2022-11-18 23:28:40 | enq: SQ - contention | 308 |
| 1 | 2022-11-18 23:28:40 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:28:40 | gc buffer busy acquire | 416 |
| 1 | 2022-11-18 23:28:40 | gc current request | 394 |
| 1 | 2022-11-18 23:28:40 | row cache lock | 78 |
| 1 | 2022-11-18 23:28:50 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:28:50 | control file sequential read | 2 |
| 1 | 2022-11-18 23:28:50 | db file scattered read | 2 |
| 1 | 2022-11-18 23:28:50 | enq: SQ - contention | 310 |
| 1 | 2022-11-18 23:28:50 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:28:50 | gc buffer busy acquire | 414 |
| 1 | 2022-11-18 23:28:50 | gc current block busy | 16 |
| 1 | 2022-11-18 23:28:50 | gc current request | 378 |
| 1 | 2022-11-18 23:28:50 | reliable message | 2 |
| 1 | 2022-11-18 23:28:50 | row cache lock | 78 |
| 1 | 2022-11-18 23:29:01 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:29:01 | db file scattered read | 2 |
| 1 | 2022-11-18 23:29:01 | enq: SQ - contention | 312 |
| 1 | 2022-11-18 23:29:01 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:29:01 | gc buffer busy acquire | 416 |
| 1 | 2022-11-18 23:29:01 | gc current block busy | 18 |
| 1 | 2022-11-18 23:29:01 | gc current request | 378 |
| 1 | 2022-11-18 23:29:01 | row cache lock | 78 |
| 1 | 2022-11-18 23:29:12 | SQL*Net message to client | 2 |
| 1 | 2022-11-18 23:29:12 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:29:12 | db file scattered read | 2 |
| 1 | 2022-11-18 23:29:12 | enq: SQ - contention | 314 |
| 1 | 2022-11-18 23:29:12 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:29:12 | gc buffer busy acquire | 416 |
| 1 | 2022-11-18 23:29:12 | gc current block busy | 20 |
| 1 | 2022-11-18 23:29:12 | gc current request | 378 |
| 1 | 2022-11-18 23:29:12 | row cache lock | 78 |
| 1 | 2022-11-18 23:29:22 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:29:22 | db file scattered read | 2 |
| 1 | 2022-11-18 23:29:22 | enq: SQ - contention | 314 |
| 1 | 2022-11-18 23:29:22 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:29:22 | gc buffer busy acquire | 416 |
| 1 | 2022-11-18 23:29:22 | gc current block 2-way | 2 |
| 1 | 2022-11-18 23:29:22 | gc current block busy | 20 |
| 1 | 2022-11-18 23:29:22 | gc current request | 378 |
| 1 | 2022-11-18 23:29:22 | row cache lock | 78 |
| 1 | 2022-11-18 23:29:32 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:29:32 | db file scattered read | 2 |
| 1 | 2022-11-18 23:29:32 | enq: SQ - contention | 316 |
| 1 | 2022-11-18 23:29:32 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:29:32 | gc buffer busy acquire | 416 |
| 1 | 2022-11-18 23:29:32 | gc current block busy | 20 |
| 1 | 2022-11-18 23:29:32 | gc current request | 378 |
| 1 | 2022-11-18 23:29:32 | row cache lock | 78 |
| 1 | 2022-11-18 23:29:43 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:29:43 | db file scattered read | 2 |
| 1 | 2022-11-18 23:29:43 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:29:43 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:29:43 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:29:43 | gc cr multi block request | 2 |
| 1 | 2022-11-18 23:29:43 | gc current block busy | 26 |
| 1 | 2022-11-18 23:29:43 | gc current request | 380 |
| 1 | 2022-11-18 23:29:43 | row cache lock | 78 |
| 1 | 2022-11-18 23:29:53 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:29:53 | control file sequential read | 4 |
| 1 | 2022-11-18 23:29:53 | db file scattered read | 2 |
| 1 | 2022-11-18 23:29:53 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:29:53 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:29:53 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:29:53 | gc current block busy | 26 |
| 1 | 2022-11-18 23:29:53 | gc current request | 380 |
| 1 | 2022-11-18 23:29:53 | row cache lock | 78 |
| 1 | 2022-11-18 23:30:04 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:30:04 | db file scattered read | 2 |
| 1 | 2022-11-18 23:30:04 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:30:04 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:30:04 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:30:04 | gc current block busy | 26 |
| 1 | 2022-11-18 23:30:04 | gc current request | 378 |
| 1 | 2022-11-18 23:30:04 | row cache lock | 78 |
| 1 | 2022-11-18 23:30:14 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:30:14 | db file scattered read | 2 |
| 1 | 2022-11-18 23:30:14 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:30:14 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:30:14 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:30:14 | gc current block busy | 26 |
| 1 | 2022-11-18 23:30:14 | gc current request | 378 |
| 1 | 2022-11-18 23:30:14 | gc current retry | 2 |
| 1 | 2022-11-18 23:30:14 | row cache lock | 78 |
| 1 | 2022-11-18 23:30:24 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:30:24 | db file scattered read | 2 |
| 1 | 2022-11-18 23:30:24 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:30:24 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:30:24 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:30:24 | gc current block busy | 26 |
| 1 | 2022-11-18 23:30:24 | gc current request | 378 |
| 1 | 2022-11-18 23:30:24 | gc current retry | 2 |
| 1 | 2022-11-18 23:30:24 | row cache lock | 78 |
| 1 | 2022-11-18 23:30:35 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:30:35 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:30:35 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:30:35 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:30:35 | gc cr multi block request | 2 |
| 1 | 2022-11-18 23:30:35 | gc cr request | 2 |
| 1 | 2022-11-18 23:30:35 | gc current block busy | 26 |
| 1 | 2022-11-18 23:30:35 | gc current request | 380 |
| 1 | 2022-11-18 23:30:35 | row cache lock | 80 |
| 1 | 2022-11-18 23:30:45 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:30:45 | db file scattered read | 2 |
| 1 | 2022-11-18 23:30:45 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:30:45 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:30:45 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:30:45 | gc cr block 2-way | 2 |
| 1 | 2022-11-18 23:30:45 | gc current block 2-way | 2 |
| 1 | 2022-11-18 23:30:45 | gc current block busy | 26 |
| 1 | 2022-11-18 23:30:45 | gc current request | 380 |
| 1 | 2022-11-18 23:30:45 | row cache lock | 80 |
| 1 | 2022-11-18 23:30:55 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:30:55 | db file scattered read | 2 |
| 1 | 2022-11-18 23:30:55 | db file sequential read | 2 |
| 1 | 2022-11-18 23:30:55 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:30:55 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:30:55 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:30:55 | gc current block busy | 26 |
| 1 | 2022-11-18 23:30:55 | gc current request | 380 |
| 1 | 2022-11-18 23:30:55 | row cache lock | 80 |
| 1 | 2022-11-18 23:31:06 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:31:06 | db file parallel write | 2 |
| 1 | 2022-11-18 23:31:06 | db file scattered read | 2 |
| 1 | 2022-11-18 23:31:06 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:31:06 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:31:06 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:31:06 | gc current block busy | 26 |
| 1 | 2022-11-18 23:31:06 | gc current request | 378 |
| 1 | 2022-11-18 23:31:06 | gc current retry | 2 |
| 1 | 2022-11-18 23:31:06 | row cache lock | 80 |
| 1 | 2022-11-18 23:31:17 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:31:17 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:31:17 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:31:17 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:31:17 | gc cr multi block request | 2 |
| 1 | 2022-11-18 23:31:17 | gc current block busy | 26 |
| 1 | 2022-11-18 23:31:17 | gc current request | 378 |
| 1 | 2022-11-18 23:31:17 | gc current retry | 2 |
| 1 | 2022-11-18 23:31:17 | row cache lock | 80 |
| 1 | 2022-11-18 23:31:27 | buffer busy waits | 58 |
| 1 | 2022-11-18 23:31:27 | db file scattered read | 2 |
| 1 | 2022-11-18 23:31:27 | enq: SQ - contention | 318 |
| 1 | 2022-11-18 23:31:27 | enq: TX - index contention | 150 |
| 1 | 2022-11-18 23:31:27 | gc buffer busy acquire | 418 |
| 1 | 2022-11-18 23:31:27 | gc current block busy | 26 |
| 1 | 2022-11-18 23:31:27 | gc current request | 378 |
| 1 | 2022-11-18 23:31:27 | row cache lock | 80 |
| 2 | 2022-11-18 23:30:25 | buffer busy waits | 346 |
| 2 | 2022-11-18 23:30:25 | control file sequential read | 2 |
| 2 | 2022-11-18 23:30:25 | db file parallel write | 4 |
| 2 | 2022-11-18 23:30:25 | enq: FB - contention | 2 |
| 2 | 2022-11-18 23:30:25 | enq: SQ - contention | 186 |
| 2 | 2022-11-18 23:30:25 | enq: TX - index contention | 2 |
| 2 | 2022-11-18 23:30:25 | gc buffer busy release | 32 |
| 2 | 2022-11-18 23:30:25 | log file switch (checkpoint incomplete) | 3560 |
| 2 | 2022-11-18 23:30:25 | row cache lock | 14 |
| 2 | 2022-11-18 23:30:25 | write complete waits | 46 |
| 2 | 2022-11-18 23:30:35 | SQL*Net more data to client | 2 |
| 2 | 2022-11-18 23:30:35 | buffer busy waits | 346 |
| 2 | 2022-11-18 23:30:35 | control file sequential read | 2 |
| 2 | 2022-11-18 23:30:35 | db file parallel write | 4 |
| 2 | 2022-11-18 23:30:35 | enq: FB - contention | 2 |
| 2 | 2022-11-18 23:30:35 | enq: SQ - contention | 186 |
| 2 | 2022-11-18 23:30:35 | enq: TX - index contention | 2 |
| 2 | 2022-11-18 23:30:35 | gc buffer busy release | 34 |
| 2 | 2022-11-18 23:30:35 | log file switch (checkpoint incomplete) | 3562 |
| 2 | 2022-11-18 23:30:35 | row cache lock | 14 |
| 2 | 2022-11-18 23:30:35 | write complete waits | 46 |
| 2 | 2022-11-18 23:30:46 | buffer busy waits | 346 |
| 2 | 2022-11-18 23:30:46 | db file parallel write | 4 |
| 2 | 2022-11-18 23:30:46 | enq: FB - contention | 2 |
| 2 | 2022-11-18 23:30:46 | enq: SQ - contention | 186 |
| 2 | 2022-11-18 23:30:46 | enq: TX - index contention | 2 |
| 2 | 2022-11-18 23:30:46 | gc buffer busy release | 34 |
| 2 | 2022-11-18 23:30:46 | log file switch (checkpoint incomplete) | 3572 |
| 2 | 2022-11-18 23:30:46 | row cache lock | 14 |
| 2 | 2022-11-18 23:30:46 | write complete waits | 46 |
| 2 | 2022-11-18 23:30:57 | CSS operation: action | 2 |
| 2 | 2022-11-18 23:30:57 | buffer busy waits | 346 |
| 2 | 2022-11-18 23:30:57 | control file sequential read | 2 |
| 2 | 2022-11-18 23:30:57 | db file parallel write | 4 |
| 2 | 2022-11-18 23:30:57 | enq: FB - contention | 2 |
| 2 | 2022-11-18 23:30:57 | enq: SQ - contention | 186 |
| 2 | 2022-11-18 23:30:57 | enq: TX - index contention | 2 |
| 2 | 2022-11-18 23:30:57 | gc buffer busy release | 32 |
| 2 | 2022-11-18 23:30:57 | latch: redo writing | 4 |
| 2 | 2022-11-18 23:30:57 | log file switch (checkpoint incomplete) | 3578 |
| 2 | 2022-11-18 23:30:57 | row cache lock | 14 |
| 2 | 2022-11-18 23:30:57 | write complete waits | 50 |
| 2 | 2022-11-18 23:31:07 | buffer busy waits | 346 |
| 2 | 2022-11-18 23:31:07 | db file parallel write | 2 |
| 2 | 2022-11-18 23:31:07 | enq: FB - contention | 2 |
| 2 | 2022-11-18 23:31:07 | enq: SQ - contention | 188 |
| 2 | 2022-11-18 23:31:07 | enq: TX - index contention | 2 |
| 2 | 2022-11-18 23:31:07 | gc buffer busy release | 32 |
| 2 | 2022-11-18 23:31:07 | log file switch (checkpoint incomplete) | 3584 |
| 2 | 2022-11-18 23:31:07 | row cache lock | 14 |
| 2 | 2022-11-18 23:31:07 | write complete waits | 50 |
| 2 | 2022-11-18 23:31:18 | buffer busy waits | 346 |
| 2 | 2022-11-18 23:31:18 | control file sequential read | 2 |
| 2 | 2022-11-18 23:31:18 | db file parallel write | 4 |
| 2 | 2022-11-18 23:31:18 | enq: FB - contention | 2 |
| 2 | 2022-11-18 23:31:18 | enq: SQ - contention | 192 |
| 2 | 2022-11-18 23:31:18 | enq: TX - index contention | 2 |
| 2 | 2022-11-18 23:31:18 | gc buffer busy release | 32 |
| 2 | 2022-11-18 23:31:18 | log file switch (checkpoint incomplete) | 3586 |
| 2 | 2022-11-18 23:31:18 | row cache lock | 14 |
| 2 | 2022-11-18 23:31:18 | write complete waits | 50 |
| 2 | 2022-11-18 23:31:29 | buffer busy waits | 346 |
| 2 | 2022-11-18 23:31:29 | db file parallel write | 4 |
| 2 | 2022-11-18 23:31:29 | enq: FB - contention | 2 |
| 2 | 2022-11-18 23:31:29 | enq: SQ - contention | 192 |
| 2 | 2022-11-18 23:31:29 | enq: TX - index contention | 2 |
| 2 | 2022-11-18 23:31:29 | gc buffer busy release | 32 |
| 2 | 2022-11-18 23:31:29 | log file switch (checkpoint incomplete) | 3586 |
| 2 | 2022-11-18 23:31:29 | row cache lock | 14 |
| 2 | 2022-11-18 23:31:29 | write complete waits | 50 |

254 rows selected.
The query output above raises several suspicious points:

1. Starting at 23:27:59, instance 1 suddenly accumulated a large number of gc waits, plus enq: SQ - contention, buffer busy waits, and so on.
2. Between 23:23 and 23:27:59, instance 1 has no wait records at all.
3. The log-switch stall the customer reported, i.e. log file switch (checkpoint incomplete), appears abruptly on instance 2, at 23:30:25.
4. Instance 2 likewise has no performance data at all for the roughly seven minutes between 23:23 and 23:30.

From the pattern of these waits we can draw a few preliminary conclusions:

The log file switch waits are a stacked-on symptom, not the root cause: the system was already running extremely slowly before they appeared. Also, since instance 1 jumped straight to hundreds of gc waits at 23:27, we can rule out a SQL-level performance regression (for example a sudden plan change degrading performance and generating heavy gc traffic). Even so, I still checked the top gc SQL:
```sql
select a.instance_number,
       to_char(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') sample_time,
       b.event_name,
       a.sql_id,
       count(1)
  from wrh$_active_session_history a, wrh$_event_name b
 where a.dbid = 3866720943
   and a.event_id = b.event_id
   and to_char(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') = '2022-11-18 23:27:59'
   -- and to_char(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') < '2022-11-18 23:31:30'
   and b.wait_class <> 'Idle'
   and b.event_name like 'gc%'
   -- and a.instance_number = 1
 group by a.instance_number,
          to_char(a.sample_time, 'yyyy-mm-dd hh24:mi:ss'),
          b.event_name,
          a.sql_id
 order by 1, 2;
```

| INSTANCE_NUMBER | SAMPLE_TIME | EVENT_NAME | SQL_ID | COUNT(1) |
|---|---|---|---|---|
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 070x0h38a9wdj | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 14tfmqd4n7nm5 | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 161cd6mnw8ng6 | 6 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 311un5s5ug5yz | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 41uxzsq73sqya | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 42ck1jx53qubd | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 44xyf34792k2d | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 575n36005r5u3 | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 5m068a02bdzm3 | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 66cg7zfgzgkc2 | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 6pyp8r5zvk2fp | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 6tzsyrszbuwj4 | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 79wbq83vc3ty4 | 112 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 7wrp7q9027usg | 46 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 7y9dhybsfd028 | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 8hpztpr1vx9gp | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | 97khkmrn4kp09 | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | atuz9t4aqvdmm | 156 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | azr364vxnbwm2 | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | cd3dxusq5djgp | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | cmnp0rx2k4yz3 | 6 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | d1sh0yu458vfr | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | d93suk0kjny04 | 6 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | dpsxw3rkuvkqq | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | f88a91rzaaj5t | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | fdabpgn8h14ba | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | fjqnh3937fk06 | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | fzcvkj0mp6wmc | 4 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | gh6t7aztdn8qu | 2 |
| 1 | 2022-11-18 23:27:59 | gc buffer busy acquire | (null) | 2 |
| 1 | 2022-11-18 23:27:59 | gc current block busy | 5m068a02bdzm3 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 00pkrmq6sxw9r | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 038d01h60v2zt | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 062j8smyzkhwh | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 070x0h38a9wdj | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 09xfwud5jxbw8 | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 0dcjdxx0n8cw6 | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 0fkgdvsz385af | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 161cd6mnw8ng6 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 17n64vj7dcbyt | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 18ard9mx5qvng | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 1abtyr0fgr9df | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 1qqvtj9u6kyuq | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 1qyp832vkzmdy | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 1vukxmd490wqd | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 2czdn0wnnpjg2 | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 2gw2xmvr7vtv1 | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 3fgygmw80svq4 | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 3gh06zjqtdu2m | 8 |
| 1 | 2022-11-18 23:27:59 | gc current request | 3nht95b2va65x | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 3uz6p5nqf6304 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 3z3fgx3y6k3gv | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 44xyf34792k2d | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 4hzdv022v5cp6 | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 4ncqpqyqhzw7m | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 53cz8x8znkmcv | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 5amw15qxajduk | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 5b4hjw5y8c3pn | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 5ktnsgz6zt67h | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 5m068a02bdzm3 | 8 |
| 1 | 2022-11-18 23:27:59 | gc current request | 65pm9rq1x691x | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 667z1rnng9pub | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 66cg7zfgzgkc2 | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 6d7wmntzx7x5b | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 6tfq2fcm1rvqy | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 6ttzmtxpcx7tw | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 6tzsyrszbuwj4 | 20 |
| 1 | 2022-11-18 23:27:59 | gc current request | 737yqgwjnvxxf | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 79wbq83vc3ty4 | 8 |
| 1 | 2022-11-18 23:27:59 | gc current request | 7c2tgwdf07av1 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 7fths5kjcw7za | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 7jb8htya1uumm | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 7rh4jsd9pt3tn | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 7wrp7q9027usg | 10 |
| 1 | 2022-11-18 23:27:59 | gc current request | 7y9dhybsfd028 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 86369w0pb0rhc | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | 8hpztpr1vx9gp | 16 |
| 1 | 2022-11-18 23:27:59 | gc current request | 8jbd7zcqr8kzg | 8 |
| 1 | 2022-11-18 23:27:59 | gc current request | 8uz1jman9rqf2 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 94fr6pbh6yd6q | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 94z0rmj029ad9 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | 97khkmrn4kp09 | 10 |
| 1 | 2022-11-18 23:27:59 | gc current request | 9cbxmhp0r4jqg | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | 9xdd22c1guxsj | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | avph9wxh79p42 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | aw43z1x20qp6w | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | azr364vxnbwm2 | 20 |
| 1 | 2022-11-18 23:27:59 | gc current request | b36hnuv6fzt97 | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | cc6dknr8grn7s | 4 |
| 1 | 2022-11-18 23:27:59 | gc current request | chrhs04u5yd9p | 6 |
| 1 | 2022-11-18 23:27:59 | gc current request | cmnp0rx2k4yz3 | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | cz5q8x68gg32j | 12 |
| 1 | 2022-11-18 23:27:59 | gc current request | d0z8qt8nas9vu | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | d93suk0kjny04 | 20 |
| 1 | 2022-11-18 23:27:59 | gc current request | f7f8g58wgmjpf | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | f88a91rzaaj5t | 10 |
| 1 | 2022-11-18 23:27:59 | gc current request | f94mgcvs2as5b | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | fav0q2jkywa7v | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | fdabpgn8h14ba | 10 |
| 1 | 2022-11-18 23:27:59 | gc current request | fjqnh3937fk06 | 2 |
| 1 | 2022-11-18 23:27:59 | gc current request | fvdyxajud739z | 8 |
| 1 | 2022-11-18 23:27:59 | gc current request | fzcvkj0mp6wmc | 22 |
| 1 | 2022-11-18 23:27:59 | gc current request | g97hgk75m2cd0 | 8 |
| 1 | 2022-11-18 23:27:59 | gc current request | gnmuju4p9bumr | 2 |
For the top SQL statements behind the gc waits above, I had the customer check the execution plans: the plan hash values had never changed, which is consistent with the ASH data.
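The plan-stability check can also be done straight from AWR. A minimal sketch against the documented DBA_HIST_SQLSTAT view (the sql_id shown is just one of the IDs from the gc listing above, picked for illustration):

```sql
-- Sketch: list the plan_hash_value recorded per AWR snapshot for one
-- of the top gc SQLs; a stable plan shows a single hash across snapshots.
select sn.begin_interval_time,
       st.sql_id,
       st.plan_hash_value,
       st.executions_delta
  from dba_hist_sqlstat st, dba_hist_snapshot sn
 where st.snap_id = sn.snap_id
   and st.dbid = 3866720943
   and st.sql_id = 'atuz9t4aqvdmm'
 order by sn.begin_interval_time;
```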
Now it gets interesting: if this is not a SQL performance problem, why did roughly 400 gc waits appear out of nowhere? (A genuine SQL performance problem usually builds up waits gradually; it does not materialize 400 of them in a single sample.)

At this point we considered several possibilities:

1. At 23:27 the cluster interconnect (heartbeat) had a problem, severely degrading gc operations.
2. Instance 1 or instance 2 ran into trouble before 23:27:59, for example a core background process hanging, or the instance itself hanging or becoming suspended, which in turn stalled gc processing.
3. Instance 1 or instance 2 hit a CPU resource problem before 23:27:59 (for example system CPU at 100%), starving processes such as LGWR and LMS of CPU and causing the symptoms.

For the first possibility, the customer checked their Zabbix monitoring and found no anomaly. The interconnect traffic was on the high side, peaking at a sustained 60-70 Mb/s, which is somewhat elevated for this rather old midrange hardware, but not enough to cause an impact of this magnitude.

The same monitoring showed CPU utilization was normal, so we could rule out possibilities 1 and 3.

With this hypothesis in mind, I logged into the customer environment remotely and checked the trace directory on node 2. A diag dump had indeed been produced that night, containing the following:
```text
*** 2022-11-18 23:25:09.055
HM: Short Stack of immediate waiter session ID 12642, OSPID 16261 of hang ID 67

*** 2022-11-18 23:25:09.191
Short stack dump:
ksedsts()+592<-ksdxfstk()+48<-ksdxcb()+3376<-sspuser()+688<-<kernel><-_pw_wait()+48<-pw_wait()+112<-sskgpwwait()+432<-skgpwwait()+336<-ksliwat()+3552<-kslwaitctx()+304<-kcrfws()+864<-kcbchg1_main()+26976<-kcbchg1()+352<-ktubgn()+2224<-knledir()+1424<-knleEva()+160<-evaopn2()+1792<-qesltcEvalOutofLineCols()+1568<-qesltcBeforeRowProcessing()+1184<-qerltcNoKdtBufferedInsRowCBK()+464<-qerltcSingleRowLoad()+640<-qerltcFetch()+720<-insexe()+1600<-opiexe()+16336<-kpoal8()+4832<-opiodr()+1920<-ttcpip()+1744<-opitsk()+2992<-opiino()+1696<-opiodr()+1920<-opidrv()+1616<-sou2o()+256<-opimai_real()+624<-ssthrdmain()+608<-main()+336<-main_opd_entry()+80

HM: Short Stack of root session ID 4510, OSPID 3151 of hang ID 67
Short stack dump:
ksedsts()+592<-ksdxfstk()+48<-ksdxcb()+3376<-sspuser()+688<-<kernel><-_read_sys()+48<-_read()+224<-$cold_skgfqio()+880<-ksfd_skgfqio()+416<-ksfd_io()+1184<-ksfdread()+992<-kfk_ufs_sync_io()+1568<-kfk_submit_ufs_io()+416<-kfk_submit_io()+224<-kfk_io1()+928<-kfkRequest()+48<-kfk_transitIO()+2608<-kfioSubmitIO()+3248<-kfioRequestPriv()+208<-kfioRequest()+880<-ksfd_kfioRequest()+800<-ksfd_osmio()+624<-$cold_ksfd_io()+464<-ksfdread()+992<-kccrbp()+736<-kccbmp_get()+608<-kccsed_rbl()+432<-kccocx()+2016<-kcc_begin_txn_internal()+272<-kcrfwl()+544<-ksbabs()+1520<-ksbrdp()+2832<-opirip()+1312<-opidrv()+1152<-sou2o()+256<-opimai_real()+352<-ssthrdmain()+608<-main()+336<-main_opd_entry()+80
```
So at 23:25:09, session 4510 on node 2 was flagged as hung. What is that session?
| Blocking Sid (Inst) | % Activity | Event Caused | % Event | User | Program | # Samples Active | XIDs |
|---|---|---|---|---|---|---|---|
| 4510, 1( 2) | 57.91 | log file switch (checkpoint incomplete) | 54.11 | SYS | oracle@fdcdb2 (LGWR) | 41/120 [ 34%] | 0 |
| 4510, 1( 2) | 57.91 | log file sync | 3.77 | SYS | oracle@fdcdb2 (LGWR) | 41/120 [ 34%] | 0 |
| 6953, 707( 2) | 2.19 | enq: SQ - contention | 2.19 | FDCMGR | JDBC Thin Client | 7/120 [ 6%] | 0 |
| 3771,39737( 2) | 1.71 | buffer busy waits | 1.71 | FDCMGR | JDBC Thin Client | 7/120 [ 6%] | 0 |
The ASH report tells us that session 4510 is the LGWR process. So LGWR on node 2 was already in trouble at 23:25:09.

That makes the whole picture fit together...
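The same identification can be made directly from the AWR ASH data. A minimal sketch (DBA_HIST_ACTIVE_SESS_HISTORY exposes SESSION_ID, SESSION_SERIAL# and PROGRAM; the time window is the incident window from earlier):

```sql
-- Sketch: map the hung session from the diag dump back to a process name.
select distinct session_id, session_serial#, program
  from dba_hist_active_sess_history
 where dbid = 3866720943
   and instance_number = 2
   and session_id = 4510
   and sample_time between timestamp '2022-11-18 23:20:00'
                       and timestamp '2022-11-18 23:35:00';
```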
We kept digging, because the problem always cleared on its own after a while and I still suspected something abnormal on the application side. I had the customer review the two top gc-wait SQL statements mentioned earlier: over the last week their execution peaks fell on the 13th, 16th, and 18th, and all three dates coincide with the business incidents. The only difference is volume: the SQL executed about 90k times on the 13th, 150k times on the 16th, and 170k times on the 18th. According to the customer, similar but milder problems did occur on the 13th and 16th; the 13th in particular had only a brief impact.

Putting the facts and symptoms together, I believe the problem does originate in this business module: it apparently does not run on a fixed schedule, and when its volume gets large enough it generates heavy gc traffic, which ultimately triggers the incident. Indeed, the statistics show that as the execution count grows, the SQL's average elapsed time per execution stretches out.
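The execution-count versus elapsed-time trend can be pulled from AWR as well. A sketch against DBA_HIST_SQLSTAT (the sql_id is a placeholder chosen from the top gc list above; elapsed_time_delta is in microseconds):

```sql
-- Sketch: daily executions and average elapsed time per execution
-- for one of the suspect SQLs over the past week.
select trunc(sn.begin_interval_time) day,
       sum(st.executions_delta) execs,
       round(sum(st.elapsed_time_delta) /
             nullif(sum(st.executions_delta), 0) / 1000, 2) avg_ela_ms
  from dba_hist_sqlstat st, dba_hist_snapshot sn
 where st.snap_id = sn.snap_id
   and st.dbid = 3866720943
   and st.sql_id = '79wbq83vc3ty4'
 group by trunc(sn.begin_interval_time)
 order by 1;
```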
Finally, it has to be said: for a business system this large, with roughly 3,800-4,000 sessions per node and about 30 MB/s of redo per node, it is high time to retire such ancient midrange servers...

I thought the story was about to end, but today the customer reported that the problem recurred this morning. The impact was minor, yet it still needed another look. Judging from the diag trace the customer provided, it looks somewhat different from the earlier incident:
```text
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (xxxxx.xxxxx1)
                   os id: 5621
              process id: 3840, oracle@xxxxx1
              session id: 17
        session serial #: 62917
    }
    is waiting for 'log file switch (checkpoint incomplete)' with wait info:
    {
          time in wait: 17.585836 sec (last interval)
          time in wait: 44.028985 sec (total)
         timeout after: never
               wait id: 13
              blocking: 0 sessions
          wait history:
            * time between current wait and wait #1: 0.000000 sec
            1.       event: 'latch: redo writing'
               time waited: 0.008565 sec
                   wait id: 19    p1: 'address'=0xc000000087dfeb48
                                  p2: 'number'=0xcf
                                  p3: 'tries'=0x0
            * time between wait #1 and #2: 0.000000 sec
            2.       event: 'log file switch (checkpoint incomplete)'
               time waited: 3.465187 sec (last interval)
               time waited: 26.434584 sec (total)
                   wait id: 13
            * time between wait #2 and #3: 0.000000 sec
            3.       event: 'latch: redo writing'
               time waited: 0.008045 sec
                   wait id: 18    p1: 'address'=0xc000000087dfeb48
                                  p2: 'number'=0xcf
                                  p3: 'tries'=0x0
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (xxxxx.xxxxx1)
                   os id: 26277
              process id: 20, oracle@xxxxx1 (LGWR)
              session id: 4510
        session serial #: 1
    }
    which is waiting for 'control file sequential read' with wait info:
    {
                      p1: 'file#'=0x1
                      p2: 'block#'=0x1
                      p3: 'blocks'=0x1
            time in wait: 0.000000 sec
           timeout after: never
                 wait id: 2047504176
                blocking: 1249 sessions
            wait history:
              * time between current wait and wait #1: 0.000013 sec
              1.       event: 'control file sequential read'
                 time waited: 0.000157 sec
                     wait id: 2047504175    p1: 'file#'=0x0
                                            p2: 'block#'=0x1
                                            p3: 'blocks'=0x1
              * time between wait #1 and #2: 0.000074 sec
              2.       event: 'KSV master wait'
                 time waited: 0.000013 sec
                     wait id: 2047504174
              * time between wait #2 and #3: 0.000034 sec
              3.       event: 'control file sequential read'
                 time waited: 0.000242 sec
                     wait id: 2047504173    p1: 'file#'=0x0
                                            p2: 'block#'=0x2e
                                            p3: 'blocks'=0x1
    }

Chain 1 Signature: 'control file sequential read'<='log file switch (checkpoint incomplete)'
Chain 1 Signature Hash: 0xf23230fd
```
The hang analysis shows that virtually every session is blocked by the LGWR process, with more than 1,200 blocked sessions, while LGWR itself is waiting on control file sequential read. A little odd, isn't it?

Since it is the same trace file, we can also see the previous incident, i.e. the failure window on the 18th:
```text
*** 2022-11-18 23:24:27.017
===============================================================================
END OF HANG ANALYSIS
===============================================================================

*** 2022-11-18 23:25:08.510
Suspected Hangs in the System
                    Root  Total               Hang   Hang
  Hang Hang         Inst  Root  #hung #hung   Hang   Span   Resolution
    ID Type Status   Num  Sess   Sess  Sess   Conf          Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
    67 HANG VALID       2  4510     4  1446 HIGH   GLOBAL Terminate Process

 inst# SessId  Ser#     OSPID  Event
 ----- ------ ----- --------- -----
     1     54 16663      9690  enq: TX - index contention
     1   6946 47773      9701  gc current request
     2  12642 11983     16261  log file switch (checkpoint incomplete)
     2   4510     1      3151  control file sequential read
......

*** 2022-11-18 23:25:34.711
Suspected Hangs in the System
                    Root  Total               Hang   Hang
  Hang Hang         Inst  Root  #hung #hung   Hang   Span   Resolution
    ID Type Status   Num  Sess   Sess  Sess   Conf          Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
    74 HANG VALID       2  4510     4  2402 HIGH   GLOBAL Terminate Process

 inst# SessId  Ser#     OSPID  Event
 ----- ------ ----- --------- -----
     1     11 15447     18842  enq: TX - index contention
     1   4763 31315     10670  gc current request
     2  11642    39     17120  log file switch (checkpoint incomplete)
     2   4510     1      3151  not in wait

*** 2022-11-18 23:25:41.065
HM: Matched Hang ID=71 first occurred 11/18/2022 23:25:07, matched 1 times
  2/345/44037/latch: redo writing <- 2/228/36601/buffer busy waits

*** 2022-11-18 23:25:41.065
HM: Matched Hang ID=72 first occurred 11/18/2022 23:25:07, matched 1 times
  2/17224/35815/latch: redo writing <- 2/12442/46953/buffer busy waits
```
Here too, the LGWR process (session 4510) is either waiting on control file sequential read or simply "not in wait". So these incidents are essentially the same problem.

At this point we suspected something wrong on the checkpoint side, with checkpoints not being triggered for long stretches. Going back over the earlier AWR data, some statistics do look abnormal:
```text
Statistic                          Total        per Second  per Trans
physical writes from cache         20,738,580     5,733.14       6.66
physical writes non checkpoint     17,877,771     4,942.28       5.74
```
The difference between physical writes from cache and physical writes non checkpoint is the portion of DBWR physical writes triggered by Oracle's incremental checkpoint. That proportion is strikingly low here, only around 14% (20,738,580 - 17,877,771 = 2,860,809). In other words, the remaining ~86% of the writes were presumably triggered by other mechanisms, such as the buffer cache aging out dirty blocks.
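For reference, the same ratio can be computed live from v$sysstat. A sketch (these are instance-lifetime cumulative counters, unlike the interval deltas in the AWR report, so the percentages will not match exactly):

```sql
-- Sketch: checkpoint-driven share of DBWR physical writes,
-- computed as (writes from cache - non-checkpoint writes) / writes from cache.
select max(decode(name, 'physical writes from cache', value)) -
       max(decode(name, 'physical writes non checkpoint', value)) ckpt_writes,
       round((max(decode(name, 'physical writes from cache', value)) -
              max(decode(name, 'physical writes non checkpoint', value))) * 100 /
             max(decode(name, 'physical writes from cache', value)), 1) ckpt_pct
  from v$sysstat
 where name in ('physical writes from cache',
                'physical writes non checkpoint');
```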
Oracle introduced incremental checkpoints in 8i, and 10g added a self-tuning incremental checkpoint mechanism. The main related parameters are:
```text
NAME                                  VALUE   DESCRIB
------------------------------------  ------  ------------------------------------------------------------
_selftune_checkpoint_write_pct        3       Percentage of total physical i/os for self-tune ckpt
_log_checkpoint_recovery_check        0       # redo blocks to verify after checkpoint
log_checkpoint_interval               0       # redo blocks checkpoint threshold
log_checkpoint_timeout                1800    Maximum time interval between checkpoints in seconds
_gc_global_checkpoint_scn             TRUE    if TRUE, enable global checkpoint scn
_disable_incremental_checkpoints      FALSE   Disable incremental checkpoints for thread recovery
_disable_selftune_checkpointing       FALSE   Disable self-tune checkpointing
_selftune_checkpointing_lag           300     Self-tune checkpointing lag the tail of the redo log
log_checkpoints_to_alert              FALSE   log checkpoint begin/end to alert file
_kdli_checkpoint_flush                FALSE   do not invalidate cache buffers after write
```
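The listing above was presumably produced from the undocumented hidden-parameter views. A hedged sketch of that kind of query (x$ksppi/x$ksppcv are internal x$ structures, queryable only as SYS, and their layout is not documented):

```sql
-- Sketch: dump checkpoint-related (including hidden) parameters.
select x.ksppinm  name,
       y.ksppstvl value,
       x.ksppdesc describ
  from x$ksppi x, x$ksppcv y
 where x.indx = y.indx
   and (x.ksppinm like '%checkpoint%' or x.ksppinm like '%ckpt%')
 order by 1;
```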
From 10g onward, the self-tuning incremental checkpoint is enabled by default. Besides the DBWR-related parameters above, there is also a parameter tied to the redo log size:
```text
NAME                                  VALUE   DESCRIB
------------------------------------  ------  ------------------------------------------------------------
_target_rba_max_lag_percentage        81      target rba max log lag percentage
```
This parameter defaults to 81% in 11.2 (90% in 10g). Roughly speaking, it caps the checkpoint lag: in the extreme case, a checkpoint is only forced once the redo logs are about 81% full. MOS notes also describe certain scenarios in which the self-tuning incremental checkpoint mechanism still has room for improvement, depending on cpu_count, the log_buffer size, the redo log size, and the MTTR target.
For this customer's problem, we suggested the following two adjustments, followed by a period of observation:

1. Increase log_buffer to 300 MB.
2. Set fast_start_mttr_target to 600 seconds.
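As a sketch, the two changes could be applied like this (assuming an spfile; sid='*' applies the change to both RAC instances). Note that log_buffer is a static parameter, so it only takes effect after the instances are restarted, whereas fast_start_mttr_target is dynamic:

```sql
-- log_buffer is specified in bytes: 300 MB = 314572800.
alter system set log_buffer = 314572800 scope = spfile sid = '*';
alter system set fast_start_mttr_target = 600 scope = both sid = '*';
```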