本案例来自西区某客户,数据库版本为12.2,6节点rac,当时应用反应异常缓慢。十万火急,需要立马解决。当我接手时,故障已经持续了3个小时了。
当时查询gv$session时,发现大量会话的final blocking session都是6节点的LCK1进程,该进程等待事件为libcache interrupt action by LCK。由于时间紧急,果断的abort掉了6节点的实例,数据库立马恢复了正常。
当分析原因的时候,发现当时数据库hang得非常严重,连ash和awr都没有产生。但是从diag/dia0/lmhb trace中可以发现一些有价值的信息。
首先LCK是rac非常核心的进程,用于同步GES层面的一些实例级别的lock信息,比如row cache、library cache。而lmbh正好会监控这些核心rac核心进程。
从lmhb trace可以看到:
*** 2022-07-31T21:10:01.053896+08:00 (CDB$ROOT(1)) ============================== LCK1 (ospid: 52025) has not moved for 14 sec (1659273000.1659272986) ... ... *** 2022-07-31T23:25:32.096842+08:00 (CDB$ROOT(1)) ============================== LCK1 (ospid: 52025) has not moved for 8146 sec (1659281132.1659272986)
LCK1进程已经hang住了,并且持续了很久。并且lmhb中还记录了LCK1的process信息,包含了call stack。
*** 2022-07-31T21:10:11.326435+08:00 (CDB$ROOT(1)) ============================== LCK1 (ospid: 52025) has not moved for 24 sec (1659273010.1659272986) : heartbeat check status 2 (acceptable) (threshold 70 sec) : heartbeat poke time 0x62e67f24 req 0x120ca4 ack 0x120ca3 freq 10 : heartbeat state 0x41.02 (libcache) pso-flag 0x0 : waiting for event 'libcache interrupt action by LCK' for 19 secs with wait_id 841902246. ===[ Wait Chain ]=== Wait chain is empty. ============================== Dumping PROCESS LCK1 (ospid: 52025) States ============================== ===[ Callstack ]=== *** 2022-07-31T21:10:11.326637+08:00 (CDB$ROOT(1)) Process diagnostic dump for oracle@gapdb4 (LCK1), OS id=52025, pid: 31, proc_ser: 1, sid: 5829, sess_ser: 13529 ------------------------------------------------------------------------------- os thread scheduling delay history: (sampling every 1.000000 secs) 0.000000 secs at [ 21:10:10 ] NOTE: scheduling delay has not been sampled for 0.807660 secs 0.000000 secs from [ 21:10:06 - 21:10:11 ], 5 sec avg 0.000000 secs from [ 21:09:12 - 21:10:11 ], 1 min avg 0.000000 secs from [ 21:05:11 - 21:10:11 ], 5 min avg *** 2022-07-31T21:10:13.735792+08:00 (CDB$ROOT(1)) loadavg : 8.68 7.66 7.18 System user time: 0.11 sys time: 0.01 context switch: 89486 Memory (Avail / Total) = 11748.10M / 515601.66M Swap (Avail / Total) = 49147.25M / 49152.00M F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 0 R oracle 52025 1 2 80 0 - 51757151 - Mar16 ? 2-18:57:53 ora_lck1_paascdba6 Name: ora_lck1_paascd State: R (running) Tgid: 52025 Ngid: 0 Pid: 52025 PPid: 1 TracerPid: 0 Uid: 1100 1100 1100 1100 Gid: 1101 1100 1100 1100 FDSize: 64 Groups: 1100 1101 NStgid: 52025 NSpid: 52025 NSpgid: 52025 NSsid: 52025 VmPeak: 207028604 kB VmSize: 207028604 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 67848 kB VmRSS: 67784 kB VmData: 6264 kB VmStk: 152 kB VmExe: 343608 kB VmLib: 33008 kB VmPTE: 568 kB VmPMD: 168 kB VmSwap: 0 kB Threads: 1 SigQ: 0/2062272 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000016400207 SigCgt: 00000003c9887cf8 CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: 0000003fffffffff Seccomp: 0 Speculation_Store_Bypass: thread vulnerable Cpus_allowed: ffff,ffffffff,ffffffff Cpus_allowed_list: 0-79 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000f Mems_allowed_list: 0-3 voluntary_ctxt_switches: 4999124562 nonvoluntary_ctxt_switches: 52927652 Short stack dump: ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()+513<-ksbcti()+247<-ksbabs()+551<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245
从loadavg,包括当时cpu使用率,内存使用率来看,当时负载其实很低,并且LCK1进程处于R状态。call stack为:kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()。负载很低的情况下,LCK1 hang住那么很可能是一个bug。
分析diag trace的systemtate dump,可以找到LCK1进程的session state dump信息。
SO: 0x9e4987468, type: 4, owner: 0xa03c72a88, flag: INIT/-/-/-/0x00 i f: 0x3 c: 0x3 proc=0xa03c72a88, name=session, file=ksu.h LINE:15737, pg=0, conuid=1 SGA version=(1,0) (session) sid: 5829 ser: 13529 trans: (nil), creator: 0xa03c72a88 flags: (0x51) USR/- flags2: (0x409) -/-/INC flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/- 0000-0000-000000000000-0000-00000000 DID: , short-term DID: txn branch: (nil) con_id/con_uid/con_name: 1/1/CDB$ROOT con_logonuid: 1 con_logonid: 1 con_scuid: 1 con_scid: 1 edition#: 0 user#/name: 0/SYS oct: 0, prv: 0, sql: (nil), psql: (nil) stats: 0x97fe0b988, PX stats: 0x110e9744 service name: SYS$BACKGROUND Current Wait Stack: 0: waiting for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902246 seq_num=14388 snap_id=1 wait times: snap=16 min 2 sec, exc=16 min 2 sec, total=16 min 2 sec wait times: max=infinite, heur=16 min 2 sec wait counts: calls=0 os=0 in_wait=1 iflags=0x5a2 There are 565 sessions blocked by this session. Dumping one waiter: inst: 1, sid: 13726, ser: 15144 wait event: 'enq: IV - contention' p1: 'type|mode'=0x49560005 p2: 'id1'=0x53594e43 p3: 'id2'=0x80 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 min_blocked_time: 949 secs, waiter_cache_ver: 51635 Wait State: fixed_waits=0 flags=0x22 boundary=(nil)/-1 Session Wait History: elapsed time of 0.000002 sec since current wait 0: waited for 'rdbms ipc message' timeout=0xfd, =0x0, =0x0 wait_id=841902245 seq_num=14387 snap_id=1 wait times: snap=0.000318 sec, exc=0.000318 sec, total=0.000318 sec wait times: max=2.530000 sec wait counts: calls=1 os=1 occurred after 0.000001 sec of elapsed time 1: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902244 seq_num=14386 snap_id=1 wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000002 sec of elapsed time 2: waited for 'rdbms ipc message' timeout=0xfd, =0x0, =0x0 wait_id=841902243 seq_num=14385 snap_id=1 wait times: snap=0.000146 sec, exc=0.000146 sec, total=0.000146 sec wait times: max=2.530000 sec wait counts: calls=1 os=1 occurred after 0.000002 sec of elapsed time 3: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902241 seq_num=14384 snap_id=2 wait times: snap=0.000048 sec, exc=0.000079 sec, total=0.000363 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000000 sec of elapsed time 4: waited for 'enq: IV - contention' type|mode=0x49560003, id1=0x53594e43, id2=0x80 wait_id=841902242 seq_num=14383 snap_id=1 wait times: snap=0.000284 sec, exc=0.000284 sec, total=0.000284 sec wait times: max=infinite wait counts: calls=3 os=3 occurred after 0.000000 sec of elapsed time 5: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902241 seq_num=14382 snap_id=1 wait times: snap=0.000031 sec, exc=0.000031 sec, total=0.000031 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000002 sec of elapsed time 6: waited for 'rdbms ipc message' timeout=0xfd, =0x0, =0x0 wait_id=841902240 seq_num=14381 snap_id=1 wait times: snap=0.000436 sec, exc=0.000436 sec, total=0.000436 sec wait times: max=2.530000 sec wait counts: calls=1 os=1 occurred after 0.000001 sec of elapsed time 7: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902239 seq_num=14380 snap_id=1 wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000002 sec of elapsed time 8: waited for 'rdbms ipc message' timeout=0xfd, =0x0, =0x0 wait_id=841902238 seq_num=14379 snap_id=1 wait times: snap=0.000097 sec, exc=0.000097 sec, total=0.000097 sec wait times: max=2.530000 sec wait counts: calls=1 os=1 occurred after 0.000001 sec of elapsed time 9: waited for 'libcache interrupt action by LCK' location=0x1, =0x0, =0x0 wait_id=841902237 seq_num=14378 snap_id=1 wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.000002 sec of elapsed time Sampled Session History of session 5829 serial 13529 --------------------------------------------------- The sampled session history is constructed by sampling the target session every 1 second. The sampling process captures at each sample if the session is in a non-idle wait, an idle wait, or not in a wait. If the session is in a non-idle wait then one interval is shown for all the samples the session was in the same non-idle wait. If the session is in an idle wait or not in a wait for consecutive samples then one interval is shown for all the consecutive samples. Though we display these consecutive samples in a single interval the session may NOT be continuously idle or not in a wait (the sampling process does not know). The history is displayed in reverse chronological order. sample interval: 1 sec, max history 120 sec --------------------------------------------------- [120 samples, 21:23:54 - 21:25:54] waited for 'libcache interrupt action by LCK', seq_num: 14388 p1: 'location'=0x1 p2: ''=0x0 p3: ''=0x0 time_waited: >= 120 sec (still in wait) --------------------------------------------------- Sampled Session History Summary: longest_non_idle_wait: 'libcache interrupt action by LCK' [120 samples, 21:23:54 - 21:25:54] time_waited: >= 120 sec (still in wait) --------------------------------------------------- ---------------------------------------- kpoxc: mode=0 static=0 inReq=0 pendErrInRep=0 callUseSCN=0 nonReq=0 kpoxc: trace=0 roundTrip=0 errCode=0 ctl=0 eocs=0 disErr=0 dscnEnabled=0 temporary object counter: 0 last pdbid to increment: 0, last function to increment: (null) ---------------------------------------- Virtual Thread: kgskvt: 0x9e9369820, sess: 0x9e4987468/1 sid: 5829 ser: 13529 vc: (nil), proc: 0xa03c72a88, id: 5829 consumer group cur: _ORACLE_BACKGROUND_GROUP_/1 (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig: vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1 inwait: 0, short wait event: 0 posted_run: 0 lastmodrngcnt: 0, lastmodrngcnt_loc: '(null)' lastmodrblcnt: 0, lastmodrblcnt_loc: '(null)' location where insched last set: kgskthrrun location where insched last cleared: kgskthrrun1 location where inwait last set: NULL location where inwait last cleared: NULL is_assigned: 0, in_sched: 0 (0) qcls: (nil), qlink: FALSE vt_active: 0 (pending: 0) vt_pq_active: 0, dop: 0, pq_servers (cur: 0 cg: 0) ps_allocs: 0, pxstmts (act: 0, done: 0 cg: 0) used quanta (usecs): stmt: 0, accum: 0, mapped: 0, tot: 0 exec start consumed time: 0 exec start elapsed time: 0 idle time: 0, active time: 0 (cg: 0) last updnumps: 0, active time (pq: 0 ps: 0) cpu yields: stmt: 0, accum: 0, mapped: 0, tot: 0 cpu waits: stmt: 0, accum: 0, mapped: 0, tot: 0 cpu wait time (usecs): stmt: 0, accum: 0, mapped: 0, tot: 0 ASL queued time outs: 0, time: 0 (cur 0, cg 0) PQQ queued time outs: 0, time: 0 (cur 0, cg 0) Queue timeout violation: 0 calls aborted: 0, num est exec limit hit: 0 undo current: 0k max: 0k KTU Session Commit Cache Dump for IDLs: KTU Session Commit Cache Dump for Non-IDLs: ---------------------------------------- KKS-UOL used : 0 locks(used=0, free=0) KGX Atomic Operation Log 0x85f5c8388 Mutex (nil)(0, 0) idn 0 oper NONE(0) Cursor Pin uid 5829 efd 7 whr 10 slp 0 KGX Atomic Operation Log 0x85f5c83d8 Mutex (nil)(0, 0) idn 0 oper NONE(0) FSO mutex uid 5829 efd 0 whr 0 slp 0 KGX Atomic Operation Log 0x85f5c8428 Mutex (nil)(0, 0) idn 0 oper NONE(0) FSO mutex uid 5829 efd 0 whr 0 slp 0 KGX Atomic Operation Log 0x85f5c8478 Mutex (nil)(0, 0) idn 0 oper NONE(0) FSO mutex uid 5829 efd 0 whr 0 slp 0 ---------------------------------------- KGL-UOL SO Cache(total=0, free=0) KGX Atomic Operation Log 0x83ff5cd70 Mutex 0x6fec4aa30(5829, 0) idn ccf13f6f oper EXCL(6) Library Cache uid 5829 efd 4 whr 148 slp 0 oper=0 pt1=0x6fec4a8e0 pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 Could not acquire mutex...Returning without dumping KGX Atomic Operation Log 0x83ff5cdd8 Mutex 0x6fec4a990(5829, 0) idn 35a7b2c oper EXCL(6) Library Cache uid 5829 efd 4 whr 101 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5ce40 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 65535 whr 134 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5cea8 Mutex 0x390c8e228(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 7 whr 163 slp 0 oper=0 pt1=0x390c8e0d8 pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5cf10 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5cf78 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5cfe0 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5d048 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000 KGX Atomic Operation Log 0x83ff5d0b0 Mutex (nil)(0, 0) idn 0 oper NONE(0) Library Cache uid 5829 efd 0 whr 0 slp 0 oper=0 pt1=(nil) pt2=(nil) pt3=(nil) pt4=(nil) pt5=(nil) ub4=0 flg=0x0 msk=0000-0000-0000-0000-0000
可以看到当时LCK1已经以EXCL模式持有了两个mutex,还有一句Could not acquire mutex…Returning without dumping 。没见过,看着像还在请求其他mutex,并且失败。
搜索mos发现与Bug 30384121 – lck process hang causing entire database to hang (Doc ID 30384121.8)非常匹配,包括关键的call stack信息。
@ I think that we are more likely to be looping in the kglHandleMessage() while @ loop as we might spin if kglobscfix_callback() cannot acquire the mutex. @ RELEASE NOTES: @ ]] It was possible for LCKx to block other sessions when it was trying to @ ]] acquire handle and dependency mutexes. This problem has now been @ ]] corrected. @ ]] @ . @ INTERNAL PROBLEM DESCRIPTION: @ Under rare conditions LCKx might repeatedly execute the following while @ loop @ because it is unable to get a handle mutex in nowait mode under @ kglobscfix_callback(); @ . @ while (kglScanDependents(gp, hd, kglobscfix_callback, (void *)0)); @ . @ and as we immediately re-get the dependency mutex under @ kglScanDependents->kglScanReferences but fail to get the handle mutex under @ kglobscfix_callback() we can block other sessions that try to acquire the @ dependency mutex but can give up and retry (as is the case for CLMN/PMON). @ @ . @ INTERNAL FIX DESCRIPTION: @ The fix is to amend the while loop so that we enter a small wait which @ should @ then allow blocked processes the ability to acquire the relevant mutex.
bug描述与trace发现的信息几乎完全匹配。基本可以确定就是这个bug引起的故障。
数据库的patch信息为:
oracle@gapdb1:/home/oracle> opatch lspatches 28865544; 29123189; 27751755; 25163866; 23333567; 28790651;OJVM RELEASE UPDATE: 12.2.0.1.190115 (28790651) 28870605;OCW JAN 2019 RELEASE UPDATE 12.2.0.1.190115 (28870605) 28822515;Database Jan 2019 Release Update : 12.2.0.1.190115 (28822515)
而该bug将在12.2的202007这个ru中修复。
over!!!
发表回复