案例:troubleshooting LCK PROCESS HANG CAUSING ENTIRE DATABASE TO HANG

本案例来自西区某客户,数据库版本为12.2,6节点rac,当时应用反应异常缓慢。十万火急,需要立马解决。当我接手时,故障已经持续了3个小时了。

当时查询gv$session时,发现大量会话的final blocking session都是6节点的LCK1进程,该进程等待事件为libcache interrupt action by LCK。由于时间紧急,果断的abort掉了6节点的实例,数据库立马恢复了正常。

当分析原因的时候,发现当时数据库hang得非常严重,连ash和awr都没有产生。但是从diag/dia0/lmhb trace中可以发现一些有价值的信息。

首先LCK是rac非常核心的进程,用于同步GES层面的一些实例级别的lock信息,比如row cache、library cache。而lmbh正好会监控这些核心rac核心进程。

从lmhb trace可以看到:

*** 2022-07-31T21:10:01.053896+08:00 (CDB$ROOT(1))
==============================
LCK1 (ospid: 52025) has not moved for 14 sec (1659273000.1659272986)
...
...
*** 2022-07-31T23:25:32.096842+08:00 (CDB$ROOT(1))
==============================
LCK1 (ospid: 52025) has not moved for 8146 sec (1659281132.1659272986)

LCK1进程已经hang住了,并且持续了很久。并且lmhb中还记录了LCK1的process信息,包含了call stack。

*** 2022-07-31T21:10:11.326435+08:00 (CDB$ROOT(1))
==============================
LCK1 (ospid: 52025) has not moved for 24 sec (1659273010.1659272986)
  : heartbeat check status 2 (acceptable) (threshold 70 sec)
  : heartbeat poke time 0x62e67f24 req 0x120ca4 ack 0x120ca3 freq 10
  : heartbeat state 0x41.02 (libcache) pso-flag 0x0
  : waiting for event 'libcache interrupt action by LCK' for 19 secs with wait_id 841902246.
  ===[ Wait Chain ]===
  Wait chain is empty.
  ==============================
  Dumping PROCESS LCK1 (ospid: 52025) States
  ==============================
  ===[ Callstack ]===

*** 2022-07-31T21:10:11.326637+08:00 (CDB$ROOT(1))
Process diagnostic dump for oracle@gapdb4 (LCK1), OS id=52025,
pid: 31, proc_ser: 1, sid: 5829, sess_ser: 13529
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
  0.000000 secs at [ 21:10:10 ]
    NOTE: scheduling delay has not been sampled for 0.807660 secs
  0.000000 secs from [ 21:10:06 - 21:10:11 ], 5 sec avg
  0.000000 secs from [ 21:09:12 - 21:10:11 ], 1 min avg
  0.000000 secs from [ 21:05:11 - 21:10:11 ], 5 min avg

*** 2022-07-31T21:10:13.735792+08:00 (CDB$ROOT(1))
loadavg : 8.68 7.66 7.18
System user time: 0.11 sys time: 0.01 context switch: 89486
Memory (Avail / Total) = 11748.10M / 515601.66M
Swap (Avail / Total) = 49147.25M /  49152.00M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 R oracle   52025     1  2  80   0 - 51757151 -   Mar16 ?        2-18:57:53 ora_lck1_paascdba6
Name:	ora_lck1_paascd
State:	R (running)
Tgid:	52025
Ngid:	0
Pid:	52025
PPid:	1
TracerPid:	0
Uid:	1100	1100	1100	1100
Gid:	1101	1100	1100	1100
FDSize:	64
Groups:	1100 1101
NStgid:	52025
NSpid:	52025
NSpgid:	52025
NSsid:	52025
VmPeak:	207028604 kB
VmSize:	207028604 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	   67848 kB
VmRSS:	   67784 kB
VmData:	    6264 kB
VmStk:	     152 kB
VmExe:	  343608 kB
VmLib:	   33008 kB
VmPTE:	     568 kB
VmPMD:	     168 kB
VmSwap:	       0 kB
Threads:	1
SigQ:	0/2062272
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000016400207
SigCgt:	00000003c9887cf8
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000003fffffffff
Seccomp:	0
Speculation_Store_Bypass:	thread vulnerable
Cpus_allowed:	ffff,ffffffff,ffffffff
Cpus_allowed_list:	0-79
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000f
Mems_allowed_list:	0-3
voluntary_ctxt_switches:	4999124562
nonvoluntary_ctxt_switches:	52927652
Short stack dump:
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()+513<-ksbcti()+247<-ksbabs()+551<-ksbrdp()+1079<-opirip()+609<-opidrv()+602<-sou2o()+145<-opimai_real()+202<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245

从loadavg,包括当时cpu使用率,内存使用率来看,当时负载其实很低,并且LCK1进程处于R状态。call stack为:kglobscfix_callback()+64<-kglScanReferences()+222<-kglHandleMessage()+118<-kqlmbivg_cbk()+1647<-kqlmbivg()+391<-kqlmba()。负载很低的情况下,LCK1 hang住那么很可能是一个bug。

分析diag trace的systemtate dump,可以找到LCK1进程的session state dump信息。

    SO: 0x9e4987468, type: 4, owner: 0xa03c72a88, flag: INIT/-/-/-/0x00 i f: 0x3 c: 0x3
proc=0xa03c72a88, name=session, file=ksu.h LINE:15737, pg=0, conuid=1
SGA version=(1,0)
(session) sid: 5829 ser: 13529 trans: (nil), creator: 0xa03c72a88
flags: (0x51) USR/- flags2: (0x409) -/-/INC
flags_idl: (0x1) status: BSY/-/-/- kill: -/-/-/-
0000-0000-000000000000-0000-00000000              DID: , short-term DID:
txn branch: (nil)
con_id/con_uid/con_name: 1/1/CDB$ROOT
con_logonuid: 1 con_logonid: 1
con_scuid: 1 con_scid: 1
edition#: 0              user#/name: 0/SYS
oct: 0, prv: 0, sql: (nil), psql: (nil)
stats: 0x97fe0b988, PX stats: 0x110e9744
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902246 seq_num=14388 snap_id=1
wait times: snap=16 min 2 sec, exc=16 min 2 sec, total=16 min 2 sec
wait times: max=infinite, heur=16 min 2 sec
wait counts: calls=0 os=0
in_wait=1 iflags=0x5a2
There are 565 sessions blocked by this session.
Dumping one waiter:
inst: 1, sid: 13726, ser: 15144
wait event: 'enq: IV -  contention'
p1: 'type|mode'=0x49560005
p2: 'id1'=0x53594e43
p3: 'id2'=0x80
row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
min_blocked_time: 949 secs, waiter_cache_ver: 51635
Wait State:
fixed_waits=0 flags=0x22 boundary=(nil)/-1
Session Wait History:
elapsed time of 0.000002 sec since current wait
0: waited for 'rdbms ipc message'
timeout=0xfd, =0x0, =0x0
wait_id=841902245 seq_num=14387 snap_id=1
wait times: snap=0.000318 sec, exc=0.000318 sec, total=0.000318 sec
wait times: max=2.530000 sec
wait counts: calls=1 os=1
occurred after 0.000001 sec of elapsed time
1: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902244 seq_num=14386 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000002 sec of elapsed time
2: waited for 'rdbms ipc message'
timeout=0xfd, =0x0, =0x0
wait_id=841902243 seq_num=14385 snap_id=1
wait times: snap=0.000146 sec, exc=0.000146 sec, total=0.000146 sec
wait times: max=2.530000 sec
wait counts: calls=1 os=1
occurred after 0.000002 sec of elapsed time
3: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902241 seq_num=14384 snap_id=2
wait times: snap=0.000048 sec, exc=0.000079 sec, total=0.000363 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000000 sec of elapsed time
4: waited for 'enq: IV -  contention'
type|mode=0x49560003, id1=0x53594e43, id2=0x80
wait_id=841902242 seq_num=14383 snap_id=1
wait times: snap=0.000284 sec, exc=0.000284 sec, total=0.000284 sec
wait times: max=infinite
wait counts: calls=3 os=3
occurred after 0.000000 sec of elapsed time
5: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902241 seq_num=14382 snap_id=1
wait times: snap=0.000031 sec, exc=0.000031 sec, total=0.000031 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000002 sec of elapsed time
6: waited for 'rdbms ipc message'
timeout=0xfd, =0x0, =0x0
wait_id=841902240 seq_num=14381 snap_id=1
wait times: snap=0.000436 sec, exc=0.000436 sec, total=0.000436 sec
wait times: max=2.530000 sec
wait counts: calls=1 os=1
occurred after 0.000001 sec of elapsed time
7: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902239 seq_num=14380 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000002 sec of elapsed time
8: waited for 'rdbms ipc message'
timeout=0xfd, =0x0, =0x0
wait_id=841902238 seq_num=14379 snap_id=1
wait times: snap=0.000097 sec, exc=0.000097 sec, total=0.000097 sec
wait times: max=2.530000 sec
wait counts: calls=1 os=1
occurred after 0.000001 sec of elapsed time
9: waited for 'libcache interrupt action by LCK'
location=0x1, =0x0, =0x0
wait_id=841902237 seq_num=14378 snap_id=1
wait times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec
wait times: max=infinite
wait counts: calls=0 os=0
occurred after 0.000002 sec of elapsed time
Sampled Session History of session 5829 serial 13529
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples  in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[120 samples,                                        21:23:54 - 21:25:54]
waited for 'libcache interrupt action by LCK', seq_num: 14388
p1: 'location'=0x1
p2: ''=0x0
p3: ''=0x0
time_waited: >= 120 sec (still in wait)
---------------------------------------------------
Sampled Session History Summary:
longest_non_idle_wait: 'libcache interrupt action by LCK'
[120 samples, 21:23:54 - 21:25:54]
time_waited: >= 120 sec (still in wait)
---------------------------------------------------
----------------------------------------
kpoxc: mode=0 static=0 inReq=0 pendErrInRep=0 callUseSCN=0 nonReq=0
kpoxc:  trace=0 roundTrip=0 errCode=0 ctl=0 eocs=0 disErr=0 dscnEnabled=0
temporary object counter: 0
last pdbid to increment: 0, last function to increment: (null)
----------------------------------------
Virtual Thread:
kgskvt: 0x9e9369820, sess: 0x9e4987468/1 sid: 5829 ser: 13529
vc: (nil), proc: 0xa03c72a88, id: 5829
consumer group cur: _ORACLE_BACKGROUND_GROUP_/1 (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:
vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1
inwait: 0, short wait event: 0 posted_run: 0
lastmodrngcnt: 0, lastmodrngcnt_loc: '(null)'
lastmodrblcnt: 0, lastmodrblcnt_loc: '(null)'
location where insched last set: kgskthrrun
location where insched last cleared: kgskthrrun1
location where inwait last set: NULL
location where inwait last cleared: NULL
is_assigned: 0, in_sched: 0 (0)
qcls: (nil), qlink: FALSE
vt_active: 0 (pending: 0)
vt_pq_active: 0, dop: 0, pq_servers (cur: 0 cg: 0)
ps_allocs: 0, pxstmts (act: 0, done: 0 cg: 0)
used quanta (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
exec start consumed time: 0
exec start elapsed time: 0
idle time: 0, active time: 0 (cg: 0)
last updnumps: 0, active time (pq: 0 ps: 0)
cpu yields:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu waits:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu wait time (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
ASL queued time outs: 0, time: 0 (cur 0, cg 0)
PQQ queued time outs: 0, time: 0 (cur 0, cg 0)
Queue timeout violation: 0
calls aborted: 0, num est exec limit hit: 0
undo current: 0k max: 0k
KTU Session Commit Cache Dump for IDLs:
KTU Session Commit Cache Dump for Non-IDLs:
----------------------------------------
KKS-UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 0x85f5c8388
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Cursor Pin uid 5829 efd 7 whr 10 slp 0
KGX Atomic Operation Log 0x85f5c83d8
Mutex (nil)(0, 0) idn 0 oper NONE(0)
FSO mutex uid 5829 efd 0 whr 0 slp 0
KGX Atomic Operation Log 0x85f5c8428
Mutex (nil)(0, 0) idn 0 oper NONE(0)
FSO mutex uid 5829 efd 0 whr 0 slp 0
KGX Atomic Operation Log 0x85f5c8478
Mutex (nil)(0, 0) idn 0 oper NONE(0)
FSO mutex uid 5829 efd 0 whr 0 slp 0
----------------------------------------
KGL-UOL SO Cache(total=0, free=0)
KGX Atomic Operation Log 0x83ff5cd70
Mutex 0x6fec4aa30(5829, 0) idn ccf13f6f oper EXCL(6)
Library Cache uid 5829 efd 4 whr 148 slp 0
oper=0 pt1=0x6fec4a8e0 pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
Could not acquire mutex...Returning without dumping
KGX Atomic Operation Log 0x83ff5cdd8
Mutex 0x6fec4a990(5829, 0) idn 35a7b2c oper EXCL(6)
Library Cache uid 5829 efd 4 whr 101 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5ce40
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 65535 whr 134 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5cea8
Mutex 0x390c8e228(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 7 whr 163 slp 0
oper=0 pt1=0x390c8e0d8 pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5cf10
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5cf78
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5cfe0
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5d048
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000
KGX Atomic Operation Log 0x83ff5d0b0
Mutex (nil)(0, 0) idn 0 oper NONE(0)
Library Cache uid 5829 efd 0 whr 0 slp 0
oper=0 pt1=(nil) pt2=(nil) pt3=(nil)
pt4=(nil) pt5=(nil) ub4=0 flg=0x0
msk=0000-0000-0000-0000-0000

可以看到当时LCK1已经以EXCL模式持有了两个mutex,还有一句Could not acquire mutex…Returning without dumping 。没见过,看着像还在请求其他mutex,并且失败。

搜索mos发现与Bug 30384121 – lck process hang causing entire database to hang (Doc ID 30384121.8)非常匹配,包括关键的call stack信息。

@ I think that we are more likely to be looping in the kglHandleMessage() while
@ loop as we might spin if kglobscfix_callback() cannot acquire the mutex.
@ RELEASE NOTES:
@ ]] It was possible for LCKx to block other sessions when it was trying to
@ ]] acquire handle and dependency mutexes. This problem has now been
@ ]] corrected.
@ ]]
@ .
@ INTERNAL PROBLEM DESCRIPTION:
@ Under rare conditions LCKx might repeatedly execute the following while
@ loop
@ because it is unable to get a handle mutex in nowait mode under
@ kglobscfix_callback();
@ .
@ while (kglScanDependents(gp, hd, kglobscfix_callback, (void *)0));
@ .
@ and as we immediately re-get the dependency mutex under
@ kglScanDependents->kglScanReferences but fail to get the handle mutex under
@ kglobscfix_callback() we can block other sessions that try to acquire the
@ dependency mutex but can give up and retry (as is the case for CLMN/PMON).
@
@ .
@ INTERNAL FIX DESCRIPTION:
@ The fix is to amend the while loop so that we enter a small wait which
@ should
@ then allow blocked processes the ability to acquire the relevant mutex.

bug描述与trace发现的信息几乎完全匹配。基本可以确定就是这个bug引起的故障。

数据库的patch信息为:

oracle@gapdb1:/home/oracle> opatch lspatches
28865544;
29123189;
27751755;
25163866;
23333567;
28790651;OJVM RELEASE UPDATE: 12.2.0.1.190115 (28790651)
28870605;OCW JAN 2019 RELEASE UPDATE 12.2.0.1.190115 (28870605)
28822515;Database Jan 2019 Release Update : 12.2.0.1.190115 (28822515)

而该bug将在12.2的202007这个ru中修复。

over!!!

 

 

 

 


评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注