今天一大早某客户在vx群反馈其中一套系统早上上班时间出现了一段时间的hung,后面自动恢复了。从客户提供的ASH数据来看,问题相对清晰:
Slot Time (Duration) | Slot Count | Event | Event Count | % Event |
---|---|---|---|---|
09:26:01 (2.0 min) | 56,948 | gc freelist | 42,070 | 9.28 |
09:26:01 (2.0 min)
|
56948
|
read by other session | 9,985 | 2.20 |
09:26:01 (2.0 min)
|
56948
|
enq: HW – contention | 3,332 | 0.73 |
09:28:00 (2.0 min) | 59,045 | gc freelist | 43,374 | 9.56 |
09:28:00 (2.0 min)
|
59045
|
read by other session | 10,188 | 2.25 |
09:28:00 (2.0 min)
|
59045
|
enq: HW – contention | 3,773 | 0.83 |
09:30:00 (2.0 min) | 59,432 | gc freelist | 43,156 | 9.52 |
09:30:00 (2.0 min)
|
59432
|
read by other session | 9,253 | 2.04 |
09:30:00 (2.0 min)
|
59432
|
enq: HW – contention | 5,238 | 1.16 |
09:32:00 (2.0 min) | 59,812 | gc freelist | 43,037 | 9.49 |
09:32:00 (2.0 min)
|
59812
|
read by other session | 9,296 | 2.05 |
09:32:00 (2.0 min)
|
59812
|
enq: HW – contention | 5,449 | 1.20 |
09:34:00 (2.0 min) | 60,775 | gc freelist | 43,675 | 9.63 |
09:34:00 (2.0 min)
|
60775
|
read by other session | 9,619 | 2.12 |
09:34:00 (2.0 min)
|
60775
|
enq: HW – contention | 5,428 | 1.20 |
09:36:00 (2.0 min) | 63,090 | gc freelist | 44,403 | 9.79 |
09:36:00 (2.0 min)
|
63090
|
read by other session | 11,003 | 2.43 |
09:36:00 (2.0 min)
|
63090
|
enq: HW – contention | 5,520 | 1.22 |
09:38:00 (2.0 min) | 63,576 | gc freelist | 44,636 | 9.84 |
09:38:00 (2.0 min)
|
63576
|
read by other session | 11,286 | 2.49 |
09:38:00 (2.0 min)
|
63576
|
enq: HW – contention | 5,474 | 1.21 |
09:40:00 (1.0 min) | 30,798 | gc freelist | 20,947 | 4.62 |
09:40:00 (1.0 min)
|
30798
|
read by other session | 5,530 | 1.22 |
09:40:00 (1.0 min)
|
30798
|
enq: HW – contention | 2,688 | 0.59 |
之前写过一篇关于gc freelist等待的文章。实际上该等待事件并不多见,如果当你发现top event中出现该等待,那么多半是遇到了Bug。
我们回到这个案例本身,实际上从ASH数据来看,无非就是gc freelist,enq:HW – contention,read by other session. 从直观感觉我认为可能大概率出问题在gc freelist,从而导致整个集群都非常缓慢,近似于hung的状态。当然,当其中一个节点都出现了大量gc freelist,导致整个系统很慢的情况下,那么另外一个节点ash 势必也是大量的锁等待。
Slot Time (Duration) | Slot Count | Event | Event Count | % Event |
---|---|---|---|---|
09:22:36 (1.4 min) | 6,506 | enq: HW – contention | 2,859 | 3.02 |
09:22:36 (1.4 min)
|
6506
|
enq: TX – row lock contention | 1,380 | 1.46 |
09:22:36 (1.4 min)
|
6506
|
gc buffer busy acquire | 1,128 | 1.19 |
09:24:00 (2.0 min) | 10,361 | enq: HW – contention | 4,266 | 4.51 |
09:24:00 (2.0 min)
|
10361
|
enq: TX – index contention | 2,679 | 2.83 |
09:24:00 (2.0 min)
|
10361
|
enq: TX – row lock contention | 2,452 | 2.59 |
09:26:00 (2.0 min) | 11,352 | enq: HW – contention | 4,550 | 4.81 |
09:26:00 (2.0 min)
|
11352
|
enq: TX – index contention | 3,030 | 3.20 |
09:26:00 (2.0 min)
|
11352
|
enq: TX – row lock contention | 2,802 | 2.96 |
09:28:00 (2.0 min) | 12,415 | enq: HW – contention | 5,000 | 5.29 |
09:28:00 (2.0 min)
|
12415
|
enq: TX – index contention | 3,411 | 3.61 |
09:28:00 (2.0 min)
|
12415
|
enq: TX – row lock contention | 3,078 | 3.25 |
09:30:00 (2.0 min) | 13,300 | enq: HW – contention | 5,271 | 5.57 |
09:30:00 (2.0 min)
|
13300
|
enq: TX – index contention | 3,184 | 3.37 |
09:30:00 (2.0 min)
|
13300
|
enq: TX – row lock contention | 3,079 | 3.25 |
09:32:00 (2.0 min) | 13,638 | enq: HW – contention | 5,423 | 5.73 |
09:32:00 (2.0 min)
|
13638
|
enq: TX – row lock contention | 3,296 | 3.48 |
09:32:00 (2.0 min)
|
13638
|
gc buffer busy acquire | 3,216 | 3.40 |
09:34:00 (2.0 min) | 14,283 | enq: HW – contention | 5,647 | 5.97 |
09:34:00 (2.0 min)
|
14283
|
enq: TX – row lock contention | 3,678 | 3.89 |
09:34:00 (2.0 min)
|
14283
|
gc buffer busy acquire | 3,307 | 3.50 |
09:36:00 (1.7 min) | 12,751 | enq: HW – contention | 5,083 | 5.37 |
09:36:00 (1.7 min)
|
12751
|
enq: TX – row lock contention | 3,327 | 3.52 |
09:36:00 (1.7 min)
|
12751
|
gc buffer busy acquire | 3,108 | 3.29 |
因此这个时候看ash已经没有什么用了。从目前判断问题主要在节点2上,因此我来看下客户提供的节点2故障时间段内的一个diag trace:
SO: 0x3b02491fd8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3b02491fd8, name=process, file=ksu.h LINE:12721, pg=0 (process) Oracle pid:903, ser:142, calls cur/top: 0x27a2ded4f8/0x27a874fa70 flags : (0x0) - flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 184 0 2 last post received-location: ksl2.h LINE:2374 ID:kslpsr last process to post me: 0x39c2424aa0 1 6 last post sent: 0 0 183 last post sent-location: kcl2.h LINE:4436 ID:kclnfndnewm last process posted by me: 0x3b023aa048 1 6 (latch info) wait_event=184 bits=0x0 Process Group: DEFAULT, pseudo proc: 0x39c262fe68 O/S info: user: grid, term: UNKNOWN, ospid: 52182 OSD pid info: Unix process pid: 52182, image: oracle@xxxxxdb2 ---------------------------------------- SO: 0x3a0ca21218, type: 10, owner: 0x3b02491fd8, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x3b02491fd8, name=FileOpenBlock, file=ksfd.h LINE:6688, pg=0 (FOB) 0x3a0ca21218 flags=2560 fib=0x39cc6c67a8 incno=8 pending i/o cnt=0 fname=+DATA/ythdb/datafile/system.256.1048801851 fno=1 lblksz=8192 fsiz=651520 ---------------------------------------- SO: 0x39e29be138, type: 4, owner: 0x3b02491fd8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3b02491fd8, name=session, file=ksu.h LINE:12729, pg=0 (session) sid: 1254 ser: 26125 trans: 0x387e16b0b0, creator: 0x3b02491fd8 flags: (0x100041) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC DID: , short-term DID: txn branch: (nil) edition#: 0 oct: 2, prv: 0, sql: 0x3a9dc80b10, psql: 0x3a9dc80b10, user: 0/SYS ksuxds FALSE at location: 0 service name: ythdb client details: O/S info: user: root, term: unknown, ospid: 1234 machine: gfmis-pay-sendglsaccount-85b456d76c-zzctl program: JDBC Thin Client application name: JDBC Thin Client, hash value=2546894660 Current Wait Stack: 0: waiting for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4839 seq_num=4840 snap_id=1 wait times: snap=0.001568 sec, exc=0.001568 sec, total=0.001568 sec wait times: max=0.200000 sec, heur=16 min 8 sec wait counts: calls=1 os=1 in_wait=1 iflags=0x5a8 There are 64 sessions blocked by this session. Dumping one waiter: inst: 2, sid: 1289, ser: 25931 wait event: 'enq: HW - contention' p1: 'name|mode'=0x48570006 p2: 'table space #'=0x0 p3: 'block'=0x400978 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 min_blocked_time: 943 secs, waiter_cache_ver: 38771 Wait State: fixed_waits=0 flags=0x22 boundary=(nil)/-1 Session Wait History: elapsed time of 0.000100 sec since current wait 0: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4838 seq_num=4839 snap_id=1 wait times: snap=0.017854 sec, exc=0.017854 sec, total=0.017854 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000090 sec of elapsed time 1: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4837 seq_num=4838 snap_id=1 wait times: snap=0.200893 sec, exc=0.200893 sec, total=0.200893 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000105 sec of elapsed time 2: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4836 seq_num=4837 snap_id=1 wait times: snap=0.200923 sec, exc=0.200923 sec, total=0.200923 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000078 sec of elapsed time 3: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4835 seq_num=4836 snap_id=1 wait times: snap=0.200908 sec, exc=0.200908 sec, total=0.200908 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000092 sec of elapsed time 4: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4834 seq_num=4835 snap_id=1 wait times: snap=0.200906 sec, exc=0.200906 sec, total=0.200906 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000094 sec of elapsed time 5: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4833 seq_num=4834 snap_id=1 wait times: snap=0.200900 sec, exc=0.200900 sec, total=0.200900 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000092 sec of elapsed time 6: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4832 seq_num=4833 snap_id=1 wait times: snap=0.200918 sec, exc=0.200918 sec, total=0.200918 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000088 sec of elapsed time 7: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4831 seq_num=4832 snap_id=1 wait times: snap=0.200907 sec, exc=0.200907 sec, total=0.200907 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000094 sec of elapsed time 8: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4830 seq_num=4831 snap_id=1 wait times: snap=0.200883 sec, exc=0.200883 sec, total=0.200883 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000113 sec of elapsed time 9: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4829 seq_num=4830 snap_id=1 wait times: snap=0.200897 sec, exc=0.200897 sec, total=0.200897 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000103 sec of elapsed time
我们可以清楚的看到,实际上大量session被该会话所阻塞,而该session在等待gc freelist,而且已经等待了10多分钟。
进一步分析该session的trace,可以看到执行情况,发现跟审计有关:
User=0x39e29be138 Session=0x39e29be138 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=637acf85 LibraryHandle: Address=0x3a9dc80b10 Hash=367d06a6 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD ObjectName: Name=insert into sys.aud$( sessionid,entryid,statement,ntimestamp#, userid,userhost,terminal,action#,returncode, obj$creator,obj$name,auth$privileges,auth$grantee, new$owner,new$name,ses$actions,ses$tid,logoff$pread, logoff$lwrite,logoff$dead,comment$text,spare1,spare2, priv$used,clientid,sessioncpu,proxy$sid,user$guid, instance#,process#,xid,scn,auditid, sqlbind,sqltext,obj$edition,dbid) values(:1,:2,:3,SYS_EXTRACT_UTC(SYSTIMESTAMP), :4,:5,:6,:7,:8, :9,:10,:11,:12, :13,:14,:15,:16,:17, :18,:1 FullHashValue=4ba8a6dafb69204e4de1216b367d06a6 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=914163366 OwnerIdn=0 Statistics: InvalidationCount=0 ExecutionCount=299986 LoadCount=20 ActiveLocks=869 TotalLockCount=299437 TotalPinCount=2 Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=2 Version=0 BucketInUse=37261 HandleInUse=37261 HandleReferenceCount=0
从现象来看,跟Oracle mos文章 INSERT INTO SYS.AUD$ Is Blocking Due to Waiting on ‘gc freelist’ (Doc ID 2207798.1) 描述很像的。
同时也参考了几篇类似文章例如Unified And/Or Traditional Auditing Inserts Waiting On GC FREELIST (In ASSM Tablespace) (Doc ID 2596033.1)
总的来讲,建议是调整如下参数:_gc_element_percent 建议设置为140~200之间(默认值是110).
当然,根据本文这里的case,实际上也可以通过move aud$到其他表空间来解决或关闭审计来绕过这个问题。
发表回复