gc freelist问题再现

今天一大早某客户在vx群反馈其中一套系统早上上班时间出现了一段时间的hung,后面自动恢复了。从客户提供的ASH数据来看,问题相对清晰:

Slot Time (Duration) Slot Count Event Event Count % Event
09:26:01 (2.0 min) 56,948 gc freelist 42,070 9.28
read by other session 9,985 2.20
enq: HW – contention 3,332 0.73
09:28:00 (2.0 min) 59,045 gc freelist 43,374 9.56
read by other session 10,188 2.25
enq: HW – contention 3,773 0.83
09:30:00 (2.0 min) 59,432 gc freelist 43,156 9.52
read by other session 9,253 2.04
enq: HW – contention 5,238 1.16
09:32:00 (2.0 min) 59,812 gc freelist 43,037 9.49
read by other session 9,296 2.05
enq: HW – contention 5,449 1.20
09:34:00 (2.0 min) 60,775 gc freelist 43,675 9.63
read by other session 9,619 2.12
enq: HW – contention 5,428 1.20
09:36:00 (2.0 min) 63,090 gc freelist 44,403 9.79
read by other session 11,003 2.43
enq: HW – contention 5,520 1.22
09:38:00 (2.0 min) 63,576 gc freelist 44,636 9.84
read by other session 11,286 2.49
enq: HW – contention 5,474 1.21
09:40:00 (1.0 min) 30,798 gc freelist 20,947 4.62
read by other session 5,530 1.22
enq: HW – contention 2,688 0.59

之前写过一篇关于gc freelist等待的文章。实际上该等待事件并不多见,如果当你发现top event中出现该等待,那么多半是遇到了Bug。

我们回到这个案例本身,实际上从ASH数据来看,无非就是gc freelist,enq:HW – contention,read by other session. 从直观感觉我认为可能大概率出问题在gc freelist,从而导致整个集群都非常缓慢,近似于hung的状态。当然,当其中一个节点都出现了大量gc freelist,导致整个系统很慢的情况下,那么另外一个节点ash 势必也是大量的锁等待。

Slot Time (Duration) Slot Count Event Event Count % Event
09:22:36 (1.4 min) 6,506 enq: HW – contention 2,859 3.02
enq: TX – row lock contention 1,380 1.46
gc buffer busy acquire 1,128 1.19
09:24:00 (2.0 min) 10,361 enq: HW – contention 4,266 4.51
enq: TX – index contention 2,679 2.83
enq: TX – row lock contention 2,452 2.59
09:26:00 (2.0 min) 11,352 enq: HW – contention 4,550 4.81
enq: TX – index contention 3,030 3.20
enq: TX – row lock contention 2,802 2.96
09:28:00 (2.0 min) 12,415 enq: HW – contention 5,000 5.29
enq: TX – index contention 3,411 3.61
enq: TX – row lock contention 3,078 3.25
09:30:00 (2.0 min) 13,300 enq: HW – contention 5,271 5.57
enq: TX – index contention 3,184 3.37
enq: TX – row lock contention 3,079 3.25
09:32:00 (2.0 min) 13,638 enq: HW – contention 5,423 5.73
enq: TX – row lock contention 3,296 3.48
gc buffer busy acquire 3,216 3.40
09:34:00 (2.0 min) 14,283 enq: HW – contention 5,647 5.97
enq: TX – row lock contention 3,678 3.89
gc buffer busy acquire 3,307 3.50
09:36:00 (1.7 min) 12,751 enq: HW – contention 5,083 5.37
enq: TX – row lock contention 3,327 3.52
gc buffer busy acquire 3,108 3.29

因此这个时候看ash已经没有什么用了。从目前判断问题主要在节点2上,因此我来看下客户提供的节点2故障时间段内的一个diag trace:

  SO: 0x3b02491fd8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x3b02491fd8, name=process, file=ksu.h LINE:12721, pg=0
  (process) Oracle pid:903, ser:142, calls cur/top: 0x27a2ded4f8/0x27a874fa70
            flags : (0x0) -
            flags2: (0x0),  flags3: (0x10)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 184 0 2
              last post received-location: ksl2.h LINE:2374 ID:kslpsr
              last process to post me: 0x39c2424aa0 1 6
              last post sent: 0 0 183
              last post sent-location: kcl2.h LINE:4436 ID:kclnfndnewm
              last process posted by me: 0x3b023aa048 1 6
    (latch info) wait_event=184 bits=0x0
    Process Group: DEFAULT, pseudo proc: 0x39c262fe68
    O/S info: user: grid, term: UNKNOWN, ospid: 52182
    OSD pid info: Unix process pid: 52182, image: oracle@xxxxxdb2
    ----------------------------------------
    SO: 0x3a0ca21218, type: 10, owner: 0x3b02491fd8, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
     proc=0x3b02491fd8, name=FileOpenBlock, file=ksfd.h LINE:6688, pg=0
    (FOB) 0x3a0ca21218 flags=2560 fib=0x39cc6c67a8 incno=8 pending i/o cnt=0
     fname=+DATA/ythdb/datafile/system.256.1048801851
     fno=1 lblksz=8192 fsiz=651520
    ----------------------------------------
    SO: 0x39e29be138, type: 4, owner: 0x3b02491fd8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x3b02491fd8, name=session, file=ksu.h LINE:12729, pg=0
    (session) sid: 1254 ser: 26125 trans: 0x387e16b0b0, creator: 0x3b02491fd8
              flags: (0x100041) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x40009) -/-/INC
              DID: , short-term DID:
              txn branch: (nil)
              edition#: 0              oct: 2, prv: 0, sql: 0x3a9dc80b10, psql: 0x3a9dc80b10, user: 0/SYS
    ksuxds FALSE at location: 0
    service name: ythdb
    client details:
      O/S info: user: root, term: unknown, ospid: 1234
      machine: gfmis-pay-sendglsaccount-85b456d76c-zzctl program: JDBC Thin Client
      application name: JDBC Thin Client, hash value=2546894660
    Current Wait Stack:
     0: waiting for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4839 seq_num=4840 snap_id=1
        wait times: snap=0.001568 sec, exc=0.001568 sec, total=0.001568 sec
        wait times: max=0.200000 sec, heur=16 min 8 sec
        wait counts: calls=1 os=1
        in_wait=1 iflags=0x5a8
    There are 64 sessions blocked by this session.
    Dumping one waiter:
      inst: 2, sid: 1289, ser: 25931
      wait event: 'enq: HW - contention'
        p1: 'name|mode'=0x48570006
        p2: 'table space #'=0x0
        p3: 'block'=0x400978
      row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
      min_blocked_time: 943 secs, waiter_cache_ver: 38771
    Wait State:
      fixed_waits=0 flags=0x22 boundary=(nil)/-1
    Session Wait History:
        elapsed time of 0.000100 sec since current wait
     0: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4838 seq_num=4839 snap_id=1
        wait times: snap=0.017854 sec, exc=0.017854 sec, total=0.017854 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000090 sec of elapsed time
     1: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4837 seq_num=4838 snap_id=1
        wait times: snap=0.200893 sec, exc=0.200893 sec, total=0.200893 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000105 sec of elapsed time
     2: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4836 seq_num=4837 snap_id=1
        wait times: snap=0.200923 sec, exc=0.200923 sec, total=0.200923 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000078 sec of elapsed time
     3: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4835 seq_num=4836 snap_id=1
        wait times: snap=0.200908 sec, exc=0.200908 sec, total=0.200908 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000092 sec of elapsed time
     4: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4834 seq_num=4835 snap_id=1
        wait times: snap=0.200906 sec, exc=0.200906 sec, total=0.200906 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000094 sec of elapsed time
     5: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4833 seq_num=4834 snap_id=1
        wait times: snap=0.200900 sec, exc=0.200900 sec, total=0.200900 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000092 sec of elapsed time
     6: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4832 seq_num=4833 snap_id=1
        wait times: snap=0.200918 sec, exc=0.200918 sec, total=0.200918 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000088 sec of elapsed time
     7: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4831 seq_num=4832 snap_id=1
        wait times: snap=0.200907 sec, exc=0.200907 sec, total=0.200907 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000094 sec of elapsed time
     8: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4830 seq_num=4831 snap_id=1
        wait times: snap=0.200883 sec, exc=0.200883 sec, total=0.200883 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000113 sec of elapsed time
     9: waited for 'gc freelist'
        =0x0, =0x0, =0x0
        wait_id=4829 seq_num=4830 snap_id=1
        wait times: snap=0.200897 sec, exc=0.200897 sec, total=0.200897 sec
        wait times: max=0.200000 sec
        wait counts: calls=1 os=1
        occurred after 0.000103 sec of elapsed time

我们可以清楚的看到,实际上大量session被该会话所阻塞,而该session在等待gc freelist,而且已经等待了10多分钟。

进一步分析该session的trace,可以看到执行情况,发现跟审计有关:

User=0x39e29be138 Session=0x39e29be138 ReferenceCount=1
Flags=CNB/[0001] SavepointNum=637acf85
      LibraryHandle:  Address=0x3a9dc80b10 Hash=367d06a6 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
        ObjectName:  Name=insert into sys.aud$( sessionid,entryid,statement,ntimestamp#, userid,userhost,terminal,action#,returncode, obj$creator,obj$name,auth$privileges,auth$grantee, new$owner,new$name,ses$actions,ses$tid,logoff$pread, logoff$lwrite,logoff$dead,comment$text,spare1,spare2,  priv$used,clientid,sessioncpu,proxy$sid,user$guid, instance#,process#,xid,scn,auditid,  sqlbind,sqltext,obj$edition,dbid)  values(:1,:2,:3,SYS_EXTRACT_UTC(SYSTIMESTAMP),     :4,:5,:6,:7,:8,     :9,:10,:11,:12,     :13,:14,:15,:16,:17,     :18,:1
          FullHashValue=4ba8a6dafb69204e4de1216b367d06a6 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=914163366 OwnerIdn=0
        Statistics:  InvalidationCount=0 ExecutionCount=299986 LoadCount=20 ActiveLocks=869 TotalLockCount=299437 TotalPinCount=2
        Counters:  BrokenCount=1 RevocablePointer=1 KeepDependency=2 Version=0 BucketInUse=37261 HandleInUse=37261 HandleReferenceCount=0

从现象来看,跟Oracle mos文章 INSERT INTO SYS.AUD$ Is Blocking Due to Waiting on ‘gc freelist’ (Doc ID 2207798.1) 描述很像的。

同时也参考了几篇类似文章例如Unified And/Or Traditional Auditing Inserts Waiting On GC FREELIST (In ASSM Tablespace) (Doc ID 2596033.1)

总的来讲,建议是调整如下参数:_gc_element_percent  建议设置为140~200之间(默认值是110).

当然,根据本文这里的case,实际上也可以通过move aud$到其他表空间来解决或关闭审计来绕过这个问题。


评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注