database crash with ora-00494
群论坛一个网友的问题,原帖地址:
http://databasefans.com/read.php?tid=273
###### alert 如下 ######
Thu Mar 15 22:59:38 2012
Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594'
Thu Mar 15 22:59:40 2012
System State dumped to trace file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc
Thu Mar 15 22:59:41 2012
Thread 1 cannot allocate new log, sequence 167905
Checkpoint not complete
Current log# 5 seq# 167904 mem# 0: /data02/oradata/cn3fsa/redo5a.rdo
Current log# 5 seq# 167904 mem# 1: /data03/oradata/cn3fsa/redo5b.rdo
Thu Mar 15 22:59:47 2012
Thread 1 advanced to log sequence 167905 (LGWR switch)
Current log# 7 seq# 167905 mem# 0: /data04/oradata/cn3fsa/redo7a.rdo
Current log# 7 seq# 167905 mem# 1: /data01/oradata/cn3fsa/redo7b.rdo
Thu Mar 15 22:59:48 2012
Killing enqueue blocker (pid=524594) on resource CF-00000000-00000000
by killing session 550.1
Thu Mar 15 22:59:50 2012
Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_ckpt_524594.trc:
ORA-00028: your session has been killed
Thu Mar 15 22:59:50 2012
CKPT: terminating instance due to error 28
Instance terminated by CKPT, pid = 524594
Fri Mar 16 08:50:50 2012
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned off.
IMODE=BR
ILAT =61
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
processes = 500
sessions = 555
从上面错误,我们可以判断是arch1无法获得CF锁,而阻塞者 524594 很可能是后台核心进行,以至于 被kill后导致实例crash了。 这里需要说明一下的是该机制是这样的: 当进行需要获取CF enqueue时,最大等待时间为900s,超过900s时,将强制kill阻塞进程。通过如下隐含 参数控制这个timeout.
SQL> show parameter controlfile_enq
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_controlfile_enqueue_dump boolean FALSE
_controlfile_enqueue_holding_time integer 120
_controlfile_enqueue_timeout integer 900
_kill_controlfile_enqueue_blocker boolean TRUE
下面我们来看看进程 524594 是什么进程? 通过trace我们可以发现,如下:
Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT)
我们可以发现,原来是ckpt进程被干掉了,导致数据库crash了,那么我们再进一步分析, 是什么进程kill ckpt进程了呢?从上面alert log来看,error是写到arch1 trace里面的, 所以这里我想应该是跟lgwr进程有关系。 从网友提供的trace来看,其中lgwr trace都是如下信息:
*** 2012-03-15 22:53:53.488
Warning: log write time 600ms, size 7893KB
*** 2012-03-15 22:54:00.669
Warning: log write time 540ms, size 7362KB
*** 2012-03-15 22:54:18.592
Warning: log write time 640ms, size 4040KB
*** 2012-03-15 22:54:19.182
Warning: log write time 590ms, size 7412KB
*** 2012-03-15 22:54:30.110
Warning: log write time 910ms, size 538KB
*** 2012-03-15 22:54:36.899
Warning: log write time 550ms, size 7595KB
*** 2012-03-15 22:54:48.679
Warning: log write time 720ms, size 10229KB
*** 2012-03-15 22:54:55.736
Warning: log write time 510ms, size 7491KB
*** 2012-03-15 22:55:07.782
Warning: log write time 830ms, size 4178KB
说明lgwr在不停写,且在等待。 从另外一个trace里面也可以看到如下信息:
*** 2012-03-15 22:43:48.114
Waited for detached process: CKPT for 310 seconds:
*** 2012-03-15 22:43:48.114
Dumping diagnostic information for CKPT:
OS pid = 524594
*** 2012-03-15 22:44:00.687
Waited for detached process: CKPT for 320 seconds:
*** 2012-03-15 22:44:00.687
Dumping diagnostic information for CKPT:
OS pid = 524594
我们知道检查点也是在不停的工作的,如果logfile切换频繁很可能会导致这个情况。 那么最后我们再来看看ckpt进程的状态信息:
*** 2012-03-15 22:59:38.881
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=524594 sid=550 sser=1 time_held=674 secs (ges mode req=6 held=4)
DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER
possible owner[7.524594] on resource CF-00000000-00000000
Dumping process info of pid[7.524594] requested by pid[17.299280]
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594'
-------------------------------------------------------------------------------
ENQUEUE [CF] HELD FOR TOO LONG
enqueue holder: 'inst 1, osid 524594'
Process 'inst 1, osid 524594' is holding an enqueue for maximum allowed time.
The process will be terminated.
Oracle Support Services triaging information: to find the root-cause, look
at the call stack of process 'inst 1, osid 524594' located below. Ask the
developer that owns the first NON-service layer in the stack to investigate.
Common service layers are enqueues (ksq), latches (ksl), library cache
pins and locks (kgl), and row cache locks (kqr).
Dumping process 7.524594 info:
*** 2012-03-15 22:59:38.968
Dumping diagnostic information for ospid 524594:
OS pid = 524594
loadavg : 1.38 1.97 2.29
swap info: free_mem = 34.10M rsv = 64.00M
alloc = 7237.58M avail = 16384.00M swap_free = 9146.42M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
240001 T oracle 524594 1 0 60 20 c1d0d590 108076 Mar 12 - 1:35 ora_ckpt_cn3fsa
open: Permission denied
524594: ora_ckpt_cn3fsa
0x000000010025d950 ksfdaio1(??, ??, ??) + 0x330
0x000000010022c988 kcflbi(??, ??, ??, ??, ??, ??, ??, ??) + 0x6a8
0x00000001010f0970 kcbldio(??, ??, ??) + 0x1eb0
0x00000001010f6468 kcblsltio(??, ??, ??, ??) + 0x548
0x00000001010e3b50 krhahra(??) + 0x310
0x00000001010e711c krhahini(??, ??, ??, ??) + 0x57c
0x00000001018fc4fc kcvcfsi(??, ??, ??, ??, ??, ??) + 0x31c
0x00000001018e1744 kcvucpb(??, ??, ??, ??, ??, ??, ??, ??) + 0x1a4
0x00000001018fa344 kcvcca(??, ??) + 0x104
0x000000010015e728 ksbcti(??, ??, ??) + 0x3c8
0x00000001001694fc ksbabs(??) + 0x25c
0x0000000100166554 ksbrdp() + 0x4b4
0x000000010430c09c opirip(??, ??, ??) + 0x3fc
0x0000000102d9a598 opidrv(??, ??, ??) + 0x458
0x000000010370b0b0 sou2o(??, ??, ??, ??) + 0x90
0x0000000100000870 opimai_real(??, ??) + 0x150
0x00000001000006d8 main(??, ??) + 0x98
0x0000000100000368 __start() + 0x98
*** 2012-03-15 22:59:40.936
----------------------------------------
SO: 70000024c3b20e0, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=7, calls cur/top: 70000024f5da8f0/70000024f5da8f0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 114
last post received-location: kcbbza
last process to post me: 70000024c3b1900 1 6
last post sent: 0 0 9
last post sent-location: ksqrcl
last process posted by me: 70000024c3b9700 106 2
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
O/S info: user: oracle, term: UNKNOWN, ospid: 524594
OSD pid info: Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT)
SO: 70000024f5934a0, type: 4, owner: 70000024c3b20e0, flag: INIT/-/-/0x00
(session) sid: 550 trans: 0, creator: 70000024c3b20e0, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0007-00000005, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
last wait for 'direct path write' blocking sess=0x0 seq=6421 wait_time=1 seconds since wait started=0
file number=33, first dba=1, block cnt=1
Dumping Session Wait History
for 'direct path write' count=1 wait_time=1
file number=33, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=1
file number=34, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=0
file number=35, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=0
file number=36, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=0
file number=37, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=1
file number=38, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=0
file number=39, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=0
file number=3a, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=1
file number=3b, first dba=1, block cnt=1
for 'direct path write' count=1 wait_time=1
file number=3c, first dba=1, block cnt=1
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0 seq=6432 wait_time=0 seconds since wait started=0
timeout=11c, =0, =0
Dumping Session Wait History
for 'rdbms ipc message' count=1 wait_time=52
timeout=11d, =0, =0
for 'rdbms ipc message' count=1 wait_time=150428
timeout=12c, =0, =0
for 'log buffer space' count=1 wait_time=4
=0, =0, =0
for 'control file sequential read' count=1 wait_time=18
file#=0, block#=3, blocks=1
for 'control file parallel write' count=1 wait_time=32552
files=3, block#=3, requests=3
for 'control file sequential read' count=1 wait_time=13
file#=0, block#=1, blocks=1
for 'control file parallel write' count=1 wait_time=14208
files=3, block#=1, requests=3
for 'control file parallel write' count=1 wait_time=16303
files=3, block#=f, requests=3
for 'control file parallel write' count=1 wait_time=13450
files=3, block#=11, requests=3
for 'control file parallel write' count=1 wait_time=22983
files=3, block#=1a, requests=3
----------------------------------------
SO: 70000024f682f80, type: 5, owner: 70000024f5da8f0, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0000-0007-00000004
lv: 00 60 48 5e 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x70000024c605120, mode: SS, lock_flag: 0x0
own: 0x70000024f5934a0, sess: 0x70000024f5934a0, proc: 0x70000024c3b20e0, prv: 0x70000024c605130
----------------------------------------
我们接着来看看arc1 进程的信息:
PROCESS 17:
----------------------------------------
SO: 70000024c3b4840, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=17, calls cur/top: 70000024f5dc778/70000024f5dc778, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 9
last post received-location: ksqrcl
last process to post me: 70000024f3ab6a0 1 2
last post sent: 0 0 9
last post sent-location: ksqrcl
last process posted by me: 70000024f3ab6a0 1 2
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
O/S info: user: oracle, term: UNKNOWN, ospid: 299280
OSD pid info: Unix process pid: 299280, image: oracle@lgcenjfsa (ARC1)
SO: 70000024c59e2f8, type: 4, owner: 70000024c3b4840, flag: INIT/-/-/0x00
(session) sid: 541 trans: 0, creator: 70000024c3b4840, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0011-00000002, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
last wait for 'enq: CF - contention' blocking sess=0x0 seq=28186 wait_time=2929713 seconds since wait started=903
name|mode=43460004, 0=0, operation=0
Dumping Session Wait History
for 'enq: CF - contention' count=1 wait_time=2929713 <== 这里的waitime其实是个累积值。
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929711
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929705
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929707
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929709
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929714
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929711
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929712
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929706
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929711
name|mode=43460004, 0=0, operation=0
----------------------------------------
SO: 70000024f5dc778, type: 3, owner: 70000024c3b4840, flag: INIT/-/-/0x00
(call) sess: cur 70000024c59e2f8, rec 70000024c56b588, usr 70000024c59e2f8; depth: 0
----------------------------------------
SO: 70000024f698a10, type: 5, owner: 70000024f5dc778, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0011-00000002
lv: 00 60 48 5e 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x70000024c605120, lock_flag: 0x0
own: 0x70000024c59e2f8, sess: 0x70000024c59e2f8, prv: 0x70000024f698a40
这里我们可以看到 0x70000024c605120 正是ckpt持有的resouce。 最后我们来看看lgwr进程信息:
PROCESS 6:
----------------------------------------
SO: 70000024f3a8f40, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=6, calls cur/top: 70000024f5da630/70000024f5da630, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: 70000024c3b1900 1 6
last post sent: 109 0 4
last post sent-location: kslpsr
last process posted by me: 70000024f3af5a0 158 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
O/S info: user: oracle, term: UNKNOWN, ospid: 413940
OSD pid info: Unix process pid: 413940, image: oracle@lgcenjfsa (LGWR)
----------------------------------------
SO: 70000024f6831e0, type: 5, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
(enqueue) RT-00000001-00000000 DID: 0000-0006-00000005
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x70000024c605450, mode: X, lock_flag: 0x0
own: 0x70000024c5a4e00, sess: 0x70000024c5a4e00, proc: 0x70000024f3a8f40, prv: 0x70000024c605460
(FOB) flags=2 fib=7000002487ef290 incno=0 pending i/o cnt=0
fname=/data04/oradata/cn3fsa/ctlcn3fsa3.ctl
fno=2 lblksz=16384 fsiz=1204
(FOB) flags=2 fib=7000002487eeef0 incno=0 pending i/o cnt=0
fname=/data03/oradata/cn3fsa/ctlcn3fsa2.ctl
fno=1 lblksz=16384 fsiz=1204
(FOB) flags=2 fib=7000002487eeb50 incno=0 pending i/o cnt=0
fname=/data02/oradata/cn3fsa/ctlcn3fsa1.ctl
fno=0 lblksz=16384 fsiz=1204
----------------------------------------
SO: 70000024c6c2338, type: 11, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 70000024f3a8f40,
event: 5, last message event: 13,
last message waited event: 13, messages read: 1
channel: (70000024c6e7790) scumnt mount lock
scope: 1, event: 15, last mesage event: 13,
publishers/subscribers: 0/13,
messages published: 1
----------------------------------------
SO: 70000024c5a4e00, type: 4, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
(session) sid: 551 trans: 0, creator: 70000024f3a8f40, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0006-00000006, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0 seq=36384 wait_time=0 seconds since wait started=0
timeout=129, =0, =0
Dumping Session Wait History
for 'control file sequential read' count=1 wait_time=4
file#=0, block#=16, blocks=1
for 'control file sequential read' count=1 wait_time=4
file#=0, block#=13, blocks=1
for 'control file sequential read' count=1 wait_time=4
file#=0, block#=12, blocks=1
for 'control file sequential read' count=1 wait_time=4
file#=0, block#=10, blocks=1
for 'control file sequential read' count=1 wait_time=5
file#=2, block#=1, blocks=1
for 'control file sequential read' count=1 wait_time=5
file#=1, block#=1, blocks=1
for 'control file sequential read' count=1 wait_time=8
file#=0, block#=1, blocks=1
for 'rdbms ipc message' count=1 wait_time=26490
timeout=12c, =0, =0
for 'control file sequential read' count=1 wait_time=12
file#=0, block#=16, blocks=1
for 'control file sequential read' count=1 wait_time=4
file#=0, block#=13, blocks=1
temporary object counter: 0
从trace前面的锁信息得出如下结果:
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=524594 sid=550 sser=1 time_held=674 secs (ges mode req=6 held=4)
ckpt: hold 4, request 6
arch1: request 3,hold 6
综上所述,得出如下结论: 1. arch1 等待 ckpt (为了获得CF enqueue); 2. lgwr 在等待 arch1 去写日志; 3. arch1 等待时间超过 900s (从上面来看其实仅仅才674s); 4. lgwr 将进程 ckpt kill (因为ckpt进程阻塞arch进程,而lgwr通知arch1进程写归档,那么就间接是ckpt阻塞了lgwr进程); 5. database crash。 从alert来看,是因为log file切换有点过于频繁了,时间短的就30s左右。 根据oracle mos文档Database Crashes With ORA-00494 [ID 753290.1]的建议是这样的: 1. 调整redo log file,报错切换频率维持在 20~30 min; 2. 对于online logfile大小,可以参考 v$instance_recovery.optimal_logfile_size.
回复 Database crash with ORA-00494 » Oracle Notes 取消回复