最近一周遇到了好几个case,简单的总结一下,跟大家分享.
CASE 1.
某客户的一套9208的库,在跑批处理的时候,异常crash了,其中alert 信息如下:
Thread 1 advanced to log sequence 51938
Current log# 2 seq# 51938 mem# 0: /oradata/cifs/redologfiles/redo21.log
Current log# 2 seq# 51938 mem# 1: /oradata/cifs/redologfiles/redo22.log
Sun Jun 26 11:35:16 2011
Hex dump of Absolute File 4, Block 53784 in trace file /oracle/app/oracle/admin/cifs/
bdump/cifs_dbw0_225460.trc
***
Corrupt block relative dba: 0x0100d218 (file 4, block 53784)
Bad header found during preparing block for write
Data in bad block -
type: 6 format: 2 rdba: 0x04c05738
last change scn: 0x0001.42162bd7 seq: 0x2 flg: 0x04
consistency value in tail: 0x2bd70602
check value in block header: 0x3d2d, computed block checksum: 0x0
spare1: 0x0, spare2: 0x0, spare3: 0x0
***
Sun Jun 26 11:35:16 2011
Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], []
Sun Jun 26 11:35:17 2011
Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], []
DBW0: terminating instance due to error 600
Instance terminated by DBW0, pid = 225460
Sun Jun 26 13:06:07 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 3
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.8.0.
从上面的Bad header found during preparing block for write 信息可以看出,dbw0进程在写数据库的时候,
发现这个block header信息异常。我们知道 tail=scn后4位+flg+seq 根据这个规则来看,很明显信息不对。
单从这个ora-600错误来看,metalink也有相关的文档解释,总的来说就是bug,这里根据
16831000 为10进制,首先转换为16进制,然后再转换为 file 4 和 block 53784,即得到上面的坏块号。
metalink 文档,大家可以参考ORA-600 [kcbzpb_1] [ID 139011.1] 其中提到的
bug 6401576 9.2.0.8.P22 OERI[ktbair1] / ORA-600 [6101] index corruption possible
跟我这里遇到的情况基本上一致,对于这种逻辑错误,根据oracle的解释,startup实例即可消除问题。
CASE 2.
某客户的其中一套10203 rac(asm)的某个节点出现ora-600错误,alert信息如下:
Fri Jun 24 09:06:49 2011
Thread 1 advanced to log sequence 7331
Current log# 2 seq# 7331 mem# 0: +DG_DATA/cmsdb1/onlinelog/redo02_1.log
Current log# 2 seq# 7331 mem# 1: +DG_DATA/cmsdb1/onlinelog/redo02_2.log
Fri Jun 24 10:32:58 2011
Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_2083.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Fri Jun 24 10:33:05 2011
Trace dumping is performing id=[cdmp_20110624103305]
Fri Jun 24 10:33:18 2011
###### 省略部分内容 ######
Fri Jun 24 15:09:28 2011
Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_8188.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Fri Jun 24 15:09:29 2011
Trace dumping is performing id=[cdmp_20110624150929]
从其中的某个trace文件,我发现了如下内容:
*** 2011-06-24 10:33:20.755
*** SERVICE NAME:(cmsdb1) 2011-06-24 10:33:20.731
*** SESSION ID:(173.17308) 2011-06-24 10:33:20.731
row 02099eae.1 continuation at
file# 8 block# 630446 slot 2 not found
**************************************************
KDSTABN_GET: 0 ….. ntab: 1
curSlot: 2 ….. nrows: 8
**************************************************
*** 2011-06-24 10:33:20.755
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT m.url,p.pubtime FROM mastersite m, pagelayout p WHERE m.nodeid = p.nodeid and m.articleid=p.articleid AND p.pubtime BETWEEN SYSDATE – (:”SYS_B_0″ / :”SYS_B_1″) AND SYSDATE AND m.url IS NOT NULL AND m.url not like :”SYS_B_2″ order by p.pubtime
—– Call Stack Trace —–
calling call entry argument values in hex
location type point (? means dubious value)
——————– ——– ——————– —————————-
ksedmp()+744 CALL ksedst() 000000840 ?
FFFFFFFF7FFF915C ?
000000000 ?
FFFFFFFF7FFF5C50 ?
FFFFFFFF7FFF49B8 ?
FFFFFFFF7FFF53B8 ?
kgerinv()+200 PTR_CALL 0000000000000000 000106400 ? 10652C964 ?
10652C000 ? 00010652C ?
000106400 ? 10652C964 ?
kgeasnmierr()+28 CALL kgerinv() 106527318 ? 000000000 ?
10610F130 ? 000000000 ?
FFFFFFFF7FFF9600 ?
000001430 ?
kdsgrp()+1252 CALL kgeasnmierr() 106527318 ?
FFFFFFFF7CF26270 ?
10610F130 ? 000000000 ?
10610F000 ? 00010610F ?
kdsfbr()+240 CALL kdsgrp() FFFFFFFF7CF575E0 ?
000000000 ? 000106000 ?
10610F130 ? 000000000 ?
10652C958 ?
qertbFetchByRowID() CALL kdsfbr() FFFFFFFF7CF575E0 ?
+996 FFFFFFFF7CF57CCC ?
FFFFFFFF7CF575E0 ?
000106529 ? 800137DC0 ?
106529DB0 ?
qergiFetch()+344 PTR_CALL 0000000000000000 1056DA000 ? 0000010E4 ?
000000000 ? 000000001 ?
FFFFFFFF7CF57548 ?
FFFFFFFF7CF57CC8 ?
qerjoFetch()+580 PTR_CALL 0000000000000000 47A4819D0 ? 000000000 ?
000000000 ?
FFFFFFFF7CF57738 ?
000000027 ? 000000800 ?
qertbFetchByRowID() PTR_CALL 0000000000000000 000000000 ? 1031FF060 ?
+556 FFFFFFFF7CF57C60 ?
47A4815E0 ? 47A4813F8 ?
000008000 ?
rwsfcd()+120 PTR_CALL 0000000000000000 FFFFFFFF7CF5AC78 ?
000000040 ? 47A481140 ?
000007FFF ? 47A4810D0 ?
00000013C ?
qerflFetch()+172 PTR_CALL 0000000000000000 47A480B30 ? 10320D5A0 ?
FFFFFFFF7FFFA908 ?
000007FFF ? 0000003C0 ?
105A287B0 ?
qersoFetch()+720 PTR_CALL 0000000000000000 FFFFFFFF7CF5C260 ?
103291C80 ? 47A4806E0 ?
000007FFF ? 10320D000 ?
103290B40 ?
opifch2()+8092 CALL kpofrws() 105A287B0 ? 1024D8940 ?
105A28000 ?
FFFFFFFF7CF5C278 ?
000000020 ? 47A4809B8 ?
kpoal8()+4196 CALL opifch2() 000000020 ?
FFFFFFFF7CF3A830 ?
000000000 ? 000106400 ?
000000001 ? 47A4861E8 ?
opiodr()+1548 PTR_CALL 0000000000000000 000000020 ? 000000000 ?
000105800 ? 000000004 ?
000000000 ? 106534460 ?
ttcpip()+1284 PTR_CALL 0000000000000000 10576A380 ? 00000005E ?
106527200 ? 000000001 ?
FFFFFFFF7CF3A830 ?
000106529 ?
opitsk()+1432 CALL ttcpip() 0000001AF ?
FFFFFFFF7FFFD220 ?
1056C06EC ? 1056BDED0 ?
000000000 ? 106527318 ?
opiino()+1128 CALL opitsk() 106534468 ? 000000001 ?
000000000 ? 106534460 ?
105884B38 ? 0FFFFFFFD ?
opiodr()+1548 PTR_CALL 0000000000000000 000106400 ? 106529D98 ?
000106400 ? 106529000 ?
000106400 ? 106534460 ?
opidrv()+896 CALL opiodr() 1065335D8 ? 00000003C ?
000106400 ? 1065343E0 ?
000106534 ? 000106529 ?
sou2o()+80 CALL opidrv() 106536B60 ? 000000000 ?
00000003C ? 106533898 ?
00000003C ? 000000000 ?
opimai_real()+124 CALL sou2o() FFFFFFFF7FFFF978 ?
00000003C ? 000000004 ?
FFFFFFFF7FFFF9A0 ?
105E0E000 ? 000105E0E ?
main()+152 CALL opimai_real() 000000002 ?
FFFFFFFF7FFFFA78 ?
104051BEC ? 1064CF450 ?
00247D764 ? 000014800 ?
_start()+380 CALL main() 000000002 ? 000000008 ?
000000000 ?
FFFFFFFF7FFFFA88 ?
FFFFFFFF7FFFFB98 ?
FFFFFFFF7D500200 ?
其中找了两篇比较有价值而且比较靠谱的文章,如下:
Getting Ora-600 [kdsgrp1] On A Select [ID 429746.1]
After Upgrade To 10203 ,Select * From Gv$Aq Returns Ora-00600[kdsgrp1] Or [Qertbfetchbyrowid] [ID 603095.1]
最开始看ORA-600 [kdsgrp1] [ID 285586.1] 这篇文档的时候,我感觉这个问题很可能就是bug,然而看了上面2篇文章以后,
感觉不见得一定就是bug,另外还看了好几篇文档,都是与index相关或CR相关的,所以我断定这个问题是index和table之间
的数据不一致导致的。
其中Getting Ora-600 [kdsgrp1] On A Select [ID 429746.1] 这篇是最有价值的,提供了一个解决思路:
首先根据ora-600错误提到的file和block号,定位到具体的对象:
SELECT tablespace_name, segment_type, owner, segment_name
FROM dba_extents
WHERE file_id = 8
and 630446 between block_id AND block_id + blocks - 1;
然后从trace中搜索 Plan Table,找到执行计划,根据前面查到的对象,然后确定到具体的index名称。
============
Plan Table
============
------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 1230 | | | |
| 1 | COUNT STOPKEY | | | | | | | |
| 2 | VIEW | | 113 | 867K | 1230 | 00:00:15 | | |
| 3 | SORT ORDER BY STOPKEY | | 113 | 155K | 1230 | 00:00:15 | | |
| 4 | NESTED LOOPS OUTER | | 113 | 155K | 1229 | 00:00:15 | | |
| 5 | PARTITION RANGE SINGLE | | 113 | 149K | 1228 | 00:00:15 | KEY | KEY |
| 6 | TABLE ACCESS BY LOCAL INDEX ROWID | PAGELAYOUT | 113 | 149K | 1228 | 00:00:15 | KEY | KEY |
| 7 | INDEX RANGE SCAN | IDX_PAGELAYOUT_T_NODEID| 214 | | 7 | 00:00:01 | KEY | KEY |
| 8 | TABLE ACCESS BY INDEX ROWID | DATASNIPARTICLE | 1 | 57 | 1 | 00:00:01 | | |
| 9 | INDEX UNIQUE SCAN | SYS_C0034918 | 1 | | 0 | | | |
------------------------------------------------------------------------------------------------------------------------------
最后drop和recreate该index即可解决该问题。
补充:从别的trace中,还发现了update语句出现该错误,如下:
*** 2011-06-24 10:33:18.007
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
update pagelayout set publishstate=:1 where articleid = :2 and (masterid = :3 or nodeid = :4)
从而我确定是表pagelayout上的某个index出现问题。
CASE 3.
上周出差到吉林某客户,巡检时发现某套rac的awr居然没有生成,而且手工创建快照也没反应。总之说来,
比较怪异,而且还发现某个节点的/oracle 一共20g,oracle本身才8G左右,另外的12g不知道被什么东西
占据了,ls -ltr居然无法看见,经主机工程师检查,发现是diag进程占据了12g的磁盘空间。
在操作之前,我做了一个systemstate dump,对该dump我简单的分析了一下,发现了如下有用信息:
System global information:
processes: base c0000000ad649960, size 500, cleanup c0000000ad68c020
allocation: free sessions c0000000ad78d620, free calls 0000000000000000
control alloc errors: 0 (process), 0 (session), 0 (call)
PMON latch cleanup depth: 0
seconds since PMON's last scan for dead processes: 19 ###### 有19个死进程
###### 我们首先来搜索diag进程:######
PROCESS 3:
—————————————-
SO: c0000000ad68c810, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=3, calls cur/top: c0000000adab8358/c0000000adab8358, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: c0000000ad6956f0 235 0
last post sent: 0 0 48
last post sent-location: ksoreq_reply
last process posted by me: c0000000ad782910 1 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27628
OSD pid info: Unix process pid: 27628, image: oracle@uamdb2 (DIAG)
###### 然后搜索c0000000ad68c810 字符串,发现如下内容:######
SO: c0000000ab2bac00, type: 19, owner: c0000000ad68c810, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=0
outq=0 sndq=0 opid=0 prmb=0x0
mbg[i]=(0 0) mbg[b]=(0 0) mbg[r]=(0 0)
fmq[i]=(0 0) fmq[b]=(0 0) fmq[r]=(0 0)
mop[s]=0 mop[q]=0 pendq=0 zmbq=0
nonksxp_recvs=0
————process 0xc0000000ab2bac00——————–
proc version : 0
Local node : 1
pid : 27628
lkp_node : 1
svr_mode : 0
proc state : KJP_UNFREEZE
Last drm hb acked : 0
Total accesses : 5
Imm. accesses : 0
Locks on ASTQ : 0
Locks Pending AST : 0
Granted locks : 0
AST_Q:
PENDING_Q:
GRANTED_Q:
—————————————-
SO: c0000000ada72528, type: 4, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(session) sid: 554 trans: 0000000000000000, creator: c0000000ad68c810, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'DIAG idle wait' blocking sess=0x0000000000000000 seq=331 wait_time=0 seconds since wait started=62193
component=1, where=1, wait time(millisec)=c8
Dumping Session Wait History
for 'DIAG idle wait' count=1 wait_time=195274
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195286
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=214760
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195267
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195282
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195282
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195281
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195278
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195280
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195270
component=1, where=1, wait time(millisec)=c8
temporary object counter: 0
—————————————-
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log c0000000acb7a5a0
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000acb7a5e8
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000acb7a630
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
—————————————-
SO: c0000000a1a7fa18, type: 41, owner: c0000000ada72528, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0
—————————————-
SO: c0000000adab8358, type: 3, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(call) sess: cur c0000000ada72528, rec 0, usr c0000000ada72528; depth: 0
—————————————-
SO: c0000000ab2adec8, type: 16, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(osp req holder)
###### 然后来看mmon进程:######
SO: c0000000add59f60, type: 11, owner: c0000000ad692f40, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad692f40,
event: 1, last message event: 1381841,
last message waited event: 1381841, messages read: 1376494
channel: (c0000000add81260) MMON remote action broadcast channel
scope: 27, event: 1381841, last mesage event: 1381841,
publishers/subscribers: 2/2,
messages published: 1376494
—————————————-
SO: c0000000ada60ee0, type: 4, owner: c0000000ad692f40, flag: INIT/-/-/0x00
(session) sid: 541 trans: 0000000000000000, creator: c0000000ad692f40, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0002-0010-00000003, short-term DID: 0002-0010-00000004
txn branch: 0000000000000000
oct: 0, prv: 0, sql: c0000000ac7f3760, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=49622 wait_time=0 seconds since wait started=2
timeout=12c, =0, =0
Dumping Session Wait History
for 'rdbms ipc message' count=1 wait_time=1503849
timeout=99, =0, =0
for 'rdbms ipc message' count=1 wait_time=1435494
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2939422
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=517538
timeout=34, =0, =0
for 'rdbms ipc message' count=1 wait_time=2421822
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2470581
timeout=fc, =0, =0
for 'rdbms ipc message' count=1 wait_time=1
timeout=fc, =0, =0
for 'rdbms ipc message' count=1 wait_time=468661
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2933950
timeout=12c, =0, =0
for 'os thread startup' count=1 wait_time=41005
=0, =0, =0
另外看了几个其他进程的信息,发现都是在等待 rdbms ipc message。
记得当时还做了一个oradebug hanganalyze 3,发现lck0进程居然阻塞了451个objects。
由于当时的产生的trace我忘记取下来了, 所以就只能用这个trace中去寻找信息了。
###### 搜索lck字符串:######
PROCESS 20:
—————————————-
SO: c0000000ad694f00, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=20, calls cur/top: c0000000adabb810/c0000000adabb810, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 33
last post received-location: ksrpublish
last process to post me: c0000000ad694f00 1 6
last post sent: 0 0 33
last post sent-location: ksrpublish
last process posted by me: c0000000ad694f00 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27684
OSD pid info: Unix process pid: 27684, image: oracle@uamdb2 (LCK0)
###### 然后搜索字符串c0000000ad694f00 ######
SO: c0000000adbb9fd0, type: 5, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(enqueue) XR-00000000-00000000 DID: 0000-0014-00000003
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x3
res: 0xc0000000adcbf510, mode: NULL, lock_flag: 0x0
own: 0xc0000000ada5b940, sess: 0xc0000000ada5b940, proc: 0xc0000000ad694f00, prv: 0xc0000000adcbf520
slk: 0xc0000000ab40db98
这段内容是enqueue state object信息,发现该进程持有一个enqueue。
从这里的 res_flag: 0x3 可以判断:这个enqueue是global类型的,而且正在被持有,还没被free。
继续往下搜索,发现了如下类似的大量信息:
SO: c0000000add5e1e8, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 3, last message event: 3,
last message waited event: 3, messages read: 0
channel: (c0000000add81108) kea interrupt signal channel
scope: 5, event: 3, last mesage event: 0,
publishers/subscribers: 1/2,
messages published: 0
—————————————-
SO: c0000000add5e0d8, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 3, last message event: 3326313,
last message waited event: 3326313, messages read: 3324562
channel: (c0000000add7ee18) kxfp control signal channel
scope: 5, event: 3326313, last mesage event: 3326313,
publishers/subscribers: 4/2,
messages published: 3324562
这里的type值为11,代表的意思是:Broadcast Handle(即广播消息句柄)。难怪做oradebug的时候,发现这个lck进程
阻塞了451个objects,而且等待居然都是rdbms ipc message。
继续往下搜索,发现了如下关键信息:
—————————————-
SO: c0000000ada5b940, type: 4, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(session) sid: 537 trans: 0000000000000000, creator: c0000000ad694f00, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0014-00000003, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=43421 wait_time=0 seconds since wait started=1
timeout=12c, =0, =0
Dumping Session Wait History
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=1906819
timeout=c2, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=9
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=12
timeout=c2, =0, =0
for 'rdbms ipc message' count=1 wait_time=1041870
timeout=12c, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=10
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=807692
timeout=52, =0, =0
for 'rdbms ipc message' count=1 wait_time=27
timeout=52, =0, =0
temporary object counter: 0
—————————————-
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log c0000000aca2e6e8
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Cursor Pin uid 537 efd 3 whr 11 slp 0
KGX Atomic Operation Log c0000000aca2e730
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 537 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000aca2e778
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 537 efd 0 whr 0 slp 0
—————————————-
SO: c0000000a1a7fe78, type: 41, owner: c0000000ada5b940, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0
—————————————-
SO: c0000000add5ab28, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 20, last message event: 26,
last message waited event: 26, messages read: 1
channel: (c0000000add7a2d8) system events broadcast channel
scope: 2, event: 204545, last mesage event: 26,
publishers/subscribers: 0/75,
messages published: 1
—————————————-
SO: c0000000ab2ba708, type: 19, owner: c0000000ad694f00, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=36
outq=0 sndq=0 opid=20 prmb=0x0
mbg[i]=(3277535 648561) mbg[b]=(0 0) mbg[r]=(0 0)
fmq[i]=(4 0) fmq[b]=(0 0) fmq[r]=(0 0)
mop[s]=648561 mop[q]=3277535 pendq=0 zmbq=0
nonksxp_recvs=0
————process 0xc0000000ab2ba708——————–
proc version : 0
Local node : 1
pid : 27684
lkp_node : 1
svr_mode : 0
proc state : KJP_NORMAL
Last drm hb acked : 0
Total accesses : 221605743
Imm. accesses : 196696635
Locks on ASTQ : 0
Locks Pending AST : 113
Granted locks : 56081
AST_Q:
PENDING_Q:
lp 0xc0000000acb505c0 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x9565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
lp 0xc00000007a220300 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
lp 0xc000000074ac4c38 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
lp 0xc0000000748ca940 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
这里的rp 0xc000000094eeda88 我猜应该是enqueue等待的资源名,于是搜索resouce 0xc000000094eeda88字符串,发现了如下信息:
———-enqueue 0xc00000007a3cb2e0————————
lock version : 1
Owner node : 1
grant_level : KJUSERNL
req_level : KJUSERPR
bast_level : KJUSERNL
notify_func : 0x0000000000000000
resp : 0xc000000094eeda88
procp : 0xc0000000ab2ba708
pid : 27684
proc version : 0
oprocp : 0x0000000000000000
opid : 0
group lock owner : 0x0000000000000000
xid : 0000-0000-00000000
dd_time : 0.0 secs
dd_count : 0
timeout : 0.0 secs
On_timer_q? : N
On_dd_q? : N
lock_state : CONVERTING
Open Options : KJUSERPROCESS_OWNED KJUSERCLIENTLOCK
Convert options : KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
History : 0x565a565a
Msg_Seq : 0x0
res_seq : 1
valblk : 0x000000020001fcc00000000000000000 .
———-resource 0xc000000094eeda88———————-
resname : [0x2345][0xffffffff],[IV]
Local node : 1
dir_node : 1
master_node : 1
hv idx : 35
hv last r.inc : 6
current inc : 36
hv status : 0
hv master : 1
open options :
grant_bits : KJUSERNL KJUSERPR
grant mode : KJUSERNL KJUSERCR KJUSERCW KJUSERPR KJUSERPW KJUSEREX
count : 114 0 0 333 0 0
val_state : KJUSERVS_VALUE
valblk : 0x000000020001fcc00000000000000000 .
access_node : 1
vbreq_state : 0
state : x8
resp : 0xc000000094eeda88
On Scan_q? : N
Total accesses: 4524902
Imm. accesses: 1948489
Granted_locks : 333
Cvting_locks : 114
总的来说,给人的感觉似乎lck是罪魁祸首,其实不竟然。
用awk脚本格式化了这个systemstate dump后,得到下面的结果:
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer' wait
3: waiting for 'DIAG idle wait' wait
4: waiting for 'rdbms ipc message' wait
5: waiting for 'rdbms ipc message' wait
6: waiting for 'ges remote message' wait
7: waiting for 'gcs remote message' wait
8: waiting for 'gcs remote message' wait
9: waiting for 'rdbms ipc message' wait
10: waiting for 'rdbms ipc message' wait
11: waiting for 'rdbms ipc message' wait
12: waiting for 'rdbms ipc message' wait
13: waiting for 'smon timer' wait
14: waiting for 'rdbms ipc message' wait
15: waiting for 'rdbms ipc message' wait
16: waiting for 'rdbms ipc message' wait
17: waiting for 'rdbms ipc message' wait
18:
19:
20: waiting for 'rdbms ipc message' wait
21: waiting for 'SQL*Net message from client' wait
22: waiting for 'rdbms ipc message' wait
23: waiting for 'rdbms ipc message' wait
24: waiting for 'SQL*Net message from client' wait
25: waiting for 'Streams AQ: qmn coordinator idle wait' wait
26: waiting for 'SQL*Net message from client' wait
27: waiting for 'SQL*Net message from client' wait
Cmd: PL/SQL Execute
28: for 'Streams AQ: waiting for time management or cleanup tasks' wait
29:
30: waiting for 'SQL*Net message from client' wait
31: waiting for 'SQL*Net message from client' wait
32: waiting for 'SQL*Net message from client' wait
33: waiting for 'SQL*Net message from client' wait
34: waiting for 'SQL*Net message from client' wait
35: waiting for 'SQL*Net message from client' wait
36: waiting for 'SQL*Net message from client' wait
37: waiting for 'SQL*Net message from client' wait
38: waiting for 'SQL*Net message from client' wait
39: waiting for 'SQL*Net message from client' wait
Cmd: Insert
40: waiting for 'SQL*Net message from client' wait
41: waiting for 'SQL*Net message from client' wait
42: last wait for 'SQL*Net message from client'
43: waiting for 'SQL*Net message from client' wait
45: waiting for 'Streams AQ: qmn slave idle wait' wait
46: waiting for 'SQL*Net message from client' wait
Cmd: Insert
47: waiting for 'SQL*Net message from client' wait
48: waiting for 'SQL*Net message from client' wait
49: waiting for 'SQL*Net message from client' wait
50: waiting for 'SQL*Net message from client' wait
52: waiting for 'SQL*Net message from client' wait
53: waiting for 'SQL*Net message from client' wait
54: waiting for 'SQL*Net message from client' wait
55: waiting for 'SQL*Net message from client' wait
56: waiting for 'SQL*Net message from client' wait
58: waiting for 'SQL*Net message from client' wait
59: waiting for 'SQL*Net message from client' wait
60: waiting for 'SQL*Net message from client' wait
61: waiting for 'SQL*Net message from client' wait
65: waiting for 'SQL*Net message from client' wait
66: waiting for 'SQL*Net message from client' wait
90: waiting for 'SQL*Net message from client' wait
91: waiting for 'SQL*Net message from client' wait
92: waiting for 'SQL*Net message from client' wait
93: waiting for 'SQL*Net message from client' wait
94: waiting for 'SQL*Net message from client' wait
95: waiting for 'SQL*Net message from client' wait
96: waiting for 'SQL*Net message from client' wait
106:waiting for 'SQL*Net message from client' wait
107:waiting for 'SQL*Net message from client' wait
108:waiting for 'SQL*Net message from client' wait
109:waiting for 'SQL*Net message from client' wait
110:waiting for 'SQL*Net message from client' wait
111:waiting for 'SQL*Net message from client' wait
112:waiting for 'SQL*Net message from client' wait
113:waiting for 'SQL*Net message from client' wait
114:waiting for 'SQL*Net message from client' wait
115:waiting for 'SQL*Net message from client' wait
116:waiting for 'SQL*Net message from client' wait
117:waiting for 'SQL*Net message from client' wait
120:waiting for 'SQL*Net message from client' wait
NO BLOCKING PROCESSES FOUND
881983 Lines Processed.
从分析结果来看,断定pmon进程才是真正的罪魁祸首,搜索pmon关键字:
PROCESS 2:
—————————————-
SO: c0000000ad68c020, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: c0000000adab8098/c0000000adab8098, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 48
last post received-location: ksoreq_reply
last process to post me: c0000000ad692750 4 2
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c0000000ad68d000 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27626
OSD pid info: Unix process pid: 27626, image: oracle@uamdb2 (PMON)
###### 省略部分信息 ######
—————————————-
SO: c0000000add586a8, type: 11, owner: c0000000ad68c020, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad68c020,
event: 1, last message event: 1,
last message waited event: 1, messages read: 0
channel: (c0000000add7e4b0) scumnt mount lock
scope: 1, event: 15, last mesage event: 0,
publishers/subscribers: 0/14,
messages published: 0
—————————————-
SO: c0000000ada73a90, type: 4, owner: c0000000ad68c020, flag: INIT/-/-/0x00
(session) sid: 555 trans: 0000000000000000, creator: c0000000ad68c020, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0002-0002-00000004, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'pmon timer' blocking sess=0x0000000000000000 seq=50 wait_time=0 seconds since wait started=75789
duration=12c, =0, =0
Dumping Session Wait History
for 'pmon timer' count=1 wait_time=2929152
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929462
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929351
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929175
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2928083
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929290
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929509
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929262
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929283
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929525
duration=12c, =0, =0
temporary object counter: 0
2929152/3600/24=33天左右。换句话说,pmon进程在1个月前就出现异常了,同事也说5月26号的时候重启过,
后来似乎就有问题了,掐指一算,时间似乎刚刚对上。
最后我在中午申请了一点停机时间,我们进行了处理,首先我kill了diag进程,12g的磁盘空间得到了释放,
最开始我断定是lck进程导致的,于是kill了该进程,结果直接导致该节点重启了,不过还好的是,23s就重启完成了。
CASE 4.
这个是昨天的一天case,某个客户方工程师的一套hp 9204系统(HA)环境,记得是我以前部署的。
客户工程师给listener设置密码以后,用脚本来启动监听,发现报错,非常的奇怪,信息如下:
TNS-01169: The listener has not recognized the password
当时他的脚本如下:
su - oracle<
但看脚本,似乎没啥问题,最后我才发现是有问题,那也是折腾1个小时之后才发现的,几个同事一起看,都没发现问题的所在。
老实说,我也偶然发现的,记得快下班的时候,想起了老白的一句话: 像oracle一样思考
想到这句话,突然另我茅塞顿开,我当时是这样认为的:
==> 既然我们要对listener设置密码,那么必然是为了安全性;
==> 通过set password roger 加到脚本中来启动,要是有读权限的用户不就能看到密码了吗?那还谈何安全性呢?
很明显,这里就自相矛盾了;
于是我尝试将脚本修改为如下:
su - oracle<
再次执行,发现居然ok了,最后我又想,既然是要输入加密后的密码而不是明文,那么metalink文档应该有明确的说明啊;
记得在开始遇到这个问题之后,我也看了metalink文档
How To Start/Stop A Password Protected Listener From A Script [ID 361919.1]
但是只是扫了一眼,没有引起重视,最后解决问题以后,我再次看了下,原来如此:
set password encrypted_password -- 其实文档是明确说了,要输入加密后的密码
这当然说明了我当时太马虎了,不过也确实容易忽略。
补充一下,如果在10g中进行类似的测试,本机操作需要设置参数LOCAL_OS_AUTHENTICATION_为off。
发表回复