最近一周遇到的几个case

6 月 29, 2011

—

由

于 mangement, Oracle rdbms, rac & dg, troubleshooting

最近一周遇到了好几个case，简单的总结一下，跟大家分享.

CASE 1.

某客户的一套9208的库，在跑批处理的时候，异常crash了，其中alert 信息如下：

Thread 1 advanced to log sequence 51938
  Current log# 2 seq# 51938 mem# 0: /oradata/cifs/redologfiles/redo21.log
  Current log# 2 seq# 51938 mem# 1: /oradata/cifs/redologfiles/redo22.log
Sun Jun 26 11:35:16 2011
Hex dump of Absolute File 4, Block 53784 in trace file /oracle/app/oracle/admin/cifs/

bdump/cifs_dbw0_225460.trc
***
Corrupt block relative dba: 0x0100d218 (file 4, block 53784)
Bad header found during preparing block for write
Data in bad block -
 type: 6 format: 2 rdba: 0x04c05738
 last change scn: 0x0001.42162bd7 seq: 0x2 flg: 0x04
 consistency value in tail: 0x2bd70602
 check value in block header: 0x3d2d, computed block checksum: 0x0
 spare1: 0x0, spare2: 0x0, spare3: 0x0
***
Sun Jun 26 11:35:16 2011
Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], []
Sun Jun 26 11:35:17 2011
Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], []
DBW0: terminating instance due to error 600
Instance terminated by DBW0, pid = 225460
Sun Jun 26 13:06:07 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 3
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.8.0.

从上面的Bad header found during preparing block for write 信息可以看出，dbw0进程在写数据库的时候，
发现这个block header信息异常。我们知道 tail=scn后4位+flg+seq 根据这个规则来看，很明显信息不对。
单从这个ora-600错误来看，metalink也有相关的文档解释，总的来说就是bug，这里根据
16831000 为10进制，首先转换为16进制，然后再转换为 file 4 和 block 53784，即得到上面的坏块号。
metalink 文档，大家可以参考ORA-600 [kcbzpb_1] [ID 139011.1] 其中提到的
bug 6401576 9.2.0.8.P22 OERI[ktbair1] / ORA-600 [6101] index corruption possible
跟我这里遇到的情况基本上一致，对于这种逻辑错误，根据oracle的解释，startup实例即可消除问题。

CASE 2.

某客户的其中一套10203 rac(asm)的某个节点出现ora-600错误，alert信息如下：

Fri Jun 24 09:06:49 2011
Thread 1 advanced to log sequence 7331
  Current log# 2 seq# 7331 mem# 0: +DG_DATA/cmsdb1/onlinelog/redo02_1.log
  Current log# 2 seq# 7331 mem# 1: +DG_DATA/cmsdb1/onlinelog/redo02_2.log
Fri Jun 24 10:32:58 2011
Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_2083.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Fri Jun 24 10:33:05 2011
Trace dumping is performing id=[cdmp_20110624103305]
Fri Jun 24 10:33:18 2011
###### 省略部分内容 ######
Fri Jun 24 15:09:28 2011
Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_8188.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Fri Jun 24 15:09:29 2011
Trace dumping is performing id=[cdmp_20110624150929]

从其中的某个trace文件，我发现了如下内容：

*** 2011-06-24 10:33:20.755
*** SERVICE NAME:(cmsdb1) 2011-06-24 10:33:20.731
*** SESSION ID:(173.17308) 2011-06-24 10:33:20.731
            row 02099eae.1 continuation at
            file# 8 block# 630446 slot 2 not found
**************************************************
KDSTABN_GET: 0 ….. ntab: 1
curSlot: 2 ….. nrows: 8
**************************************************
*** 2011-06-24 10:33:20.755
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT m.url,p.pubtime  FROM mastersite m, pagelayout p  WHERE m.nodeid = p.nodeid and m.articleid=p.articleid  AND p.pubtime BETWEEN SYSDATE – (:”SYS_B_0″ / :”SYS_B_1″) AND SYSDATE  AND m.url IS NOT NULL  AND m.url not like :”SYS_B_2″  order by p.pubtime
—– Call Stack Trace —–
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
——————– ——– ——————– —————————-
ksedmp()+744         CALL     ksedst()             000000840 ?
                                                   FFFFFFFF7FFF915C ?
                                                   000000000 ?
                                                   FFFFFFFF7FFF5C50 ?
                                                   FFFFFFFF7FFF49B8 ?
                                                   FFFFFFFF7FFF53B8 ?
kgerinv()+200        PTR_CALL 0000000000000000     000106400 ? 10652C964 ?
                                                   10652C000 ? 00010652C ?
                                                   000106400 ? 10652C964 ?
kgeasnmierr()+28     CALL     kgerinv()            106527318 ? 000000000 ?
                                                   10610F130 ? 000000000 ?
                                                   FFFFFFFF7FFF9600 ?
                                                   000001430 ?
kdsgrp()+1252        CALL     kgeasnmierr()        106527318 ?
                                                   FFFFFFFF7CF26270 ?
                                                   10610F130 ? 000000000 ?
                                                   10610F000 ? 00010610F ?
kdsfbr()+240         CALL     kdsgrp()             FFFFFFFF7CF575E0 ?
                                                   000000000 ? 000106000 ?
                                                   10610F130 ? 000000000 ?
                                                   10652C958 ?
qertbFetchByRowID()  CALL     kdsfbr()             FFFFFFFF7CF575E0 ?
+996                                               FFFFFFFF7CF57CCC ?
                                                   FFFFFFFF7CF575E0 ?
                                                   000106529 ? 800137DC0 ?
                                                   106529DB0 ?
qergiFetch()+344     PTR_CALL 0000000000000000     1056DA000 ? 0000010E4 ?
                                                   000000000 ? 000000001 ?
                                                   FFFFFFFF7CF57548 ?
                                                   FFFFFFFF7CF57CC8 ?
qerjoFetch()+580     PTR_CALL 0000000000000000     47A4819D0 ? 000000000 ?
                                                   000000000 ?
                                                   FFFFFFFF7CF57738 ?
                                                   000000027 ? 000000800 ?
qertbFetchByRowID()  PTR_CALL 0000000000000000     000000000 ? 1031FF060 ?
+556                                               FFFFFFFF7CF57C60 ?
                                                   47A4815E0 ? 47A4813F8 ?
                                                   000008000 ?
rwsfcd()+120         PTR_CALL 0000000000000000     FFFFFFFF7CF5AC78 ?
                                                   000000040 ? 47A481140 ?
                                                   000007FFF ? 47A4810D0 ?
                                                   00000013C ?
qerflFetch()+172     PTR_CALL 0000000000000000     47A480B30 ? 10320D5A0 ?
                                                   FFFFFFFF7FFFA908 ?
                                                   000007FFF ? 0000003C0 ?
                                                   105A287B0 ?
qersoFetch()+720     PTR_CALL 0000000000000000     FFFFFFFF7CF5C260 ?
                                                   103291C80 ? 47A4806E0 ?
                                                   000007FFF ? 10320D000 ?
                                                   103290B40 ?
opifch2()+8092       CALL     kpofrws()            105A287B0 ? 1024D8940 ?
                                                   105A28000 ?
                                                   FFFFFFFF7CF5C278 ?
                                                   000000020 ? 47A4809B8 ?
kpoal8()+4196        CALL     opifch2()            000000020 ?
                                                   FFFFFFFF7CF3A830 ?
                                                   000000000 ? 000106400 ?
                                                   000000001 ? 47A4861E8 ?
opiodr()+1548        PTR_CALL 0000000000000000     000000020 ? 000000000 ?
                                                   000105800 ? 000000004 ?
                                                   000000000 ? 106534460 ?
ttcpip()+1284        PTR_CALL 0000000000000000     10576A380 ? 00000005E ?
                                                   106527200 ? 000000001 ?
                                                   FFFFFFFF7CF3A830 ?
                                                   000106529 ?
opitsk()+1432        CALL     ttcpip()             0000001AF ?
                                                   FFFFFFFF7FFFD220 ?
                                                   1056C06EC ? 1056BDED0 ?
                                                   000000000 ? 106527318 ?
opiino()+1128        CALL     opitsk()             106534468 ? 000000001 ?
                                                   000000000 ? 106534460 ?
                                                   105884B38 ? 0FFFFFFFD ?
opiodr()+1548        PTR_CALL 0000000000000000     000106400 ? 106529D98 ?
                                                   000106400 ? 106529000 ?
                                                   000106400 ? 106534460 ?
opidrv()+896         CALL     opiodr()             1065335D8 ? 00000003C ?
                                                   000106400 ? 1065343E0 ?
                                                   000106534 ? 000106529 ?
sou2o()+80           CALL     opidrv()             106536B60 ? 000000000 ?
                                                   00000003C ? 106533898 ?
                                                   00000003C ? 000000000 ?
opimai_real()+124    CALL     sou2o()              FFFFFFFF7FFFF978 ?
                                                   00000003C ? 000000004 ?
                                                   FFFFFFFF7FFFF9A0 ?
                                                   105E0E000 ? 000105E0E ?
main()+152           CALL     opimai_real()        000000002 ?
                                                   FFFFFFFF7FFFFA78 ?
                                                   104051BEC ? 1064CF450 ?
                                                   00247D764 ? 000014800 ?
_start()+380         CALL     main()               000000002 ? 000000008 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFFA88 ?
                                                   FFFFFFFF7FFFFB98 ?
                                                   FFFFFFFF7D500200 ?

其中找了两篇比较有价值而且比较靠谱的文章，如下：
Getting Ora-600 [kdsgrp1] On A Select [ID 429746.1]
After Upgrade To 10203 ,Select * From Gv$Aq Returns Ora-00600[kdsgrp1] Or [Qertbfetchbyrowid] [ID 603095.1]
最开始看ORA-600 [kdsgrp1] [ID 285586.1] 这篇文档的时候，我感觉这个问题很可能就是bug，然而看了上面2篇文章以后，
感觉不见得一定就是bug，另外还看了好几篇文档，都是与index相关或CR相关的，所以我断定这个问题是index和table之间
的数据不一致导致的。
其中Getting Ora-600 [kdsgrp1] On A Select [ID 429746.1] 这篇是最有价值的，提供了一个解决思路：
首先根据ora-600错误提到的file和block号，定位到具体的对象：

SELECT tablespace_name, segment_type, owner, segment_name
FROM dba_extents
WHERE file_id = 8
and 630446 between block_id AND block_id + blocks - 1;

然后从trace中搜索 Plan Table，找到执行计划，根据前面查到的对象，然后确定到具体的index名称。

============
Plan Table
============
------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                   | Rows  | Bytes | Cost  | Time      | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------
| 0   | SELECT STATEMENT                        |                        |       |       |  1230 |           |       |       |
| 1   |  COUNT STOPKEY                          |                        |       |       |       |           |       |       |
| 2   |   VIEW                                  |                        |   113 |  867K |  1230 |  00:00:15 |       |       |
| 3   |    SORT ORDER BY STOPKEY                |                        |   113 |  155K |  1230 |  00:00:15 |       |       |
| 4   |     NESTED LOOPS OUTER                  |                        |   113 |  155K |  1229 |  00:00:15 |       |       |
| 5   |      PARTITION RANGE SINGLE             |                        |   113 |  149K |  1228 |  00:00:15 | KEY   | KEY   |
| 6   |       TABLE ACCESS BY LOCAL INDEX ROWID | PAGELAYOUT             |   113 |  149K |  1228 |  00:00:15 | KEY   | KEY   |
| 7   |        INDEX RANGE SCAN                 | IDX_PAGELAYOUT_T_NODEID|   214 |       |     7 |  00:00:01 | KEY   | KEY   |
| 8   |      TABLE ACCESS BY INDEX ROWID        | DATASNIPARTICLE        |     1 |    57 |     1 |  00:00:01 |       |       |
| 9   |       INDEX UNIQUE SCAN                 | SYS_C0034918           |     1 |       |     0 |           |       |       |
------------------------------------------------------------------------------------------------------------------------------

最后drop和recreate该index即可解决该问题。
补充：从别的trace中，还发现了update语句出现该错误，如下：

*** 2011-06-24 10:33:18.007
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
update pagelayout set publishstate=:1 where articleid = :2 and (masterid = :3 or nodeid = :4)

从而我确定是表pagelayout上的某个index出现问题。

CASE 3.

上周出差到吉林某客户，巡检时发现某套rac的awr居然没有生成，而且手工创建快照也没反应。总之说来，
比较怪异，而且还发现某个节点的/oracle 一共20g，oracle本身才8G左右，另外的12g不知道被什么东西
占据了，ls -ltr居然无法看见，经主机工程师检查，发现是diag进程占据了12g的磁盘空间。
在操作之前，我做了一个systemstate dump，对该dump我简单的分析了一下，发现了如下有用信息：

System global information:
processes: base c0000000ad649960, size 500, cleanup c0000000ad68c020
allocation: free sessions c0000000ad78d620, free calls 0000000000000000
control alloc errors: 0 (process), 0 (session), 0 (call)
PMON latch cleanup depth: 0
seconds since PMON's last scan for dead processes: 19  ###### 有19个死进程
###### 我们首先来搜索diag进程：######
PROCESS 3:
—————————————-
SO: c0000000ad68c810, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=3, calls cur/top: c0000000adab8358/c0000000adab8358, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: c0000000ad6956f0 235 0
last post sent: 0 0 48
last post sent-location: ksoreq_reply
last process posted by me: c0000000ad782910 1 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27628
OSD pid info: Unix process pid: 27628, image: oracle@uamdb2 (DIAG)
###### 然后搜索c0000000ad68c810 字符串，发现如下内容：######
SO: c0000000ab2bac00, type: 19, owner: c0000000ad68c810, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=0
outq=0 sndq=0 opid=0 prmb=0x0
mbg[i]=(0 0) mbg[b]=(0 0) mbg[r]=(0 0)
fmq[i]=(0 0) fmq[b]=(0 0) fmq[r]=(0 0)
mop[s]=0 mop[q]=0 pendq=0 zmbq=0
nonksxp_recvs=0
————process 0xc0000000ab2bac00——————–
proc version      : 0
Local node        : 1
pid               : 27628
lkp_node          : 1
svr_mode          : 0
proc state        : KJP_UNFREEZE
Last drm hb acked : 0
Total accesses    : 5
Imm.  accesses    : 0
Locks on ASTQ     : 0
Locks Pending AST : 0
Granted locks     : 0
AST_Q:
PENDING_Q:
GRANTED_Q:
—————————————-
SO: c0000000ada72528, type: 4, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(session) sid: 554 trans: 0000000000000000, creator: c0000000ad68c810, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'DIAG idle wait' blocking sess=0x0000000000000000 seq=331 wait_time=0 seconds since wait started=62193
component=1, where=1, wait time(millisec)=c8
Dumping Session Wait History
for 'DIAG idle wait' count=1 wait_time=195274
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195286
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=214760
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195267
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195282
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195282
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195281
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195278
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195280
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195270
component=1, where=1, wait time(millisec)=c8
temporary object counter: 0
—————————————-
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log c0000000acb7a5a0
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000acb7a5e8
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000acb7a630
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
—————————————-
SO: c0000000a1a7fa18, type: 41, owner: c0000000ada72528, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0
—————————————-
SO: c0000000adab8358, type: 3, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(call) sess: cur c0000000ada72528, rec 0, usr c0000000ada72528; depth: 0
—————————————-
SO: c0000000ab2adec8, type: 16, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(osp req holder)
###### 然后来看mmon进程：######
SO: c0000000add59f60, type: 11, owner: c0000000ad692f40, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad692f40,
event: 1, last message event: 1381841,
last message waited event: 1381841, messages read: 1376494
channel: (c0000000add81260) MMON remote action broadcast channel
scope: 27, event: 1381841, last mesage event: 1381841,
publishers/subscribers: 2/2,
messages published: 1376494
—————————————-
SO: c0000000ada60ee0, type: 4, owner: c0000000ad692f40, flag: INIT/-/-/0x00
(session) sid: 541 trans: 0000000000000000, creator: c0000000ad692f40, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0002-0010-00000003, short-term DID: 0002-0010-00000004
txn branch: 0000000000000000
oct: 0, prv: 0, sql: c0000000ac7f3760, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=49622 wait_time=0 seconds since wait started=2
timeout=12c, =0, =0
Dumping Session Wait History
for 'rdbms ipc message' count=1 wait_time=1503849
timeout=99, =0, =0
for 'rdbms ipc message' count=1 wait_time=1435494
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2939422
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=517538
timeout=34, =0, =0
for 'rdbms ipc message' count=1 wait_time=2421822
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2470581
timeout=fc, =0, =0
for 'rdbms ipc message' count=1 wait_time=1
timeout=fc, =0, =0
for 'rdbms ipc message' count=1 wait_time=468661
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2933950
timeout=12c, =0, =0
for 'os thread startup' count=1 wait_time=41005
=0, =0, =0
另外看了几个其他进程的信息，发现都是在等待 rdbms ipc message。
记得当时还做了一个oradebug hanganalyze 3，发现lck0进程居然阻塞了451个objects。
由于当时的产生的trace我忘记取下来了, 所以就只能用这个trace中去寻找信息了。
###### 搜索lck字符串：######
PROCESS 20:
—————————————-
SO: c0000000ad694f00, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=20, calls cur/top: c0000000adabb810/c0000000adabb810, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 33
last post received-location: ksrpublish
last process to post me: c0000000ad694f00 1 6
last post sent: 0 0 33
last post sent-location: ksrpublish
last process posted by me: c0000000ad694f00 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27684
OSD pid info: Unix process pid: 27684, image: oracle@uamdb2 (LCK0)
###### 然后搜索字符串c0000000ad694f00 ######
SO: c0000000adbb9fd0, type: 5, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(enqueue) XR-00000000-00000000 DID: 0000-0014-00000003
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x3
res: 0xc0000000adcbf510, mode: NULL, lock_flag: 0x0
own: 0xc0000000ada5b940, sess: 0xc0000000ada5b940, proc: 0xc0000000ad694f00, prv: 0xc0000000adcbf520
slk: 0xc0000000ab40db98
这段内容是enqueue state object信息，发现该进程持有一个enqueue。
从这里的 res_flag: 0x3 可以判断：这个enqueue是global类型的，而且正在被持有，还没被free。
继续往下搜索，发现了如下类似的大量信息：
SO: c0000000add5e1e8, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 3, last message event: 3,
last message waited event: 3, messages read: 0
channel: (c0000000add81108) kea interrupt signal channel
scope: 5, event: 3, last mesage event: 0,
publishers/subscribers: 1/2,
messages published: 0
—————————————-
SO: c0000000add5e0d8, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 3, last message event: 3326313,
last message waited event: 3326313, messages read: 3324562
channel: (c0000000add7ee18) kxfp control signal channel
scope: 5, event: 3326313, last mesage event: 3326313,
publishers/subscribers: 4/2,
messages published: 3324562
这里的type值为11，代表的意思是：Broadcast Handle(即广播消息句柄）。难怪做oradebug的时候，发现这个lck进程
阻塞了451个objects，而且等待居然都是rdbms ipc message。
继续往下搜索，发现了如下关键信息：
—————————————-
SO: c0000000ada5b940, type: 4, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(session) sid: 537 trans: 0000000000000000, creator: c0000000ad694f00, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0014-00000003, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=43421 wait_time=0 seconds since wait started=1
timeout=12c, =0, =0
Dumping Session Wait History
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=1906819
timeout=c2, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=9
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=12
timeout=c2, =0, =0
for 'rdbms ipc message' count=1 wait_time=1041870
timeout=12c, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=10
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=807692
timeout=52, =0, =0
for 'rdbms ipc message' count=1 wait_time=27
timeout=52, =0, =0
temporary object counter: 0
—————————————-
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log c0000000aca2e6e8
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Cursor Pin uid 537 efd 3 whr 11 slp 0
KGX Atomic Operation Log c0000000aca2e730
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 537 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000aca2e778
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 537 efd 0 whr 0 slp 0
—————————————-
SO: c0000000a1a7fe78, type: 41, owner: c0000000ada5b940, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0
—————————————-
SO: c0000000add5ab28, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 20, last message event: 26,
last message waited event: 26, messages read: 1
channel: (c0000000add7a2d8) system events broadcast channel
scope: 2, event: 204545, last mesage event: 26,
publishers/subscribers: 0/75,
messages published: 1
—————————————-
SO: c0000000ab2ba708, type: 19, owner: c0000000ad694f00, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=36
outq=0 sndq=0 opid=20 prmb=0x0
mbg[i]=(3277535 648561) mbg[b]=(0 0) mbg[r]=(0 0)
fmq[i]=(4 0) fmq[b]=(0 0) fmq[r]=(0 0)
mop[s]=648561 mop[q]=3277535 pendq=0 zmbq=0
nonksxp_recvs=0
————process 0xc0000000ab2ba708——————–
proc version      : 0
Local node        : 1
pid               : 27684
lkp_node          : 1
svr_mode          : 0
proc state        : KJP_NORMAL
Last drm hb acked : 0
Total accesses    : 221605743
Imm.  accesses    : 196696635
Locks on ASTQ     : 0
Locks Pending AST : 113
Granted locks     : 56081
AST_Q:
PENDING_Q:
lp 0xc0000000acb505c0 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x9565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
lp 0xc00000007a220300 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
lp 0xc000000074ac4c38 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
lp 0xc0000000748ca940 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
这里的rp 0xc000000094eeda88 我猜应该是enqueue等待的资源名，于是搜索resouce 0xc000000094eeda88字符串，发现了如下信息：
———-enqueue 0xc00000007a3cb2e0————————
lock version     : 1
Owner node       : 1
grant_level      : KJUSERNL
req_level        : KJUSERPR
bast_level       : KJUSERNL
notify_func      : 0x0000000000000000
resp             : 0xc000000094eeda88
procp            : 0xc0000000ab2ba708
pid              : 27684
proc version     : 0
oprocp           : 0x0000000000000000
opid             : 0
group lock owner : 0x0000000000000000
xid              : 0000-0000-00000000
dd_time          : 0.0 secs
dd_count         : 0
timeout          : 0.0 secs
On_timer_q?      : N
On_dd_q?         : N
lock_state       : CONVERTING
Open Options     :  KJUSERPROCESS_OWNED KJUSERCLIENTLOCK
Convert options  : KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
History          : 0x565a565a
Msg_Seq          : 0x0
res_seq          : 1
valblk           : 0x000000020001fcc00000000000000000 .
———-resource 0xc000000094eeda88———————-
resname       : [0x2345][0xffffffff],[IV]
Local node    : 1
dir_node      : 1
master_node   : 1
hv idx        : 35
hv last r.inc : 6
current inc   : 36
hv status     : 0
hv master     : 1
open options  :
grant_bits    : KJUSERNL KJUSERPR
grant mode    : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX
count         : 114         0         0         333         0         0
val_state     : KJUSERVS_VALUE
valblk        : 0x000000020001fcc00000000000000000 .
access_node   : 1
vbreq_state   : 0
state         : x8
resp          : 0xc000000094eeda88
On Scan_q?    : N
Total accesses: 4524902
Imm.  accesses: 1948489
Granted_locks : 333
Cvting_locks  : 114
总的来说，给人的感觉似乎lck是罪魁祸首，其实不竟然。
用awk脚本格式化了这个systemstate dump后，得到下面的结果：
System State 1
~~~~~~~~~~~~~~~~
1:
2:  waiting for 'pmon timer'            wait
3:  waiting for 'DIAG idle wait'        wait
4:  waiting for 'rdbms ipc message'     wait
5:  waiting for 'rdbms ipc message'     wait
6:  waiting for 'ges remote message'    wait
7:  waiting for 'gcs remote message'    wait
8:  waiting for 'gcs remote message'    wait
9:  waiting for 'rdbms ipc message'     wait
10: waiting for 'rdbms ipc message'     wait
11: waiting for 'rdbms ipc message'     wait
12: waiting for 'rdbms ipc message'     wait
13: waiting for 'smon timer'            wait
14: waiting for 'rdbms ipc message'     wait
15: waiting for 'rdbms ipc message'     wait
16: waiting for 'rdbms ipc message'     wait
17: waiting for 'rdbms ipc message'     wait
18:
19:
20: waiting for 'rdbms ipc message'     wait
21: waiting for 'SQL*Net message from client' wait
22: waiting for 'rdbms ipc message'     wait
23: waiting for 'rdbms ipc message'     wait
24: waiting for 'SQL*Net message from client' wait
25: waiting for 'Streams AQ: qmn coordinator idle wait' wait
26: waiting for 'SQL*Net message from client' wait
27: waiting for 'SQL*Net message from client' wait
Cmd: PL/SQL Execute
28: for 'Streams AQ: waiting for time management or cleanup tasks' wait
29:
30: waiting for 'SQL*Net message from client' wait
31: waiting for 'SQL*Net message from client' wait
32: waiting for 'SQL*Net message from client' wait
33: waiting for 'SQL*Net message from client' wait
34: waiting for 'SQL*Net message from client' wait
35: waiting for 'SQL*Net message from client' wait
36: waiting for 'SQL*Net message from client' wait
37: waiting for 'SQL*Net message from client' wait
38: waiting for 'SQL*Net message from client' wait
39: waiting for 'SQL*Net message from client' wait
Cmd: Insert
40: waiting for 'SQL*Net message from client' wait
41: waiting for 'SQL*Net message from client' wait
42: last wait for 'SQL*Net message from client'
43: waiting for 'SQL*Net message from client' wait
45: waiting for 'Streams AQ: qmn slave idle wait' wait
46: waiting for 'SQL*Net message from client' wait
Cmd: Insert
47: waiting for 'SQL*Net message from client' wait
48: waiting for 'SQL*Net message from client' wait
49: waiting for 'SQL*Net message from client' wait
50: waiting for 'SQL*Net message from client' wait
52: waiting for 'SQL*Net message from client' wait
53: waiting for 'SQL*Net message from client' wait
54: waiting for 'SQL*Net message from client' wait
55: waiting for 'SQL*Net message from client' wait
56: waiting for 'SQL*Net message from client' wait
58: waiting for 'SQL*Net message from client' wait
59: waiting for 'SQL*Net message from client' wait
60: waiting for 'SQL*Net message from client' wait
61: waiting for 'SQL*Net message from client' wait
65: waiting for 'SQL*Net message from client' wait
66: waiting for 'SQL*Net message from client' wait
90: waiting for 'SQL*Net message from client' wait
91: waiting for 'SQL*Net message from client' wait
92: waiting for 'SQL*Net message from client' wait
93: waiting for 'SQL*Net message from client' wait
94: waiting for 'SQL*Net message from client' wait
95: waiting for 'SQL*Net message from client' wait
96: waiting for 'SQL*Net message from client' wait
106:waiting for 'SQL*Net message from client' wait
107:waiting for 'SQL*Net message from client' wait
108:waiting for 'SQL*Net message from client' wait
109:waiting for 'SQL*Net message from client' wait
110:waiting for 'SQL*Net message from client' wait
111:waiting for 'SQL*Net message from client' wait
112:waiting for 'SQL*Net message from client' wait
113:waiting for 'SQL*Net message from client' wait
114:waiting for 'SQL*Net message from client' wait
115:waiting for 'SQL*Net message from client' wait
116:waiting for 'SQL*Net message from client' wait
117:waiting for 'SQL*Net message from client' wait
120:waiting for 'SQL*Net message from client' wait
NO BLOCKING PROCESSES FOUND
881983 Lines Processed.
从分析结果来看，断定pmon进程才是真正的罪魁祸首，搜索pmon关键字:
PROCESS 2:
—————————————-
SO: c0000000ad68c020, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: c0000000adab8098/c0000000adab8098, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 48
last post received-location: ksoreq_reply
last process to post me: c0000000ad692750 4 2
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c0000000ad68d000 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27626
OSD pid info: Unix process pid: 27626, image: oracle@uamdb2 (PMON)
###### 省略部分信息 ######
—————————————-
SO: c0000000add586a8, type: 11, owner: c0000000ad68c020, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad68c020,
event: 1, last message event: 1,
last message waited event: 1, messages read: 0
channel: (c0000000add7e4b0) scumnt mount lock
scope: 1, event: 15, last mesage event: 0,
publishers/subscribers: 0/14,
messages published: 0
—————————————-
SO: c0000000ada73a90, type: 4, owner: c0000000ad68c020, flag: INIT/-/-/0x00
(session) sid: 555 trans: 0000000000000000, creator: c0000000ad68c020, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0002-0002-00000004, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'pmon timer' blocking sess=0x0000000000000000 seq=50 wait_time=0 seconds since wait started=75789
duration=12c, =0, =0
Dumping Session Wait History
for 'pmon timer' count=1 wait_time=2929152
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929462
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929351
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929175
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2928083
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929290
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929509
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929262
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929283
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929525
duration=12c, =0, =0
temporary object counter: 0

2929152/3600/24=33天左右。换句话说，pmon进程在1个月前就出现异常了，同事也说5月26号的时候重启过，
后来似乎就有问题了，掐指一算，时间似乎刚刚对上。
最后我在中午申请了一点停机时间，我们进行了处理，首先我kill了diag进程，12g的磁盘空间得到了释放，
最开始我断定是lck进程导致的，于是kill了该进程，结果直接导致该节点重启了，不过还好的是，23s就重启完成了。

CASE 4.

这个是昨天的一天case，某个客户方工程师的一套hp 9204系统（HA）环境，记得是我以前部署的。
客户工程师给listener设置密码以后，用脚本来启动监听，发现报错，非常的奇怪，信息如下：
TNS-01169: The listener has not recognized the password
当时他的脚本如下：

su - oracle<


但看脚本，似乎没啥问题，最后我才发现是有问题，那也是折腾1个小时之后才发现的，几个同事一起看，都没发现问题的所在。

老实说，我也偶然发现的，记得快下班的时候，想起了老白的一句话: 像oracle一样思考

想到这句话，突然另我茅塞顿开，我当时是这样认为的：

==> 既然我们要对listener设置密码，那么必然是为了安全性；

==> 通过set password roger 加到脚本中来启动，要是有读权限的用户不就能看到密码了吗？那还谈何安全性呢？

很明显，这里就自相矛盾了；

于是我尝试将脚本修改为如下：


su - oracle<


再次执行，发现居然ok了，最后我又想，既然是要输入加密后的密码而不是明文，那么metalink文档应该有明确的说明啊；

记得在开始遇到这个问题之后，我也看了metalink文档 

How To Start/Stop A Password Protected Listener From A Script [ID 361919.1]

但是只是扫了一眼，没有引起重视，最后解决问题以后，我再次看了下，原来如此:

set password encrypted_password -- 其实文档是明确说了，要输入加密后的密码

这当然说明了我当时太马虎了，不过也确实容易忽略。

补充一下，如果在10g中进行类似的测试，本机操作需要设置参数LOCAL_OS_AUTHENTICATION_为off。

ora-600 [kcbzpb_1] [kdsgrp1] listener encrypted_password

惜分飞

2011年7月3日

roger:
我给listener设置了密码，
PASSWORDS_LISTENER = 6D7AA003392C436A（明文为：123456）
然后使用
lsnrctl<>listener.log
set current_listener listener
set password 123456
status
EOF
测试成功
修改为：
lsnrctl<>listener.log
set current_listener listener
set password 6D7AA003392C436A
status
EOF
提示：
TNS-01169: The listener has not recognized the password
和你说的结果相反

回复
admin

2011年7月3日

oracle 版本？

回复
Bradford Cerbone

2011年7月19日

Have you seen this new plugin from online marketing genius Daniel Tan known as SEOPressor ? I have already been implementing it for the past couple weeks and I now have five of my new blogs on the first page of google for highly competetive terms! I would suggest you try it out here: SEOPressor

回复
Eduardo Hoisl

2011年7月19日

Impressive blog, the only thing I noticed is your on-page SEO might use just a bit of work and then you would definitely be achieving a high position in google in no time! Have you used the SEOPressor plugin? It’s extremely effective and user friendly. Download it here. Just thought I’d share. Good luck to you.

回复
Earlene Pandy

2011年11月7日

why dont you add a mailing list to your blog?

回复

最近一周遇到的几个case

评论

发表回复取消回复

最近一周遇到的几个case

评论

发表回复 取消回复

发表回复取消回复