案例:troubleshooting error ORA-12805 when using parallel

本案例来自南区某客户,环境为exadata,数据库版本rac 11.2.0.4。在并行创建索引时,大概执行5分钟之后就会报错ORA-12805,据现场人员描述,无论并行度指定为多少,都会报错。

mos中有一篇文档How to Diagnose ORA-12805 Parallel Query Server Died Unexpectedly Error (Doc ID 1348002.1) ,详细记录了ORA-12805的诊断方法。

alter session set MAX_DUMP_FILE_SIZE= unlimited;
alter session set  TRACEFILE_IDENTIFIER  = 'ORA12805'
alter session set "_px_trace"=medium, execution, medium, messaging, time;
alter system set  events = '12805 trace name ERRORSTACK level 3';

设置px trace和12805的errorstack之后,复现错误,分析报错的trace文件。

可以找到报错的语句为:

*** 2023-01-21 22:30:04.803
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3, mask=0x0)
----- Error Stack Dump -----
ORA-12805: parallel query server died unexpectedly
----- Current SQL Statement for this session (sql_id=dm52tctzjbpak) -----
create unique index mesmgr.CWIPASYINV_PKP on MESMGR.CWIPASYINV (FACTORY, LOT_ID, INV_LOT_ID, BOARD_SIDES) tablespace SMT_TS_IDX parallel 8
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+41        call     kgdsdst()            000000000 ? 000000000 ?
7FFDD43ED2E0 ? 7FFDD43ED3B8 ?
7FFDD43F1E60 ? 000000002 ?
ksedst1()+103        call     skdstdst()           000000000 ? 000000000 ?
7FFDD43ED2E0 ? 7FFDD43ED3B8 ?
7FFDD43F1E60 ? 000000002 ?
ksedst()+39          call     ksedst1()            000000000 ? 000000001 ?
7FFDD43ED2E0 ? 7FFDD43ED3B8 ?
7FFDD43F1E60 ? 000000002 ?
dbkedDefDump()+2746  call     ksedst()             000000000 ? 000000001 ?
7FFDD43ED2E0 ? 7FFDD43ED3B8 ?
7FFDD43F1E60 ? 000000002 ?
ksedmp()+41          call     dbkedDefDump()       000000003 ? 000000000 ?
7FFDD43ED2E0 ? 7FFDD43ED3B8 ?
7FFDD43F1E60 ? 000000002 ?
dbkdaKsdActDriver()  call     ksedmp()             000000003 ? 000000000 ?
+1960                                              7FFDD43ED2E0 ? 7FFDD43ED3B8 ?
7FFDD43F1E60 ? 000000002 ?
dbgdaExecuteAction(  call     dbkdaKsdActDriver()  7F351CF04730 ? 7FFDD43F4560 ?
)+1065                                             7FFDD43ED2E0 ? 7FFDD43ED3B8 ?
7FFDD43F1E60 ? 000000002 ?
dbgdaRunAction()+81  call     dbgdaExecuteAction(  7F351CF04730 ? 00A80CE00 ?
5                             )                    0020C0003 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
dbgdRunActions()+64  call     dbgdaRunAction()     7F351CF04730 ? 000000005 ?
0020C0003 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
dbgdProcessEventAct  call     dbgdRunActions()     7F351CF04730 ? 000000005 ?
ions()+656                                         0020C0003 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
dbgdChkEventKgErr()  call     dbgdProcessEventAct  7F351CF04730 ? 00C35D8C0 ?
+1658                         ions()               7F3519B005A0 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
dbkdChkEventRdbmsEr  call     dbgdChkEventKgErr()  7F351CF04730 ? 00C35D8C0 ?
r()+61                                             7F3519B005A0 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
ksfpec()+66          call     dbkdChkEventRdbmsEr  7F351CF04730 ? 00C35D8C0 ?
r()                  7F3519B005A0 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
dbgePostErrorKGE()+  call     ksfpec()             7F351CF04730 ? 00C35D8C0 ?
1137                                               7F3519B005A0 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
dbkePostKGE_kgsf()+  call     dbgePostErrorKGE()   00C35D8C0 ? 7F351CF0D9F0 ?
71                                                 000003205 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
kgeselv()+276        call     dbkePostKGE_kgsf()   00C35D8C0 ? 7F351CF0D9F0 ?
000003205 ? 7FFDD43F4560 ?
000000001 ? 000000002 ?
ksesecl0()+162       call     kgeselv()            00C35D8C0 ? 7F351CF0D9F0 ?
000003205 ? 7FFDD43F4560 ?
000000001 ? 000000000 ?
ksucin()+147         call     ksesecl0()           00C35D8C0 ? 7F351CF0D9F0 ?
000003205 ? 000000000 ?
0000000B0 ? 000000C51 ?
kxfpqidqr()+14184    call     ksucin()             00C35D8C0 ? 7F351CF0D9F0 ?
000003205 ? 000000000 ?
0000000B0 ? 000000C51 ?
kxfpqdqr()+367       call     kxfpqidqr()          00C35D8C0 ? 7F351CF0D9F0 ?
000003205 ? 000000000 ?
0000000B0 ? 000000C51 ?
kxfxgs()+82          call     kxfpqdqr()           138FFD8120 ? 000000000 ?
00000001B ? 000000000 ?
0000000B0 ? 000000C51 ?
kxfxcw()+281         call     kxfxgs()             000000000 ? 000000000 ?
7FFDD43F6060 ? 000000000 ?
0000000B0 ? 000000C51 ?
qerpxFetch()+2368    call     kxfxcw()             7F3519B12B20 ? 7FFDD43F6060 ?
7FFDD43F6058 ? 000000000 ?
000000000 ? 7FFD00000001 ?
kdicrws()+2005       call     qerpxFetch()         7F3519B12B20 ? 7FFDD43F6060 ?
7FFDD43F6058 ? 000000000 ?
000000000 ? 7FFD00000001 ?
kdicdrv()+492        call     kdicrws()            7F3519B12B20 ? 7FFDD43F6060 ?
7FFDD43F6058 ? 000000000 ?
000000000 ? 7FFD00000001 ?
opiexe()+22672       call     kdicdrv()            12CCEF1088 ? 7FFDD43F6060 ?
7FFDD43F84A0 ? 000000000 ?
000000000 ? 7FFD00000001 ?
opiosq0()+4310       call     opiexe()             000000004 ? 000000000 ?
7FFDD43F84A0 ? 000000000 ?
000000000 ? 7FFD00000001 ?
kpooprx()+274        call     opiosq0()            000000003 ? 00000000E ?
7FFDD43F8EF0 ? 0000000A4 ?
000000000 ? 7FFD00000001 ?
kpoal8()+842         call     kpooprx()            7FFDD43FC6D4 ? 7FFDD43FA690 ?
00000008A ? 000000001 ?
000000000 ? 7FFD00000001 ?
opiodr()+915         call     kpoal8()             00000005E ? 7FFDD43FA690 ?
00000008A ? 000000001 ?
000000000 ? 7FFD00000001 ?
ttcpip()+2773        call     opiodr()             00000005E ? 00000001C ?
7FFDD43FC6D0 ? 000000001 ?
000000000 ? 7FFD00000001 ?
opitsk()+1705        call     ttcpip()             00C37B510 ? 009B99F90 ?
7FFDD43FC6D0 ? 000000000 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
opiino()+969         call     opitsk()             00C37B518 ? 000000000 ?
7FFDD43FC6D0 ? 000000000 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
opiodr()+915         call     opiino()             00000003C ? 000000004 ?
7FFDD43FDEC8 ? 000000000 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
opidrv()+570         call     opiodr()             00000003C ? 000000004 ?
7FFDD43FDEC8 ? 000000000 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
sou2o()+103          call     opidrv()             00000003C ? 000000004 ?
7FFDD43FDEC8 ? 000000000 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
opimai_real()+133    call     sou2o()              7FFDD43FDEA0 ? 00000003C ?
000000004 ? 7FFDD43FDEC8 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
ssthrdmain()+265     call     opimai_real()        000000002 ? 7FFDD43FE090 ?
000000004 ? 7FFDD43FDEC8 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
main()+201           call     ssthrdmain()         000000002 ? 7FFDD43FE090 ?
000000001 ? 000000000 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
__libc_start_main()  call     main()               000000002 ? 7FFDD43FE238 ?
+245                                               000000001 ? 000000000 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?
_start()+41          call     __libc_start_main()  000A9F990 ? 000000002 ?
7FFDD43FE228 ? 000000000 ?
7FFDD43FC128 ? 7FFDD43FC6CC ?

继续分析qc进程的trace文件还可以找到比较关键的信息:

2023-01-21 22:30:06.043945 :PX_Messaging:kxfp.c@3324:kxfpqsod(begin):
q=0x138ffd8120 qser=857089 action=0x2 flg=0x4402
2023-01-21 22:30:06.043965 :PX_Messaging:kxfp.c@2921:kxfpqsod_qc_sod(): 	q=0x138ffd8120 qser=857089 qinc=0 action=0x2 flg=0x4402
Process received 12805 signal due to P019, instance dm03dbadm01.adc.com:SMTDB1 (1)

qc进程报错ORA-12805的原因是因为px slave进程P019报错了。

继续分析P019进程的trace,发现报错为28

*** 2023-01-21 22:30:04.733
2023-01-21 22:30:04.733287 :PX_Control:kxfx.c@7901:kxfxmai(): In signal-handler after catching 28
2023-01-21 22:30:04.733329 :PX_Control:kxfx.c@4404:kxfxsuf(): kxfx slave finish
kxfxsSmtClose                                                  [  317370/317280]
stmt:0x7fc04bc51908
UGA: InUse:959264KB Alloc:959264KB
PGA: InUse:2190KB Alloc:2375KB
Call 0x7fc04bc51258 popped
2023-01-21 22:30:04.803193 :PX_Messaging:kxfp.c@12411:kxfprdp():
Exiting parallel client parallel query execution(4) on error=28
[oracle@rac1 ~]$ oerr ora 28
00028, 00000, "your session has been killed"
// *Cause:  A privileged user has killed your session and you are no longer
//          logged on to the database.
// *Action: Login again if you wish to continue working.

28的意思是session被kill了,难道px slave被kill了?

疑似Bug 23717927 : ORA-12805: PARALLEL QUERY SERVER DIED UNEXPECTEDLY

Receive a 12805 signal from a slave process. eg. Slave get err=28 and then
send 12805 to QC

 

 


评论

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注