本案例来自南区某客户,环境为exadata,数据库版本rac 11.2.0.4。在并行创建索引时,大概执行5分钟之后就会报错ORA-12805,据现场人员描述,无论并行度指定为多少,都会报错。
mos中有一篇文档How to Diagnose ORA-12805 Parallel Query Server Died Unexpectedly Error (Doc ID 1348002.1) ,详细记录了ORA-12805的诊断方法。
alter session set MAX_DUMP_FILE_SIZE= unlimited; alter session set TRACEFILE_IDENTIFIER = 'ORA12805' alter session set "_px_trace"=medium, execution, medium, messaging, time; alter system set events = '12805 trace name ERRORSTACK level 3';
设置px trace和12805的errorstack之后,复现错误,分析报错的trace文件。
可以找到报错的语句为:
*** 2023-01-21 22:30:04.803 dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3, mask=0x0) ----- Error Stack Dump ----- ORA-12805: parallel query server died unexpectedly ----- Current SQL Statement for this session (sql_id=dm52tctzjbpak) ----- create unique index mesmgr.CWIPASYINV_PKP on MESMGR.CWIPASYINV (FACTORY, LOT_ID, INV_LOT_ID, BOARD_SIDES) tablespace SMT_TS_IDX parallel 8 ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- skdstdst()+41 call kgdsdst() 000000000 ? 000000000 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? ksedst1()+103 call skdstdst() 000000000 ? 000000000 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? ksedst()+39 call ksedst1() 000000000 ? 000000001 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? dbkedDefDump()+2746 call ksedst() 000000000 ? 000000001 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? ksedmp()+41 call dbkedDefDump() 000000003 ? 000000000 ? 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? dbkdaKsdActDriver() call ksedmp() 000000003 ? 000000000 ? +1960 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? dbgdaExecuteAction( call dbkdaKsdActDriver() 7F351CF04730 ? 7FFDD43F4560 ? )+1065 7FFDD43ED2E0 ? 7FFDD43ED3B8 ? 7FFDD43F1E60 ? 000000002 ? dbgdaRunAction()+81 call dbgdaExecuteAction( 7F351CF04730 ? 00A80CE00 ? 5 ) 0020C0003 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbgdRunActions()+64 call dbgdaRunAction() 7F351CF04730 ? 000000005 ? 0020C0003 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbgdProcessEventAct call dbgdRunActions() 7F351CF04730 ? 000000005 ? ions()+656 0020C0003 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbgdChkEventKgErr() call dbgdProcessEventAct 7F351CF04730 ? 00C35D8C0 ? +1658 ions() 7F3519B005A0 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbkdChkEventRdbmsEr call dbgdChkEventKgErr() 7F351CF04730 ? 00C35D8C0 ? r()+61 7F3519B005A0 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? ksfpec()+66 call dbkdChkEventRdbmsEr 7F351CF04730 ? 00C35D8C0 ? r() 7F3519B005A0 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbgePostErrorKGE()+ call ksfpec() 7F351CF04730 ? 00C35D8C0 ? 1137 7F3519B005A0 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? dbkePostKGE_kgsf()+ call dbgePostErrorKGE() 00C35D8C0 ? 7F351CF0D9F0 ? 71 000003205 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? kgeselv()+276 call dbkePostKGE_kgsf() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 7FFDD43F4560 ? 000000001 ? 000000002 ? ksesecl0()+162 call kgeselv() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 7FFDD43F4560 ? 000000001 ? 000000000 ? ksucin()+147 call ksesecl0() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 000000000 ? 0000000B0 ? 000000C51 ? kxfpqidqr()+14184 call ksucin() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 000000000 ? 0000000B0 ? 000000C51 ? kxfpqdqr()+367 call kxfpqidqr() 00C35D8C0 ? 7F351CF0D9F0 ? 000003205 ? 000000000 ? 0000000B0 ? 000000C51 ? kxfxgs()+82 call kxfpqdqr() 138FFD8120 ? 000000000 ? 00000001B ? 000000000 ? 0000000B0 ? 000000C51 ? kxfxcw()+281 call kxfxgs() 000000000 ? 000000000 ? 7FFDD43F6060 ? 000000000 ? 0000000B0 ? 000000C51 ? qerpxFetch()+2368 call kxfxcw() 7F3519B12B20 ? 7FFDD43F6060 ? 7FFDD43F6058 ? 000000000 ? 000000000 ? 7FFD00000001 ? kdicrws()+2005 call qerpxFetch() 7F3519B12B20 ? 7FFDD43F6060 ? 7FFDD43F6058 ? 000000000 ? 000000000 ? 7FFD00000001 ? kdicdrv()+492 call kdicrws() 7F3519B12B20 ? 7FFDD43F6060 ? 7FFDD43F6058 ? 000000000 ? 000000000 ? 7FFD00000001 ? opiexe()+22672 call kdicdrv() 12CCEF1088 ? 7FFDD43F6060 ? 7FFDD43F84A0 ? 000000000 ? 000000000 ? 7FFD00000001 ? opiosq0()+4310 call opiexe() 000000004 ? 000000000 ? 7FFDD43F84A0 ? 000000000 ? 000000000 ? 7FFD00000001 ? kpooprx()+274 call opiosq0() 000000003 ? 00000000E ? 7FFDD43F8EF0 ? 0000000A4 ? 000000000 ? 7FFD00000001 ? kpoal8()+842 call kpooprx() 7FFDD43FC6D4 ? 7FFDD43FA690 ? 00000008A ? 000000001 ? 000000000 ? 7FFD00000001 ? opiodr()+915 call kpoal8() 00000005E ? 7FFDD43FA690 ? 00000008A ? 000000001 ? 000000000 ? 7FFD00000001 ? ttcpip()+2773 call opiodr() 00000005E ? 00000001C ? 7FFDD43FC6D0 ? 000000001 ? 000000000 ? 7FFD00000001 ? opitsk()+1705 call ttcpip() 00C37B510 ? 009B99F90 ? 7FFDD43FC6D0 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? opiino()+969 call opitsk() 00C37B518 ? 000000000 ? 7FFDD43FC6D0 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? opiodr()+915 call opiino() 00000003C ? 000000004 ? 7FFDD43FDEC8 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? opidrv()+570 call opiodr() 00000003C ? 000000004 ? 7FFDD43FDEC8 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? sou2o()+103 call opidrv() 00000003C ? 000000004 ? 7FFDD43FDEC8 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? opimai_real()+133 call sou2o() 7FFDD43FDEA0 ? 00000003C ? 000000004 ? 7FFDD43FDEC8 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? ssthrdmain()+265 call opimai_real() 000000002 ? 7FFDD43FE090 ? 000000004 ? 7FFDD43FDEC8 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? main()+201 call ssthrdmain() 000000002 ? 7FFDD43FE090 ? 000000001 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? __libc_start_main() call main() 000000002 ? 7FFDD43FE238 ? +245 000000001 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ? _start()+41 call __libc_start_main() 000A9F990 ? 000000002 ? 7FFDD43FE228 ? 000000000 ? 7FFDD43FC128 ? 7FFDD43FC6CC ?
继续分析qc进程的trace文件还可以找到比较关键的信息:
2023-01-21 22:30:06.043945 :PX_Messaging:kxfp.c@3324:kxfpqsod(begin): q=0x138ffd8120 qser=857089 action=0x2 flg=0x4402 2023-01-21 22:30:06.043965 :PX_Messaging:kxfp.c@2921:kxfpqsod_qc_sod(): q=0x138ffd8120 qser=857089 qinc=0 action=0x2 flg=0x4402 Process received 12805 signal due to P019, instance dm03dbadm01.adc.com:SMTDB1 (1)
qc进程报错ORA-12805的原因是因为px slave进程P019报错了。
继续分析P019进程的trace,发现报错为28
*** 2023-01-21 22:30:04.733 2023-01-21 22:30:04.733287 :PX_Control:kxfx.c@7901:kxfxmai(): In signal-handler after catching 28 2023-01-21 22:30:04.733329 :PX_Control:kxfx.c@4404:kxfxsuf(): kxfx slave finish kxfxsSmtClose [ 317370/317280] stmt:0x7fc04bc51908 UGA: InUse:959264KB Alloc:959264KB PGA: InUse:2190KB Alloc:2375KB Call 0x7fc04bc51258 popped 2023-01-21 22:30:04.803193 :PX_Messaging:kxfp.c@12411:kxfprdp(): Exiting parallel client parallel query execution(4) on error=28
[oracle@rac1 ~]$ oerr ora 28 00028, 00000, "your session has been killed" // *Cause: A privileged user has killed your session and you are no longer // logged on to the database. // *Action: Login again if you wish to continue working.
28的意思是session被kill了,难道px slave被kill了?
疑似Bug 23717927 : ORA-12805: PARALLEL QUERY SERVER DIED UNEXPECTEDLY
Receive a 12805 signal from a slave process. eg. Slave get err=28 and then send 12805 to QC
发表回复