Improper Linux Semaphore Settings Causing High Sys CPU% (Oracle 19c)

Recently a customer environment showed abnormal behavior: Linux sys% CPU consumption was excessive, exceeding 30% at peak and even surpassing usr%.

09:52:31:130[root@dbxxxx12 ~]# dstat -cldsnmy
09:52:31:327----total-cpu-usage---- ---load-avg--- -dsk/total- ----swap--- -net/total- ------memory-usage----- ---system--
09:52:31:328usr sys idl wai hiq siq| 1m   5m  15m | read  writ| used  free| recv  send| used  buff  cach  free| int   csw
09:52:32:331  4   3  93   0   0   0| 122  121  114|  92M   52M|  57M   16G|   0     0 | 668G  865M  196G  142G| 103k  126k
09:52:33:332 42  22  35   0   0   1| 122  121  114| 307M   39M|  57M   16G| 176M  237M| 668G  865M  196G  143G| 510k  211k
09:52:34:327 42  25  32   0   0   1| 129  122  114| 269M   39M|  57M   16G| 186M  230M| 666G  865M  196G  144G| 511k  205k
09:52:35:331 44  23  32   0   0   1| 129  122  114| 218M   73M|  57M   16G| 198M  231M| 666G  865M  196G  144G| 536k  226k
09:52:36:011 41  19  39   0   0   1| 129  122  114| 243M   74M|  57M   16G| 191M  245M| 666G  865M  196G  144G| 513k  223k
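
For readers who want to reproduce the usr/sys split without dstat, here is a minimal sketch that derives the percentages from two samples of the aggregate cpu line in /proc/stat. The function names are made up for illustration, and folding nice time into usr is a choice of this sketch, not a claim about how dstat computes its column. The two sample lines are synthetic values.

```python
def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line from /proc/stat (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return line
    raise RuntimeError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Compute usr/sys/idl percentages between two aggregate 'cpu' lines."""
    # Fields after the label: user nice system idle iowait irq softirq steal.
    # Guest time is already included in user, so it is left out of the total.
    b = [int(x) for x in before.split()[1:9]]
    a = [int(x) for x in after.split()[1:9]]
    delta = [y - x for x, y in zip(b, a)]
    total = sum(delta)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {
        "usr": 100.0 * (delta[0] + delta[1]) / total,  # user + nice (sketch's choice)
        "sys": 100.0 * delta[2] / total,
        "idl": 100.0 * delta[3] / total,
    }


# Synthetic samples; on a live system take two read_cpu_line() calls one second apart.
sample_before = "cpu  100 0 50 850 0 0 0 0 0 0"
sample_after = "cpu  520 0 270 1210 0 0 0 0 0 0"
result = cpu_percentages(sample_before, sample_after)
print(result)
```

On a healthy database host sys is normally well below usr, which is why the ratio matters more than either number on its own.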

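Even before the morning peak arrives, sys reaches about 25%, which is abnormal. During the follow-up analysis, repeated top captures showed a large number of scmn processes consuming a great deal of CPU: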

10:38:46:790top - 10:38:46 up 106 days, 16:30, 14 users,  load average: 135.09, 132.80, 118.06
10:38:46:790Tasks: 9641 total,  96 running, 9541 sleeping,   2 stopped,   2 zombie
10:38:46:791%Cpu(s): 18.4 us, 33.2 sy,  0.0 ni, 47.1 id,  0.4 wa,  0.0 hi,  0.8 si,  0.0 st
10:38:46:791KiB Mem : 10561102+total, 15898280+free, 68324761+used, 21387985+buff/cache
10:38:46:791KiB Swap: 16777212 total, 16718588 free,    58624 used. 34448185+avail Mem
10:38:46:791
10:38:46:792   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10:38:46:792 63161 grid      20   0   13.9g 232460  25268 S 244.1  0.0   2235:19 java
10:38:46:793204947 oracle    20   0  511.8g 146492  85788 S 104.6  0.0 795:44.23 ora_scmn_scsbgj
10:38:46:793 42759 oracle    20   0  514.5g  69600  49480 R 100.0  0.0  18:27.64 ora_p012_scsbgj
10:38:46:794 52139 oracle    20   0  514.5g  45440  33828 R 100.0  0.0   9:57.99 oracle_52139_sc
10:38:46:794 50493 oracle    20   0  514.5g  66764  50092 R  99.7  0.0  25:01.16 oracle_50493_sc
10:38:46:794 66749 oracle    20   0  514.6g   1.1g  45736 R  99.7  0.1  56:52.12 oracle_66749_sc
10:38:46:795138507 oracle    20   0  514.5g  65384  46368 R  99.7  0.0   6:11.54 oracle_138507_s
10:38:46:795168126 oracle    20   0  514.5g  45352  33796 R  99.7  0.0   4:45.57 oracle_168126_s
10:38:46:795 42757 oracle    20   0  514.6g  72864  50128 R  99.4  0.0  15:36.45 ora_p011_scsbgj
10:38:46:796110597 oracle    20   0  515.6g   1.1g  45884 R  99.4  0.1  44:27.95 oracle_110597_s
10:38:46:796200607 oracle    20   0  511.5g  59424  44772 R  99.4  0.0  87:29.93 oracle_200607_s
10:38:46:796213135 oracle    20   0  508.6g  82416  58100 R  99.4  0.0 161:10.08 ora_p00a_scsbgj
10:38:46:821213139 oracle    20   0  504.6g  82072  56868 R  99.4  0.0 171:51.14 ora_p00c_scsbgj
10:38:46:822133223 oracle    20   0  514.5g  53288  39532 R  98.8  0.0   0:29.17 oracle_133223_s
10:38:46:822 42755 oracle    20   0  514.6g  74040  49984 R  98.1  0.0  15:12.97 ora_p010_scsbgj
10:38:46:822104726 oracle    20   0  514.5g 101796  47080 R  94.8  0.0   2:12.13 oracle_104726_s
10:38:46:822 14575 oracle    20   0  514.5g  67668  48688 R  94.1  0.0   0:55.08 oracle_14575_sc
10:38:46:822204884 oracle    20   0  511.8g 145732  85564 S  91.0  0.0 798:03.08 ora_scmn_scsbgj
10:38:46:822204841 oracle    20   0  511.8g 145568  85424 S  88.9  0.0 789:03.39 ora_scmn_scsbgj
10:38:46:823149469 oracle    20   0  514.5g  62060  44872 R  88.6  0.0  38:57.63 oracle_149469_s
10:38:46:823204853 oracle    20   0  511.8g 146232  85672 S  88.3  0.0 832:18.07 ora_scmn_scsbgj
10:38:46:823204890 oracle    20   0  511.8g 146084  85848 S  88.0  0.0 799:14.89 ora_scmn_scsbgj
10:38:46:823 91621 oracle    20   0  514.5g  45576  33972 R  87.7  0.0   7:40.24 oracle_91621_sc
10:38:46:823204803 oracle    20   0  511.8g 145720  85660 S  87.0  0.0 802:39.04 ora_scmn_scsbgj
10:38:46:823204799 oracle    20   0  511.8g 145916  85872 S  86.7  0.0 796:50.94 ora_scmn_scsbgj
10:38:46:824204823 oracle    20   0  511.8g 145956  85900 S  86.7  0.0 802:19.08 ora_scmn_scsbgj
10:38:46:824204933 oracle    20   0  511.8g   1.1g   1.1g S  86.4  0.1 807:57.54 ora_scmn_scsbgj
10:38:46:824204914 oracle    20   0  510.8g 146160  85684 S  85.8  0.0 793:21.89 ora_scmn_scsbgj
10:38:46:824204817 oracle    20   0  511.8g 146728  86156 S  85.2  0.0 797:27.14 ora_scmn_scsbgj
10:38:46:824204905 oracle    20   0  511.8g 146144  85848 S  84.9  0.0 795:12.41 ora_scmn_scsbgj
10:38:46:824204968 oracle    20   0  511.8g   1.1g   1.1g S  84.9  0.1 810:06.94 ora_scmn_scsbgj
10:38:46:825157837 oracle    20   0  514.5g  54372  40712 R  83.6  0.0   2:37.20 oracle_157837_s
10:38:46:825204797 oracle    20   0  511.8g 145416  85696 S  83.0  0.0 792:29.84 ora_scmn_scsbgj
10:38:46:825204850 oracle    20   0  511.8g 146036  85804 S  83.0  0.0 796:05.20 ora_scmn_scsbgj
10:38:46:825204801 oracle    20   0  511.8g 146068  85764 S  82.7  0.0 796:12.43 ora_scmn_scsbgj
10:38:46:825204861 oracle    20   0  511.8g 145956  85652 S  82.7  0.0 794:52.26 ora_scmn_scsbgj
10:38:46:825204897 oracle    20   0  511.8g 146024  85976 S  82.7  0.0 821:30.15 ora_scmn_scsbgj
10:38:46:825204828 oracle    20   0  511.8g 146712  86408 S  82.1  0.0 802:29.78 ora_scmn_scsbgj
10:38:46:826204846 oracle    20   0  511.8g 146112  86128 S  81.5  0.0 796:14.10 ora_scmn_scsbgj
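
To see how much of the load belongs to the scmn processes, the %CPU column can be summed per command. The following is a minimal sketch, assuming clean `top -b` style rows with the default 12 columns; the function name is hypothetical, and the three sample rows are taken from the capture above with what appears to be a terminal-log timestamp prefix removed.

```python
from collections import defaultdict


def cpu_by_command(top_lines):
    """Sum the %CPU column per COMMAND for process rows of top output."""
    totals = defaultdict(float)
    for line in top_lines:
        fields = line.split()
        # Process rows: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip headers, summary lines and blanks
        totals[fields[11]] += float(fields[8])
    return dict(totals)


sample = [
    "204947 oracle    20   0  511.8g 146492  85788 S 104.6  0.0 795:44.23 ora_scmn_scsbgj",
    " 42759 oracle    20   0  514.5g  69600  49480 R 100.0  0.0  18:27.64 ora_p012_scsbgj",
    "204884 oracle    20   0  511.8g 145732  85564 S  91.0  0.0 798:03.08 ora_scmn_scsbgj",
]

result = cpu_by_command(sample)
for command, cpu in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{command:20s} {cpu:7.1f}")
```

Run against the full capture, the same grouping makes the ora_scmn_scsbgj share stand out immediately.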

perf top shows the hottest symbols, led by the kernel's native_queued_spin_lock_slowpath:

Overhead  Shared Object       Symbol
  25.59%  [kernel]            [k] native_queued_spin_lock_slowpath
   9.91%  oracle              [.] kcbgtcr
   3.81%  oracle              [.] kaf4reasrp1km
   2.85%  oracle              [.] kaf4reasrp0km
   2.82%  oracle              [.] kdstf110010100000000km
   2.42%  oracle              [.] kcbrls
   1.62%  oracle              [.] qetlbr
   1.16%  [kernel]            [k] _raw_spin_unlock_irqrestore
   1.13%  oracle              [.] kdstf010010100001000km
   1.06%  oracle              [.] kafger
   1.06%  oracle              [.] lnxcpn
   1.06%  oracle              [.] lxkLikeUTF8
   0.93%  oracle              [.] evaopn2
   0.83%  oracle              [.] ktrgcm
   0.80%  oracle              [.] ktrvac
   0.76%  oracle              [.] kjbrfnd
   0.70%  oracle              [.] kcbzar
   0.70%  oracle              [.] qertbFetchByRowID
   0.63%  oracle              [.] __intel_avx_rep_memset
   0.63%  oracle              [.] kcbz_fr_buf
   0.63%  oracle              [.] kdifxs0
   0.63%  oracle              [.] kdstf010010100000000km
   0.63%  oracle              [.] lxsCnvCaseUTF8
   0.60%  [kernel]            [k] __do_softirq
   0.60%  [kernel]            [k] i40e_get_tx_pending
   0.53%  [kernel]            [k] __nf_conntrack_find_get
   0.53%  oracle              [.] evareo
   0.50%  oracle                 [.] kcbz_fr_buf
   0.51%  [kernel]               [k] finish_task_switch

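The scmn process arrived with the Multi-Threaded architecture of processes feature introduced in Oracle 12c. The feature is still disabled by default in 19c; it can be enabled by setting the following parameter to true: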

threaded_execution = true

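Personally, I recommend not using new features of this kind for now. Oracle still ships it disabled by default, which suggests it is not yet stable.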

Finally, the description in High SYS CPU Usage ON LMS Thread (SCMN/CR00/RS01) During High Workload (Doc ID 2707048.1) largely matches the symbols we later captured with perf top.

In the end we adjusted the semaphore settings, changing

kernel.sem =12000 1536000 12000 128
to:
kernel.sem =1024  66666  1024  256
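
The four numbers in kernel.sem are, in order, SEMMSL, SEMMNS, SEMOPM and SEMMNI. As a reading aid, here is a minimal sketch that parses both settings into named fields; the `SemLimits` type and `parse_kernel_sem` function are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple


class SemLimits(NamedTuple):
    semmsl: int  # max semaphores per semaphore set
    semmns: int  # max semaphores system-wide
    semopm: int  # max operations per semop() call
    semmni: int  # max number of semaphore sets


def parse_kernel_sem(text):
    """Parse the four kernel.sem values (same order as /proc/sys/kernel/sem)."""
    values = text.replace("kernel.sem", "").replace("=", " ").split()
    if len(values) != 4:
        raise ValueError(f"expected 4 values, got {len(values)}")
    return SemLimits(*(int(v) for v in values))


before = parse_kernel_sem("kernel.sem =12000 1536000 12000 128")
after = parse_kernel_sem("kernel.sem =1024  66666  1024  256")

for label, limits in (("before", before), ("after", after)):
    print(label, limits)
```

In the original setting SEMMNS equals SEMMSL × SEMMNI (12000 × 128 = 1536000); the adjusted setting shrinks the per-set and system-wide limits while doubling the number of sets.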
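So far the problem appears to exist only in 18c and later. At least, we have not seen it in the 12.2 environments of our existing customers, which are under similarly heavy load and have very high processes settings (all above 5000 per node).
Summary:
1. On 18c+, and especially on 19c which most people now run, pay attention to the semaphore settings. Bigger is not better; enough is enough. Otherwise you may hit the high sys% problem.
2. Although 19c does not enable the multi-threaded process feature by default, some processes still use multiple threads, and my guess is that this is the key trigger. I am 99% sure a bug is behind it.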
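
One way to judge what "enough" means is to compare the semaphores actually allocated against the configured limits. The sketch below sums the nsems column of /proc/sysvipc/sem; the function name is hypothetical, the column layout is my assumption about that file, and the sample rows are made-up values rather than data from the customer system.

```python
def semaphore_usage(sysvipc_text):
    """Return (number of sets, total semaphores) from /proc/sysvipc/sem content."""
    lines = sysvipc_text.strip().splitlines()
    if not lines:
        return 0, 0
    nsems_col = lines[0].split().index("nsems")  # locate the column by header name
    rows = [line.split() for line in lines[1:] if line.strip()]
    return len(rows), sum(int(row[nsems_col]) for row in rows)


# Illustrative sample only; on a live system read the text from /proc/sysvipc/sem.
sample = """\
       key      semid perms      nsems   uid   gid  cuid  cgid      otime      ctime
1234567890          3   640        250  1001  1001  1001  1001 1700000000 1700000000
1234567891          4   640        250  1001  1001  1001  1001 1700000000 1700000000
"""

sets, total = semaphore_usage(sample)
semmns, semmni = 66666, 256  # limits from the adjusted kernel.sem in this article
print(f"sets: {sets}/{semmni}, semaphores: {total}/{semmns}")
```

If the totals sit far below the limits at peak load, the limits have headroom and there is little reason to raise them further.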
