IB Driver Bug Causes an Oracle Cluster Host to Reboot

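During an ifdown ib2 test in a customer's distributed storage environment (the Oracle RAC cluster has two heartbeat interfaces, ib0 and ib2), the database host crashed and rebooted immediately. Let's start with the ocssd log: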

2020-04-23 18:43:52.380987 :    CSSD:29906688: clssgmpcMemberDataUpdt: grockName HB+ASM memberID 9:2:2, datatype 1 datasize 4
2020-04-23 18:43:52.381176 :    CSSD:23856896: clssgmcpDataUpdtCmpl: Status 0 mbr data updt memberID 9:2:2 from clientID 2:96:4
Trace file /u01/app/12.1/diag/crs/mpbdb2/crs/trace/ocssd.trc
Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.
2020-04-23 18:53:14.996300 :    CSSD:654306816: (TLM) Starting CSS daemon, version 12.1.0.2.0 with uniqueness value 1587639194
2020-04-23 18:53:14.996320 :    CSSD:654306816: clsu_load_ENV_levels: Module = CSSD, LogLevel = 2, TraceLevel = 0
2020-04-23 18:53:14.996329 :    CSSD:654306816: clsu_load_ENV_levels: Module = CSSDNMC, LogLevel = 2, TraceLevel = 0

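The log stops abruptly at 18:43:52, and the next entry is the CSS daemon starting again at 18:53:14, so the host went down at 18:43:52. Because we had previously enabled kdump on these systems, I copied the customer's vmcore file to a local machine for a quick analysis.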
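The size of the gap in the ocssd log can be checked with a few lines of Python. This is only a minimal sketch: the two timestamps are copied from the log excerpt above, and the function name log_gap_seconds is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the ocssd log excerpt above.
LAST_ENTRY_BEFORE_CRASH = "2020-04-23 18:43:52.381176"
FIRST_ENTRY_AFTER_REBOOT = "2020-04-23 18:53:14.996300"

# ocssd prints microsecond-resolution timestamps in this layout.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def log_gap_seconds(before: str, after: str) -> float:
    """Return the number of seconds between two ocssd-style timestamps."""
    delta = datetime.strptime(after, FMT) - datetime.strptime(before, FMT)
    return delta.total_seconds()


gap = log_gap_seconds(LAST_ENTRY_BEFORE_CRASH, FIRST_ENTRY_AFTER_REBOOT)
print(f"ocssd was silent for {gap:.1f} s (about {gap / 60:.1f} min)")
```

A silence of several minutes with no shutdown messages before it is what one would expect from a kernel panic followed by a kdump capture and a reboot, rather than from a clean clusterware eviction.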

[root@localhost tmp]# crash /usr/lib/debug/lib/modules/2.6.32-642.el6.x86_64/vmlinux vmcore

crash 7.1.4-1.0.1.el6_7
Copyright (C) 2002-2015  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel version inconsistency between vmlinux and dumpfile

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 56
        DATE: Thu Apr 23 18:43:52 2020
      UPTIME: 2 days, 00:31:16
LOAD AVERAGE: 2.44, 2.45, 2.82
       TASKS: 2884
    NODENAME: mpbdb2
     RELEASE: 2.6.32-642.el6.x86_64
     VERSION: #1 SMP Wed Apr 13 00:51:26 EDT 2016
     MACHINE: x86_64  (2593 Mhz)
      MEMORY: 255.6 GB
       PANIC: "kernel BUG at mm/slab.c:524!"
         PID: 47680
     COMMAND: "ip"
        TASK: ffff881fb58a4ab0  [THREAD_INFO: ffff8810dad2c000]
         CPU: 26
       STATE: TASK_RUNNING (PANIC)
crash> files 47680
PID: 47680  TASK: ffff881fb58a4ab0  CPU: 26  COMMAND: "ip"
ROOT: /    CWD: /etc/sysconfig/network-scripts
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff884035ef61c0 ffff881fa4c509c0 ffff881fbbae6108 CHR  /dev/pts/0
  1 ffff884035ef61c0 ffff881fa4c509c0 ffff881fbbae6108 CHR  /dev/pts/0
  2 ffff884050350d80 ffff8820535d9e00 ffff884053150a38 CHR  /dev/null
  3 ffff881983e48d80 ffff8813fc2029c0 ffff88167b81fbc8 SOCK

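The panicking task is an ip process whose working directory is /etc/sysconfig/network-scripts, which is consistent with the ifdown test. Next, look at the backtrace: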

crash> bt
PID: 47680  TASK: ffff881fb58a4ab0  CPU: 26  COMMAND: "ip"
 #0 [ffff8810dad2f1f0] machine_kexec at ffffffff8103fdcb
 #1 [ffff8810dad2f250] crash_kexec at ffffffff810d1fe2
 #2 [ffff8810dad2f320] oops_end at ffffffff8154bc40
 #3 [ffff8810dad2f350] die at ffffffff8101102b
 #4 [ffff8810dad2f380] do_trap at ffffffff8154b494
 #5 [ffff8810dad2f3e0] do_invalid_op at ffffffff8100cd95
 #6 [ffff8810dad2f480] invalid_op at ffffffff8100c01b
    [exception RIP: kfree+668]                          +++++ the exception RIP is the instruction that triggered the fault
    RIP: ffffffff81181b1c  RSP: ffff8810dad2f538  RFLAGS: 00010046
    RAX: ffffea003bea99f0  RBX: ffff88111e752000  RCX: ffff88111e752000
    RDX: 0040000000080000  RSI: 0000000000000046  RDI: ffff88111e752000
    RBP: ffff8810dad2f598   R8: 0000000000000001   R9: ffff8800000bda00
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffffff8146b528
    R13: 0000000000000286  R14: 0000000000000005  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff8810dad2f5a0] skb_release_data at ffffffff8146b528
 #8 [ffff8810dad2f5c0] __kfree_skb at ffffffff8146b05e
 #9 [ffff8810dad2f5e0] consume_skb at ffffffff8146b11b
#10 [ffff8810dad2f600] dev_kfree_skb_any at ffffffff81478e9d
#11 [ffff8810dad2f650] ipoib_ib_dev_stop at ffffffffa060a074 [ib_ipoib]
#12 [ffff8810dad2f670] ipoib_stop at ffffffffa0604c75 [ib_ipoib]
#13 [ffff8810dad2f6a0] dev_close_many at ffffffff81479f15
#14 [ffff8810dad2f6e0] dev_close at ffffffff8147a471
#15 [ffff8810dad2f710] dev_change_flags at ffffffff814794dc
#16 [ffff8810dad2f750] do_setlink at ffffffff81488ca7
#17 [ffff8810dad2f7f0] rtnl_newlink at ffffffff8148a55e
#18 [ffff8810dad2fa00] rtnetlink_rcv_msg at ffffffff81489d77
#19 [ffff8810dad2fa70] netlink_rcv_skb at ffffffff814a6389
#20 [ffff8810dad2faa0] rtnetlink_rcv at ffffffff81489e35
#21 [ffff8810dad2fac0] netlink_unicast at ffffffff814a5faf
#22 [ffff8810dad2fb20] netlink_sendmsg at ffffffff814a6a13
#23 [ffff8810dad2fbb0] sock_sendmsg at ffffffff814634b3
#24 [ffff8810dad2fd60] __sys_sendmsg at ffffffff81464c96
#25 [ffff8810dad2ff10] sys_sendmsg at ffffffff81464eb9
#26 [ffff8810dad2ff80] system_call_fastpath at ffffffff8100b0d2
    RIP: 00000039c9ce9a30  RSP: 00007ffcf739d9c0  RFLAGS: 00010246
    RAX: 000000000000002e  RBX: ffffffff8100b0d2  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 00007ffcf739d9e0  RDI: 0000000000000003
    RBP: 00007ffcf739d9e0   R8: 0000000000000000   R9: 0000000000000000
    R10: 000000000063ad90  R11: 0000000000000246  R12: 0000000000000003
    R13: 00007ffcf73a6270  R14: 0000000000637900  R15: 0000000000000003
    ORIG_RAX: 000000000000002e  CS: 0033  SS: 002b
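To see which loadable module handed the buffer to the core networking code, the frames of the backtrace above can be parsed mechanically. The following is a rough sketch, not a general crash parser: it only handles the "#N [sp] func at addr [module]" frame lines shown above, and the helper names parse_frames and first_module_frame are invented for this example.

```python
import re

# A few frames copied from the "bt" output above.
BT_EXCERPT = """\
 #7 [ffff8810dad2f5a0] skb_release_data at ffffffff8146b528
 #8 [ffff8810dad2f5c0] __kfree_skb at ffffffff8146b05e
 #9 [ffff8810dad2f5e0] consume_skb at ffffffff8146b11b
#10 [ffff8810dad2f600] dev_kfree_skb_any at ffffffff81478e9d
#11 [ffff8810dad2f650] ipoib_ib_dev_stop at ffffffffa060a074 [ib_ipoib]
#12 [ffff8810dad2f670] ipoib_stop at ffffffffa0604c75 [ib_ipoib]
#13 [ffff8810dad2f6a0] dev_close_many at ffffffff81479f15
"""

# Frame line layout: "#N [stack pointer] function at address [module]".
# The trailing "[module]" is only present for code in a loadable module.
FRAME_RE = re.compile(
    r"^\s*#(?P<num>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)\s+at\s+"
    r"(?P<addr>[0-9a-f]+)(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)


def parse_frames(text):
    """Return one dict per backtrace frame line found in text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.groupdict())
    return frames


def first_module_frame(frames):
    """Return the innermost frame that belongs to a loadable module."""
    for frame in frames:
        if frame["module"]:
            return frame
    return None


frames = parse_frames(BT_EXCERPT)
caller = first_module_frame(frames)
print(f"innermost module frame: #{caller['num']} {caller['func']} [{caller['module']}]")
```

Everything between that frame and kfree is generic kernel code, so the innermost module frame is a natural place to start looking for the owner of the buffer.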

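The backtrace shows that the failure happened in kfree, while a slab object was being freed. We can use the dis command of the crash tool to find the exact source file and line of the faulting code: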

crash> dis -l ffffffff81181b1c
/usr/src/debug/kernel-2.6.32-642.el6/linux-2.6.32-642.el6.x86_64/mm/slab.c: 524
0xffffffff81181b1c <kfree+668>: ud2
crash>
crash> dis -l ffffffff8146b528
/usr/src/debug/kernel-2.6.32-642.el6/linux-2.6.32-642.el6.x86_64/net/core/skbuff.c: 424
0xffffffff8146b528 <skb_release_data+216>:      pop    %rbx
crash>
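The symbol+offset notation used by crash is plain address arithmetic, which the short sketch below reproduces with the values from the output above. The helper name symbol_base is made up for this example, and the computed start addresses are derived values that could be cross-checked with the sym command of crash.

```python
# Values copied from the crash output above.
EXCEPTION_RIP = 0xFFFFFFFF81181B1C  # reported as kfree+668
RETURN_ADDR = 0xFFFFFFFF8146B528    # frame #7, reported as skb_release_data+216
R12_AT_FAULT = 0xFFFFFFFF8146B528   # R12 in the exception register dump


def symbol_base(address: int, offset: int) -> int:
    """Return the start address of a function given an address inside it
    and the decimal offset that crash prints after the '+' sign."""
    return address - offset


kfree_base = symbol_base(EXCEPTION_RIP, 668)
skb_release_data_base = symbol_base(RETURN_ADDR, 216)

print(f"kfree starts at            {kfree_base:#x}")
print(f"skb_release_data starts at {skb_release_data_base:#x}")
print(f"R12 equals the frame #7 return address: {R12_AT_FAULT == RETURN_ADDR}")
```

Note that the address in frame #7 is a return address, that is, the instruction after the call to kfree. That is why dis -l maps it to line 424 (the closing brace of skb_release_data) and not to the kfree call itself.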

Next, look at line 524 of mm/slab.c and line 424 of net/core/skbuff.c to see whether they match the analysis:

    511 /*
    512  * Functions for storing/retrieving the cachep and or slab from the page
    513  * allocator.  These are used to find the slab an obj belongs to.  With kfree(),
    514  * these are used to find the cache which an obj belongs to.
    515  */
    516 static inline void page_set_cache(struct page *page, struct kmem_cache *cache)
    517 {
    518         page->lru.next = (struct list_head *)cache;
    519 }
    520
    521 static inline struct kmem_cache *page_get_cache(struct page *page)
    522 {
    523         page = compound_head(page);
    524         BUG_ON(!PageSlab(page));
    525         return (struct kmem_cache *)page->lru.next;
    526 }
    527
    528 static inline void page_set_slab(struct page *page, struct slab *slab)
    529 {
    530         page->lru.prev = (struct list_head *)slab;
    531 }
    532
    533 static inline struct slab *page_get_slab(struct page *page)
    534 {
    535         BUG_ON(!PageSlab(page));
    536         return (struct slab *)page->lru.prev;
    537 }

 

    396 static void skb_release_data(struct sk_buff *skb)
    397 {
    398         if (!skb->cloned ||
    399             !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
    400                                &skb_shinfo(skb)->dataref)) {
    401                 if (skb_shinfo(skb)->nr_frags) {
    402                         int i;
    403                         for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
    404                                 put_page(skb_shinfo(skb)->frags[i].page);
    405                 }
    406
    407                 /*
    408                  * If skb buf is from userspace, we need to notify the caller
    409                  * the lower device DMA has done;
    410                  */
    411                 if (skb_tx(skb)->dev_zerocopy) {
    412                         struct ubuf_info *uarg;
    413
    414                         uarg = skb_shinfo(skb)->destructor_arg;
    415                         if (uarg->callback)
    416                                 uarg->callback(uarg);
    417                 }
    418
    419                 if (skb_has_frag_list(skb))
    420                         skb_drop_fraglist(skb);
    421
    422                 kfree(skb->head);
    423         }
    424 }
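The registers in the exception frame can be related to the BUG_ON(!PageSlab(page)) check with some address arithmetic. The sketch below rests on several assumptions that the article does not state: the usual x86_64 layout for this kernel generation (direct mapping at 0xffff880000000000, vmemmap at 0xffffea0000000000, 4 KiB pages), a struct page size of 56 bytes, PG_slab being bit 7 of page->flags, and RDX holding page->flags at the time of the trap. The last point in particular has not been confirmed by disassembly.

```python
# Register values copied from the exception frame in the backtrace above.
RBX = 0xFFFF88111E752000  # same value as RDI and RCX: the pointer given to kfree()
RAX = 0xFFFFEA003BEA99F0
RDX = 0x0040000000080000

# ASSUMPTIONS (not taken from the article, see the paragraph above).
PAGE_OFFSET = 0xFFFF880000000000   # start of the kernel direct mapping
VMEMMAP_BASE = 0xFFFFEA0000000000  # start of the struct page array
PAGE_SHIFT = 12                    # 4 KiB pages
SIZEOF_STRUCT_PAGE = 56            # assumed sizeof(struct page)
PG_SLAB_BIT = 7                    # assumed bit number of PG_slab


def virt_to_page(vaddr: int) -> int:
    """Mimic the kernel's virt_to_page() for a direct-mapped address."""
    pfn = (vaddr - PAGE_OFFSET) >> PAGE_SHIFT
    return VMEMMAP_BASE + pfn * SIZEOF_STRUCT_PAGE


def page_flag_set(flags: int, bit: int) -> bool:
    """Return True if the given bit is set in a page->flags value."""
    return bool((flags >> bit) & 1)


page = virt_to_page(RBX)
print(f"struct page for {RBX:#x}: {page:#x}")
print(f"matches RAX: {page == RAX}")
print(f"PG_slab set in RDX: {page_flag_set(RDX, PG_SLAB_BIT)}")
```

Under these assumptions RAX is exactly the struct page of the pointer that was passed to kfree, and the assumed flags value does not have the slab bit set, which is the condition that BUG_ON(!PageSlab(page)) traps on.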

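As shown, skb_release_data calls kfree(skb->head) on line 422, and that call is what ran into the BUG_ON(!PageSlab(page)) check at slab.c:524: the page that the freed pointer maps to was not marked as a slab page. Since the skb was being freed by ipoib_ib_dev_stop in the ib_ipoib module, the preliminary suspicion is a bug in the IB driver. How can we look at the IB source code? First, check the IB driver version in this environment: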

Apr 20 18:58:54 mpbdb2 kernel: Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git b4fdfac
Apr 20 18:58:54 mpbdb2 kernel: compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: Mellanox ConnectX core driver v4.5-1.0.1
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: Initializing 0000:03:00.0
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core 0000:03:00.0: PCI INT A -> GSI 34 (level, low) -> IRQ 34
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: device is working in RoCE mode: Roce V1
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: UD QP Gid type is: V1
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core 0000:03:00.0: DMFS high rate steer mode is: default performance
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: Initializing 0000:04:00.0
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core 0000:04:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: device is working in RoCE mode: Roce V1
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: UD QP Gid type is: V1
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core 0000:04:00.0: DMFS high rate steer mode is: default performance
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core 0000:04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)

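The dmesg log shows that the driver version is 4.5 (mlx4_core v4.5-1.0.1). The code of the related functions in the 4.x series can be browsed at https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/mellanox/mlx4/en_tx.c#L1077 for reference.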
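When several hosts have to be compared, the driver version can be extracted from the kernel log with a small script. This is a minimal sketch that only understands the "Mellanox ConnectX core driver v..." banner line quoted above. The function name driver_versions is hypothetical.

```python
import re

# Lines copied from the kernel log excerpt above.
DMESG_EXCERPT = """\
Apr 20 18:58:54 mpbdb2 kernel: Backport based on mlnx_ofed/mlnx-ofa_kernel-4.0.git b4fdfac
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: Mellanox ConnectX core driver v4.5-1.0.1
Apr 20 18:58:54 mpbdb2 kernel: mlx4_core: Initializing 0000:03:00.0
"""

# Matches "<module>: Mellanox ConnectX core driver v<version>".
VERSION_RE = re.compile(
    r"(?P<module>\w+): Mellanox ConnectX core driver v(?P<version>[\w.\-]+)"
)


def driver_versions(text: str) -> dict:
    """Map each module name to the version printed in its banner line."""
    return {m["module"]: m["version"] for m in VERSION_RE.finditer(text)}


versions = driver_versions(DMESG_EXCERPT)
for module, version in versions.items():
    print(f"{module}: {version} (release series {version.split('-')[0]})")
```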

 

