Using the ILOM for troubleshooting on ODA

Posted in: Technical Track

I worked on root cause analysis for a strange node reboot on client’s Oracle Database Appliance yesterday. The case was quite interesting from the perspective that none of the logs contained any information related to the cause of the reboot. I could only see the log entries for normal activities and then – BOOM! – the start-up sequence! It looked like someone just power cycled the node. I also observed the heartbeat timeouts followed by the node eviction on the remaining node. There was still one place I hadn’t checked and it revealed the cause of the issue.

One of the cool things about ODA is it’s service processor (SP) called Integrated Lights Out Manager (ILOM), which allows you to do many things that you’d normally do being physically located in the data center – power cycle the node, change the BIOS settings, choose boot devices, and … (the drum-roll) … see the console outputs from the server node! And it doesn’t only show the current console output but it keeps logging it too. Each ODA server has its own ILOM, so I found out the IP address for the ILOM of the node which failed and connected to it using SSH.

$ ssh [email protected]
Password:
Oracle(R) Integrated Lights Out Manager
Version 3.0.14.13.a r70764
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
->
-> ls
 /
    Targets:
        HOST
        STORAGE
        SYS
        SP
    Properties:
    Commands:
        cd
        show

ILOM can be browsed as it would be a directory structure. Here the “Targets” are different components of the system. When you “cd” into a target you see sub-components and so on. Each target can have properties, they are displayed as variable=value pairs under “Properties” section. And there are also list of “Commands” that you can execute for the current target. the “ls” command shows the sub-targets, the properties and the commands for the current target. Here’s how I found the console outputs from the failed node:

-> cd HOST
/HOST
-> ls
 /HOST
    Targets:
        console
        diag
    Properties:
        boot_device = default
        generate_host_nmi = (Cannot show property)
    Commands:
        cd
        set
        show
-> cd console
/HOST/console
-> ls
 /HOST/console
    Targets:
        history
    Properties:
        line_count = 0
        pause_count = 0
        start_from = end
    Commands:
        cd
        show
        start
        stop
-> cd history
/HOST/console/history
-> ls

The last “ls” command started printing all the history of console outputs on my screen and look what I found just before the startup sequence (I removed some lines to make this shorter and I also highlighted the most interesting lines):

divide error: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:1f:00.0/host7/port-7:1/expander-7:1/port-7:1:2/end_device-7:1:2/target7:0:15/7:0:15:0/timeout
CPU 3
Modules linked in: iptable_filter(U) ip_tables(U) x_tables(U) oracleacfs(P)(U) oracleadvm(P)(U) oracleoks(P)(U) mptctl(U) mptbase(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rfkill(U) nfs(U) fscache(U) nfs_acl(U) auth_rpcgss(U) lockd(U) sunrpc(U) bonding(U) be2iscsi(U) ib_iser(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) iscsi_tcp(U) bnx2i(U) cnic(U) uio(U) dm_round_robin(U) ipv6(U) cxgb3i(U) libcxgbi(U) cxgb3(U) mdio(U) libiscsi_tcp(U) libiscsi(U) scsi_transport_iscsi(U) video(U
) output(U) sbs(U) sbshc(U) parport_pc(U) lp(U) parport(U) ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) igb(U) ixgbe(U) joydev(U) ses(U) enclosure(U) e1000e(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U) snd_page_alloc(U) iTCO_wdt(U) iTCO_vendor_support(U) shpchp(U) i2c_i801(U) i2c_core(U) ioatdma(U) dca(U) pcspkr(U) dm_multipath(U) usb_storage(U) mpt2sas(U) scsi_transport_sas(U) raid_class(U)
 ahci(U) raid1(U) [last unloaded: microcode]
Pid: 29478, comm: top Tainted: P        W  2.6.32-300.11.1.el5uek #1 SUN FIRE X4370 M2 SERVER
RIP: 0010:[<ffffffff8104b3e8>]  [<ffffffff8104b3e8>] thread_group_times+0x5b/0xab
...
Kernel panic - not syncing: Fatal exception
Pid: 29478, comm: top Tainted: P      D W  2.6.32-300.11.1.el5uek #1
Call Trace:
 [<ffffffff8105797e>] panic+0xa5/0x162
 [<ffffffff8107ae09>] ? up+0x39/0x3e
 [<ffffffff810580d1>] ? release_console_sem+0x194/0x19d
 [<ffffffff8105839a>] ? console_unblank+0x6a/0x6f
 [<ffffffff8105764b>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff81456ea6>] oops_end+0xb7/0xc7
 [<ffffffff8101565d>] die+0x5a/0x63
 [<ffffffff8145677c>] do_trap+0x115/0x124
 [<ffffffff81013674>] do_divide_error+0x96/0x9f
 [<ffffffff8104b3e8>] ? thread_group_times+0x5b/0xab
 [<ffffffff810dd2f8>] ? get_page_from_freelist+0x4be/0x65e
 [<ffffffff81012b1b>] divide_error+0x1b/0x20
 [<ffffffff8104b3e8>] ? thread_group_times+0x5b/0xab
 [<ffffffff8104b3d4>] ? thread_group_times+0x47/0xab
 [<ffffffff8116ee13>] ? collect_sigign_sigcatch+0x46/0x5e
 [<ffffffff8116f366>] do_task_stat+0x354/0x8c3
 [<ffffffff81238267>] ? put_dec+0xcf/0xd2
 [<ffffffff81238396>] ? number+0x12c/0x244
 [<ffffffff8107419b>] ? get_pid_task+0xe/0x19
 [<ffffffff811eac34>] ? security_task_to_inode+0x16/0x18
 [<ffffffff8116a77b>] ? task_lock+0x15/0x17
 [<ffffffff8116add1>] ? task_dumpable+0x29/0x3c
 [<ffffffff8116c1c6>] ? pid_revalidate+0x80/0x99
 [<ffffffff81135992>] ? seq_open+0x25/0xba
 [<ffffffff81135a08>] ? seq_open+0x9b/0xba
 [<ffffffff8116d147>] ? proc_single_show+0x0/0x7a
 [<ffffffff81135b2e>] ? single_open+0x8f/0xb8
 [<ffffffff8116aa0e>] ? proc_single_open+0x23/0x3b
 [<ffffffff81127cc1>] ? do_filp_open+0x4f8/0x92d
 [<ffffffff8116f8e9>] proc_tgid_stat+0x14/0x16
 [<ffffffff8116d1a6>] proc_single_show+0x5f/0x7a
 [<ffffffff81135e73>] seq_read+0x193/0x350
 [<ffffffff811ea88c>] ? security_file_permission+0x16/0x18
 [<ffffffff8111a797>] vfs_read+0xad/0x107
 [<ffffffff8111b24b>] sys_read+0x4c/0x70
 [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
Rebooting in 60 seconds..???

A quick search on My Oracle Support quickly found a match: Kernel Panic at “thread_group_times+0x5b/0xab” (Doc ID 1620097.1)”. The call stack and the massages are a 100% match and the root cause is a kernel bug that’s fixed in more recent versions.
I’m not sure how I would have gotten to the root cause if this system was not an ODA and the server would have just bounced without logging the Kernel Panic in any of the logs. ODA’s ILOM definitely made the troubleshooting effort less painful and probably saved us from couple more incidents caused by this bug in the future as we’d been able to troubleshoot it quicklyand we’ll be able to implement the fix sooner.

email
Want to talk with an expert? Schedule a call with our team to get the conversation started.

About the Author

Maris Elsins is an experienced Oracle Applications DBA currently working as Lead Database Consultant at The Pythian Group. His main areas of expertise are troubleshooting and performance tuning of Oracle Database and e-Business Suite systems. He is a blogger and a frequent speaker at Oracle related conferences such as UKOUG, Collaborate, Oracle OpenWorld, HotSos, and others. Maris is an Oracle ACE, an Oracle Certified Master, and a co-author of “Practical Oracle Database Appliance” (Apress, 2014). He's also a member of the board at Latvian Oracle User Group.

1 Comment. Leave new

You probably want to look at netconsole

alvaro

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *