Discussion:
[PATCH] x86/mce: Keep quiet in case of broadcasted mce after system panic
Xunlei Pang
2017-01-23 08:01:51 UTC
We met an issue with kdump: after the kdump kernel boots up,
if a broadcast MCE arrives in the first kernel, the
other CPUs remaining in the first kernel will enter the old
MCE handler of the first kernel, then time out and panic due
to MCE synchronization, finally resetting the kdump CPUs.

This patch lets CPUs stay quiet once panic happens: before
the crashing CPU shoots them down, or after the kdump kernel
boots, they should do nothing on a broadcast MCE except clear
MCG_STATUS. This helps kdump push the vmcore dump through as
far as it can.

Previous efforts:
https://patchwork.kernel.org/patch/6167631/
https://lists.gt.net/linux/kernel/2146557

Cc: Naoya Horiguchi <n-***@ah.jp.nec.com>
Signed-off-by: Xunlei Pang <***@redhat.com>
---
arch/x86/kernel/cpu/mcheck/mce.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef432..0c2bf77 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1157,6 +1157,23 @@ void do_machine_check(struct pt_regs *regs, long error_code)

mce_gather_info(&m, regs);

+ /*
+ * Check if this MCE is signaled to only this logical processor,
+ * on Intel only.
+ */
+ if (m.cpuvendor == X86_VENDOR_INTEL)
+ lmce = m.mcgstatus & MCG_STATUS_LMCES;
+
+ /*
+ * Special treatment for Intel broadcasted machine check:
+ * To avoid panic due to MCE synchronization in case of kdump,
+ * after system panic, clear global status and bail out.
+ */
+ if (!lmce && atomic_read(&panic_cpu) != PANIC_CPU_INVALID) {
+ wrmsrl(MSR_IA32_MCG_STATUS, 0);
+ goto out;
+ }
+
final = this_cpu_ptr(&mces_seen);
*final = m;

@@ -1174,13 +1191,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
kill_it = 1;

/*
- * Check if this MCE is signaled to only this logical processor,
- * on Intel only.
- */
- if (m.cpuvendor == X86_VENDOR_INTEL)
- lmce = m.mcgstatus & MCG_STATUS_LMCES;
-
- /*
* Go through all banks in exclusion of the other CPUs. This way we
* don't report duplicated events on shared banks because the first one
* to see it will clear it. If this is a Local MCE, then no need to
--
1.8.3.1
Borislav Petkov
2017-01-23 12:51:57 UTC
Post by Xunlei Pang
We met an issue for kdump: after kdump kernel boots up,
and there comes a broadcasted mce in first kernel, the
How does that even happen?

Lemme try to understand this correctly: the first kernel gets an
MCE, kdump starts and boots a *whole* kernel and *then* you get the
broadcasted MCE? I have real hard time believing that.

What happened to the approach of clearing CR4.MCE before loading the
kdump kernel, in native_machine_shutdown() or wherever does the kdump
gets loaded...
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Xunlei Pang
2017-01-23 13:35:53 UTC
Post by Borislav Petkov
Post by Xunlei Pang
We met an issue for kdump: after kdump kernel boots up,
and there comes a broadcasted mce in first kernel, the
How does that even happen?
Lemme try to understand this correctly: the first kernel gets an
MCE, kdump starts and boots a *whole* kernel and *then* you get the
broadcasted MCE? I have real hard time believing that.
What happened to the approach of clearing CR4.MCE before loading the
kdump kernel, in native_machine_shutdown() or wherever does the kdump
gets loaded...
One possible timing sequence would be:
the 1st kernel, running on multiple CPUs, panics
then the crash dump code starts
the crash dump code stops the other CPUs except the crashing one
the 2nd kernel boots up on the crash CPU with "nr_cpus=1"
a broadcast MCE arrives on one of the other CPUs (not the crashing CPU)
the other CPUs enter the old MCE handler of the 1st kernel, while the crash CPU enters the new MCE handler of the 2nd kernel
the old MCE handler of the 1st kernel times out and panics due to MCE synchronization under the default setting

Regards,
Xunlei
Borislav Petkov
2017-01-23 14:50:56 UTC
Post by Xunlei Pang
1st kernel running on multiple cpus panicked
then the crash dump code starts
the crash dump code stops the others cpus except the crashing one
2nd kernel boots up on the crash cpu with "nr_cpus=1"
some broadcasted mce comes on some cpu amongst the other cpus(not the crashing cpu)
Where does this broadcasted MCE come from?

The crash dump code triggered it? Or it happened before the panic()?

Are you talking about an *actual* sequence which you're experiencing on
real hw or is this something hypothetical?
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Luck, Tony
2017-01-23 17:40:09 UTC
Post by Borislav Petkov
Post by Xunlei Pang
1st kernel running on multiple cpus panicked
then the crash dump code starts
the crash dump code stops the others cpus except the crashing one
2nd kernel boots up on the crash cpu with "nr_cpus=1"
some broadcasted mce comes on some cpu amongst the other cpus(not the crashing cpu)
Where does this broadcasted MCE come from?
The crash dump code triggered it? Or it happened before the panic()?
Are you talking about an *actual* sequence which you're experiencing on
real hw or is this something hypothetical?
If the system had experienced some memory corruption, but
recovered ... then there would be some pages sitting around
that the old kernel had marked as POISON and stopped using.
The kexec'd kernel doesn't know about these, so may touch that
memory while taking a crash dump ... and then you have a
broadcast machine check (on older[1] Intel CPUs that don't support
local machine check).

This is hard to work around. You really need all the CPUs to
have set CR4.MCE=1 (if any didn't, then they will force a reset
when they see the machine check). Also you need to make sure that
they jump to the copy of do_machine_check() in the new kernel, not
the old kernel.

A while ago I played with the nr_cpus=N code to have it bring
all the CPUs far enough online to get the machine check initialization
done, then any extras above "N" just go back offline again.
But I never got this to work reliably.

-Tony

[1] older == all released ones, at the moment.
Borislav Petkov
2017-01-23 17:51:30 UTC
Hey Tony,

a "welcome back" is in order? :-)
Post by Luck, Tony
If the system had experienced some memory corruption, but
recovered ... then there would be some pages sitting around
that the old kernel had marked as POISON and stopped using.
The kexec'd kernel doesn't know about these, so may touch that
memory while taking a crash dump ...
Hmm, pass a list of poisoned pages to the kdump kernel so that it knows
not to touch them. Looks like there's already functionality for that:

"makedumpfile can exclude the following types of pages while copying
VMCORE to DUMPFILE, and a user can choose which type of pages will be
excluded.

- Pages filled with zero
- Cache pages
- User process data pages
- Free pages"

(there is a makedumpfile manpage somewhere)

And apparently crash knows about poisoned pages and handles them:

static int __init crash_save_vmcoreinfo_init(void)
{
...
#ifdef CONFIG_MEMORY_FAILURE
VMCOREINFO_NUMBER(PG_hwpoison);
#endif

so if that works, the kexeced kernel should know about that list.
Post by Luck, Tony
and then you have a broadcast machine check (on older[1] Intel CPUs
that don't support local machine check).
Right.
Post by Luck, Tony
This is hard to work around. You really need all the CPUs to have set
CR4.MCE=1 (if any didn't, then they will force a reset when they see
the machine check). Also you need to make sure that they jump to the
copy of do_machine_check() in the new kernel, not the old kernel.
Doesn't matter, right? The new copy is as clueless as the old one about
those MCEs.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Luck, Tony
2017-01-23 18:01:53 UTC
Post by Borislav Petkov
Hey Tony,
a "welcome back" is in order? :-)
Yes - first day back today. Lots of catching up to do.
Post by Borislav Petkov
static int __init crash_save_vmcoreinfo_init(void)
{
...
#ifdef CONFIG_MEMORY_FAILURE
VMCOREINFO_NUMBER(PG_hwpoison);
#endif
so if that works, the kexeced kernel should know about that list.
Oh good ... it is smarter than I thought.
Post by Borislav Petkov
Doesn't matter, right? The new copy is as clueless as the old one about
those MCEs.
If things are well enough initialized that we don't reset, and
get to do_machine_check(), then this code from Ashok:

	/* If this CPU is offline, just bail out. */
	if (cpu_is_offline(smp_processor_id())) {
		u64 mcgstatus;

		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
		if (mcgstatus & MCG_STATUS_RIPV) {
			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
			return;
		}
	}

will ignore the machine check on the other cpus ... assuming
that "cpu_is_offline(smp_processor_id())" does the right thing
in the kexec case where this is an "old" cpu that isn't online
in the new kernel.

-Tony
Borislav Petkov
2017-01-23 18:14:37 UTC
Post by Luck, Tony
will ignore the machine check on the other cpus ... assuming
that "cpu_is_offline(smp_processor_id())" does the right thing
in the kexec case where this is an "old" cpu that isn't online
in the new kernel.
Nice. And kdump did do the dumping on one CPU, AFAIR. So we should be
good there.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Xunlei Pang
2017-01-24 02:33:26 UTC
Post by Borislav Petkov
Post by Luck, Tony
will ignore the machine check on the other cpus ... assuming
that "cpu_is_offline(smp_processor_id())" does the right thing
in the kexec case where this is an "old" cpu that isn't online
in the new kernel.
Nice. And kdump did do the dumping on one CPU, AFAIR. So we should be
good there.
"nr_cpus=N" consumes more memory as N grows; booting kdump with a very
large N is almost impossible given the limited crash memory reserved.

For some large machines nr_cpus=1 might not be enough, so we have to use
nr_cpus=4 or more; it is also helpful for parallel vmcore dumping :-)

Regards,
Xunlei
Xunlei Pang
2017-01-24 01:46:48 UTC
Post by Borislav Petkov
Hey Tony,
a "welcome back" is in order? :-)
Post by Luck, Tony
If the system had experienced some memory corruption, but
recovered ... then there would be some pages sitting around
that the old kernel had marked as POISON and stopped using.
The kexec'd kernel doesn't know about these, so may touch that
memory while taking a crash dump ...
Hmm, pass a list of poisoned pages to the kdump kernel so as not to
"makedumpfile can exclude the following types of pages while copying
VMCORE to DUMPFILE, and a user can choose which type of pages will be
excluded.
- Pages filled with zero
- Cache pages
- User process data pages
- Free pages"
(there is a makedumpfile manpage somewhere)
static int __init crash_save_vmcoreinfo_init(void)
{
...
#ifdef CONFIG_MEMORY_FAILURE
VMCOREINFO_NUMBER(PG_hwpoison);
#endif
so if that works, the kexeced kernel should know about that list.
From the log in my previous reply, the MCE occurred before makedumpfile
started dumping, so I wonder whether the poisoned pages belong to the crash
reserved memory, or whether this is some other type of event?

Besides, some kdump kernels may not use makedumpfile; for example, a simple
"cp" of "/proc/vmcore" is also allowed.
Post by Borislav Petkov
Post by Luck, Tony
and then you have a broadcast machine check (on older[1] Intel CPUs
that don't support local machine check).
Right.
Post by Luck, Tony
This is hard to work around. You really need all the CPUs to have set
CR4.MCE=1 (if any didn't, then they will force a reset when they see
the machine check). Also you need to make sure that they jump to the
copy of do_machine_check() in the new kernel, not the old kernel.
Doesn't matter, right? The new copy is as clueless as the old one about
those MCEs.
It's the code in mce_start(): it waits for all online CPUs, including the
CPUs that kdump boots on, to synchronize.

For the new MCE handler of the kdump kernel this is fine, as its number of
online CPUs is correct; but for the old MCE handler of the 1st kernel it is
not, because some CPUs that the 1st kernel still regards as online are now
running the 2nd kernel and cannot respond to the old MCE handler, which
therefore times out.

Regards,
Xunlei
Xunlei Pang
2017-01-24 01:51:55 UTC
Post by Xunlei Pang
Post by Borislav Petkov
Hey Tony,
a "welcome back" is in order? :-)
Post by Luck, Tony
If the system had experienced some memory corruption, but
recovered ... then there would be some pages sitting around
that the old kernel had marked as POISON and stopped using.
The kexec'd kernel doesn't know about these, so may touch that
memory while taking a crash dump ...
Hmm, pass a list of poisoned pages to the kdump kernel so as not to
"makedumpfile can exclude the following types of pages while copying
VMCORE to DUMPFILE, and a user can choose which type of pages will be
excluded.
- Pages filled with zero
- Cache pages
- User process data pages
- Free pages"
(there is a makedumpfile manpage somewhere)
static int __init crash_save_vmcoreinfo_init(void)
{
...
#ifdef CONFIG_MEMORY_FAILURE
VMCOREINFO_NUMBER(PG_hwpoison);
#endif
so if that works, the kexeced kernel should know about that list.
From the log in my previous reply, MCE occurred before makedumpfile dumping,
so I guess if the poisoned ones belong to the crash reserved memory or other
type of events?
Another possibility may be some system-reserved/PCIe memory
which is shared between the 1st and 2nd kernels.
Post by Xunlei Pang
Besides, some kdump kernel may not use makedumpfile, for example a simple "cp"
is also allowed to process "/proc/vmcore".
Post by Borislav Petkov
Post by Luck, Tony
and then you have a broadcast machine check (on older[1] Intel CPUs
that don't support local machine check).
Right.
Post by Luck, Tony
This is hard to work around. You really need all the CPUs to have set
CR4.MCE=1 (if any didn't, then they will force a reset when they see
the machine check). Also you need to make sure that they jump to the
copy of do_machine_check() in the new kernel, not the old kernel.
Doesn't matter, right? The new copy is as clueless as the old one about
those MCEs.
It's the code in mce_start(), it waits for all the online cpus including the cpus
that kdump boots on to synchronize.
So for new mce handler of kdump kernel, it is fine as the number of online cpus
is correct; as for old mce handler of 1st kernel, it's not true because some cpus
which are regarded online from 1st kernel's view are running the 2nd kernel now,
they can't respond to the old mce handler which will timeout the old mce handler.
Regards,
Xunlei
Xunlei Pang
2017-01-24 01:27:45 UTC
Post by Borislav Petkov
Post by Xunlei Pang
1st kernel running on multiple cpus panicked
then the crash dump code starts
the crash dump code stops the others cpus except the crashing one
2nd kernel boots up on the crash cpu with "nr_cpus=1"
some broadcasted mce comes on some cpu amongst the other cpus(not the crashing cpu)
Where does this broadcasted MCE come from?
The crash dump code triggered it? Or it happened before the panic()?
Are you talking about an *actual* sequence which you're experiencing on
real hw or is this something hypothetical?
It occurred on real hardware when testing crash dump.

1) SysRq-c was injected for the test in the 1st kernel
[ 49.897279] SysRq : Trigger a crash
2) The 2nd kernel started for kdump
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.el7.x86_64 root=UUID=976a15c8-8cbe-44ad-bb91-23f9b18e8789 ro console=ttyS1,115200 nmi_watchdog=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug disable_cpu_apicid=0 elfcorehdr=869772K
3) An MCE came to the 1st kernel, a timeout panic occurred, and the machine rebooted
[ 6.095706] Dazed and confused, but trying to continue // message of the 1st kernel
[ 81.655507] Kernel panic - not syncing: Timeout synchronizing machine check over CPUs
[ 82.729324] Shutting down cpus with NMI
[ 82.774539] drm_kms_helper: panic occurred, switching back to text console
[ 82.782257] Rebooting in 10 seconds..

Please see the attached file for the full log.

Regards,
Xunlei
Borislav Petkov
2017-01-24 12:22:12 UTC
Post by Xunlei Pang
It occurred on real hardware when testing crash dump.
1) SysRq-c was injected for the test in 1st kernel
[ 49.897279] SysRq : Trigger a crash 2) The 2nd kernel started for kdump
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.el7.x86_64 root=UUID=976a15c8-8cbe-44ad-bb91-23f9b18e8789
Yeah, no, I'm not debugging the RH Frankenstein kernel.

Please retrigger this with latest tip/master first.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Xunlei Pang
2017-01-26 06:30:02 UTC
Post by Borislav Petkov
Post by Xunlei Pang
It occurred on real hardware when testing crash dump.
1) SysRq-c was injected for the test in 1st kernel
[ 49.897279] SysRq : Trigger a crash 2) The 2nd kernel started for kdump
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.el7.x86_64 root=UUID=976a15c8-8cbe-44ad-bb91-23f9b18e8789
Yeah, no, I'm not debugging the RH Frankenstein kernel.
Please retrigger this with latest tip/master first.
The hardware machine check is hard to reproduce, but the MCE code of RHEL7
is essentially the same as that of tip/master; in any case we are able to
inject a software MCE to reproduce it.

It is also clear from a theoretical analysis of the code.

Regards,
Xunlei
Borislav Petkov
2017-01-26 06:44:00 UTC
Post by Xunlei Pang
The hardware machine check is hard to reproduce, but the mce code of
RHEL7 is quite the same as that of tip/master, anyway we are able to
inject software mce to reproduce it.
Please give me your exact steps so that I can try to reproduce it here
too.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Xunlei Pang
2017-02-16 05:36:37 UTC
Post by Borislav Petkov
Post by Xunlei Pang
The hardware machine check is hard to reproduce, but the mce code of
RHEL7 is quite the same as that of tip/master, anyway we are able to
inject software mce to reproduce it.
Please give me your exact steps so that I can try to reproduce it here
too.
Hi Borislav,

I tried to use qemu to inject an SRAO ("mce -b 0 0 0xb100000000000000 0x5 0x0 0x0");
it works well in the 1st kernel, but it doesn't work against the 1st kernel after
kdump boots (it seems the CPUs remaining in the 1st kernel don't respond to the
simulated broadcast MCE).

But in theory we know that the CPUs belonging to the kdump kernel can't respond
to the old MCE handler, so a single SRAO injection in the 1st kernel should behave
similarly. For example, I used "... -smp 2 -cpu Haswell" to launch a guest with
broadcast MCE support, and injected an SRAO to cpu0 only through the qemu monitor
("mce 0 0 0xb100000000000000 0x5 0x0 0x0"); cpu0 then timed out/panicked and
rebooted the machine as follows (running on linux-4.9):
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Kernel Offset: disabled
Rebooting in 30 seconds..

Regards,
Xunlei
Borislav Petkov
2017-02-16 10:18:45 UTC
Post by Xunlei Pang
I tried to use qemu to inject SRAO("mce -b 0 0 0xb100000000000000 0x5 0x0 0x0"),
it works well in 1st kernel, but it doesn't work for 1st kernel after kdump boots(seems
the cpus remain in 1st kernel don't respond to the simulated broadcasting mce).
But in theory, we know cpus belong to kdump kernel can't respond to the
old mce handler, so a single SRAO injection in 1st kernel should be similar.
For example, I used "... -smp 2 -cpu Haswell" to launch a simulation with broadcast
mce supported, and inject SRAO to cpu0 only through qemu monitor
"mce 0 0 0xb100000000000000 0x5 0x0 0x0", cpu0 will timeout/panic and reboot
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Sounds to me like you're trying hard to prove some point of yours which
doesn't make much sense to me. And when you say "in theory", that makes
it even less believable. So I remember asking you for exact steps. That
above doesn't read like steps but like some babbling and I've actually
tried to make sense of it for a couple of minutes but failed.

So lemme spell it out for ya. I'd like for you to give me this:

1. Build kernel with this config
2. Boot it in kvm with this settings
3. Do this in the guest
4. Do that in the guest
5. ...
6. ...


And all should be exact commands so that I can do them here on my machine.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Xunlei Pang
2017-02-16 11:52:09 UTC
Post by Borislav Petkov
Post by Xunlei Pang
I tried to use qemu to inject SRAO("mce -b 0 0 0xb100000000000000 0x5 0x0 0x0"),
it works well in 1st kernel, but it doesn't work for 1st kernel after kdump boots(seems
the cpus remain in 1st kernel don't respond to the simulated broadcasting mce).
But in theory, we know cpus belong to kdump kernel can't respond to the
old mce handler, so a single SRAO injection in 1st kernel should be similar.
For example, I used "... -smp 2 -cpu Haswell" to launch a simulation with broadcast
mce supported, and inject SRAO to cpu0 only through qemu monitor
"mce 0 0 0xb100000000000000 0x5 0x0 0x0", cpu0 will timeout/panic and reboot
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Sounds to me like you're trying hard to prove some point of yours which
doesn't make much sense to me. And when you say "in theory", that makes
it even less believable. So I remember asking you for exact steps. That
above doesn't read like steps but like some babbling and I've actually
tried to make sense of it for a couple of minutes but failed.
1. Build kernel with this config
2. Boot it in kvm with this settings
3. Do this in the guest
4. Do that in the guest
5. ...
6. ...
And all should be exact commands so that I can do them here on my machine.
Sorry, missed your point.

The steps should be as follows:
1. Prepare a multi-core Intel machine with broadcast MCE support.
   Enable kdump (crashkernel=256M) and configure the kdump kernel to boot with "nr_cpus=1".
2. Activate kdump, and crash the first kernel on some CPU, say cpu1
   (taskset -c 1 echo c > /proc/sysrq-trigger); kdump will then boot on cpu1.
3. After the kdump kernel boots up (let it enter the shell), trigger an SRAO on cpu1
   (QEMU monitor cmd: mce -b 1 0 0xb100000000000000 0x5 0x0 0x0);
   the MCE will then be broadcast to the other CPUs, which are still running
   in the first kernel (i.e. looping in crash_nmi_callback).
   If you own hardware that can inject an MCE, that would be even better, as QEMU
   does not work correctly for me.
4. Then something like the following is expected to happen:

[ 1.468556] tsc: Refined TSC clocksource calibration: 2933.437 MHz
Starting Kdump Vmcore Save Service...
kdump: saving to /sysroot//var/crash/127.0.0.1-2015-09-01-05:07:03/
kdump: saving vmcore-dmesg.txt
[ 39.000010] mce: [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 2: bd0000000000017a
[ 39.000010] mce: [Hardware Error]: TSC 0 ADDR 61600000 MISC 8c
[ 39.000010] mce: [Hardware Error]: PROCESSOR 0:106a3 TIME 1441083980 SOCKET 0 APIC 0 microcode 1
[ 39.000010] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 39.000010] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
[ 39.000010] Shutting down cpus with NMI
[ 1.758463] Uhhuh. NMI received for unknown reason 20 on CPU 0.
[ 1.758463] Do you have a strange power saving mode enabled?
[ 1.758463] Dazed and confused, but trying to continue
[ 39.000010] Rebooting in 30 seconds..

Regards,
Xunlei
Borislav Petkov
2017-02-16 12:22:16 UTC
Post by Xunlei Pang
then mce will be broadcast to the other cpus which are still running
in the first kernel(i.e. looping in crash_nmi_callback).
Simple: the crash code should really mark CPUs as not being online:

void do_machine_check(struct pt_regs *regs, long error_code)

...

	/* If this CPU is offline, just bail out. */
	if (cpu_is_offline(smp_processor_id())) {
		u64 mcgstatus;

		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
		if (mcgstatus & MCG_STATUS_RIPV) {
			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
			return;
		}
	}

because looping in crash_nmi_callback() does not really denote them as
CPUs being online.

And just so that you don't disturb the machine too much during crashing,
you could simply clear them from the online masks, i.e., perhaps call
remove_cpu_from_maps() with the proper locking around it instead of
doing a full cpu_down().

The machine will be killed anyway after kdump is done writing out
memory.
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Xunlei Pang
2017-02-17 01:53:21 UTC
Post by Xunlei Pang
Post by Xunlei Pang
then mce will be broadcast to the other cpus which are still running
in the first kernel(i.e. looping in crash_nmi_callback).
void do_machine_check(struct pt_regs *regs, long error_code)
...
/* If this CPU is offline, just bail out. */
if (cpu_is_offline(smp_processor_id())) {
u64 mcgstatus;
mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
if (mcgstatus & MCG_STATUS_RIPV) {
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
return;
}
}
because looping in crash_nmi_callback() does not really denote them as
CPUs being online.
And just so that you don't disturb the machine too much during crashing,
you could simply clear them from the online masks, i.e., perhaps call
remove_cpu_from_maps() with the proper locking around it instead of
doing a full cpu_down().
It changes the value of cpu_online_mask etc., which will cause confusion in
vmcore analysis.
Moreover, for the code (see the comment inline):

	if (cpu_is_offline(smp_processor_id())) {
		u64 mcgstatus;

		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
		if (mcgstatus & MCG_STATUS_RIPV) {	// This condition may not hold: an MCE triggered
							// on the kdump CPU need not have this bit set
							// on the other CPUs remaining in the 1st kernel.
			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
			return;
		}
	}


Regards,
Xunlei
Post by Xunlei Pang
The machine will be killed anyway after kdump is done writing out
memory.
Borislav Petkov
2017-02-17 09:07:35 UTC
Post by Xunlei Pang
It changes the value of cpu_online_mask/etc which will cause confusion to vmcore analysis.
Then export the crashing_cpu variable, initialize it to something
invalid in the first kernel, -1 for example, and test it in the #MC
handler like this:

	int cpu;

	...

	cpu = smp_processor_id();

	if (cpu_is_offline(cpu) ||
	    ((crashing_cpu != -1) && (crashing_cpu != cpu))) {
		u64 mcgstatus;

		mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
		if (mcgstatus & MCG_STATUS_RIPV) {
			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
			return;
		}
	}
Post by Xunlei Pang
Moreover, for the code(see comment inlined)
if (cpu_is_offline(smp_processor_id())) {
u64 mcgstatus;
mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
if (mcgstatus & MCG_STATUS_RIPV) { // This condition may be not true, the mce triggered on kdump cpu
// doesn't need to have this bit set for the other cpus remain in 1st kernel.
Is this on kvm or on a real hardware? Because for kvm I don't care. And
don't say "theoretically".
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Xunlei Pang
2017-02-17 16:21:42 UTC
Post by Borislav Petkov
Post by Xunlei Pang
It changes the value of cpu_online_mask/etc which will cause confusion to vmcore analysis.
Then export the crashing_cpu variable, initialize it to something
invalid in the first kernel, -1 for example, and test it in the #MC
int cpu;
...
cpu = smp_processor_id();
if (cpu_is_offline(cpu) ||
((crashing_cpu != -1) && (crashing_cpu != cpu)) {
u64 mcgstatus;
mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
if (mcgstatus & MCG_STATUS_RIPV) {
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
return;
}
}
Yes, it is doable, I will do some tests later.
Post by Borislav Petkov
Post by Xunlei Pang
Moreover, for the code(see comment inlined)
if (cpu_is_offline(smp_processor_id())) {
u64 mcgstatus;
mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
if (mcgstatus & MCG_STATUS_RIPV) { // This condition may be not true, the mce triggered on kdump cpu
// doesn't need to have this bit set for the other cpus remain in 1st kernel.
Is this on kvm or on a real hardware? Because for kvm I don't care. And
don't say "theoretically".
It's from my understanding; I didn't find an explicit description of this point in the Intel SDM.
If a broadcast SRAO arrives on real hardware, will MSR_IA32_MCG_STATUS have the MCG_STATUS_RIPV bit set on each CPU?

Regards,
Xunlei
Luck, Tony
2017-02-21 18:20:19 UTC
Post by Xunlei Pang
It's from my understanding, I didn't get the explicit description from the intel SDM on this point.
If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each cpu have MCG_STATUS_RIPV bit set?
MCG_STATUS is a per-thread MSR and will contain the status appropriate for that thread when #MC is delivered.
So the RIPV bit will be set if, and only if, the thread saved a valid return address for this exception. The net result
is that it is almost always set for "innocent bystander" CPUs that were dragged into the exception handler because
of a broadcast #MC. We make the test because if it isn't set, then do_machine_check() had better not return,
because we have no idea where it would return to - there is no valid return IP.

-Tony
Xunlei Pang
2017-02-22 05:50:47 UTC
Post by Luck, Tony
Post by Xunlei Pang
It's from my understanding, I didn't get the explicit description from the intel SDM on this point.
If a broadcast SRAO comes on real hardware, will MSR_IA32_MCG_STATUS of each cpu have MCG_STATUS_RIPV bit set?
MCG_STATUS is a per-thread MSR and will contain the status appropriate for that thread when #MC is delivered.
So the RIPV bit will be set if, and only if, the thread saved a valid return address for this exception. The net result
is that it is almost always set for "innocent bystander" CPUs that were dragged into the exception handler because
of a broadcast #MC. We make the test because if it isn't set, then the do_machine_check() had better not return
because we have no idea where it will return to - since there is not a valid return IP.
Got it, thanks for the details.

Regards,
Xunlei
