[PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown

Discussion:

[PATCH v2] x86/mce: Don't participate in rendezvous process once nmi_shootdown_cpus() was made

Xunlei Pang

2017-02-20 06:10:37 UTC

We met an issue for kdump: after kdump kernel boots up,
and there comes a broadcasted mce in first kernel, the
other cpus remaining in first kernel will enter the old
mce handler of first kernel, then timeout and panic due
to MCE synchronization, finally reset the kdump cpus.

This patch lets cpus stay quiet after nmi_shootdown_cpus(),
so before crash cpu shots them down or after kdump boots,
they should not do anything except clearing MCG_STATUS
in case of broadcasted mce. This is useful for kdump
to let the vmcore dumping perform as hard as it can.

Previous efforts:
https://patchwork.kernel.org/patch/6167631/
https://lists.gt.net/linux/kernel/2146557

Cc: Naoya Horiguchi <n-***@ah.jp.nec.com>
Suggested-by: Borislav Petkov <***@alien8.de>
Signed-off-by: Xunlei Pang <***@redhat.com>
---
v1->v2:
Using crashing_cpu according to Borislav's suggestion.

arch/x86/include/asm/reboot.h | 1 +
arch/x86/kernel/cpu/mcheck/mce.c | 6 ++++--
arch/x86/kernel/reboot.c | 16 +++++++++++++++-
3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 2cb1cc2..ec8657b6 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -26,5 +26,6 @@ struct machine_ops {
typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
void nmi_shootdown_cpus(nmi_shootdown_cb callback);
void run_crash_ipi_callback(struct pt_regs *regs);
+bool cpus_shotdown(void);

#endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c..3b56710 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -49,6 +49,7 @@
#include <asm/tlbflush.h>
#include <asm/mce.h>
#include <asm/msr.h>
+#include <asm/reboot.h>

#include "mce-internal.h"

@@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
*/
int lmce = 1;

- /* If this CPU is offline, just bail out. */
- if (cpu_is_offline(smp_processor_id())) {
+ /* If nmi shootdown happened or this CPU is offline, just bail out. */
+ if (cpus_shotdown() ||
+ cpu_is_offline(smp_processor_id())) {
u64 mcgstatus;

mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index e244c19..b301c8d 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -752,7 +752,7 @@ void machine_crash_shutdown(struct pt_regs *regs)
#if defined(CONFIG_SMP)

/* This keeps a track of which one is crashing cpu. */
-static int crashing_cpu;
+static int crashing_cpu = -1;
static nmi_shootdown_cb shootdown_callback;

static atomic_t waiting_for_crash_ipi;
@@ -852,6 +852,14 @@ void nmi_panic_self_stop(struct pt_regs *regs)
}
}

+bool cpus_shotdown(void)
+{
+ if (crashing_cpu != -1)
+ return true;
+
+ return false;
+}
+
#else /* !CONFIG_SMP */
void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
@@ -861,4 +869,10 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
void run_crash_ipi_callback(struct pt_regs *regs)
{
}
+
+bool cpus_shotdown(void)
+{
+ return false;
+}
+
#endif

--
1.8.3.1

Borislav Petkov

2017-02-20 11:09:41 UTC

Permalink

Post by Xunlei Pang
@@ -1128,8 +1129,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
*/
int lmce = 1;
- /* If this CPU is offline, just bail out. */
- if (cpu_is_offline(smp_processor_id())) {
+ /* If nmi shootdown happened or this CPU is offline, just bail out. */
+ if (cpus_shotdown() ||

I don't like "cpus_shotdown" - it doesn't hint at all that this is
special-handling crash/kdump.

And more importantly, I want it to be obvious that we do let the
crashing CPU into the MCE handler.

Why?

If we didn't, you will not handle *any* MCE, even a fatal one, during
dumping memory so if that dump is corrupted from the MCE, you won't
know. And I don't want to be the one staring at the corrupted dump and
wondering why I'm seeing what I'm seeing.

IOW, if we get a fatal MCE during dumping then we should go and die.
This is much better than silently corrupting the dump and not even
saying anything about it.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

Xunlei Pang

2017-02-20 13:29:24 UTC

Permalink

Post by Borislav Petkov

I don't like "cpus_shotdown" - it doesn't hint at all that this is
special-handling crash/kdump.
And more importantly, I want it to be obvious that we do let the
crashing CPU into the MCE handler.

Ok, I will export crashing_cpu and use it directly in mce handler.

Post by Borislav Petkov
Why?
If we didn't, you will not handle *any* MCE, even a fatal one, during
dumping memory so if that dump is corrupted from the MCE, you won't
know. And I don't want to be the one staring at the corrupted dump and
wondering why I'm seeing what I'm seeing.
IOW, if we get a fatal MCE during dumping then we should go and die.
This is much better than silently corrupting the dump and not even
saying anything about it.

My thought is that it doesn't matter after kdump boots as new mce handler
will be installed. If we get a fatal MCE during kdumping, the new handler will
handle the cpus running kdump kernel correctly.

There is a small window between crash and kdump kernel boot, so if a SRAO comes
within this window it will also cause the mce synchronization problem on the crashing
cpu if we don't bail out the crashing cpu.

Regards,
Xunlei

Borislav Petkov

2017-02-20 20:26:54 UTC

Permalink

Post by Xunlei Pang
There is a small window between crash and kdump kernel boot, so
if a SRAO comes within this window it will also cause the mce
synchronization problem on the crashing cpu if we don't bail out the
crashing cpu.

You mean, in the window between, kdump kernel starts writing out memory
and the second, kexec-ed kernel?

If so, please add that information to the place in do_machine_check()
where we check crashing_cpu so that we know why we're doing this
temporary ignore of #MC.

Thanks.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

Xunlei Pang

2017-02-21 01:28:20 UTC

Permalink

Post by Borislav Petkov

You mean, in the window between, kdump kernel starts writing out memory
and the second, kexec-ed kernel?

Not kdump kernel starts dumping, just during nmi_shootdown_cpus(), if some
MCE comes after crashing_cpu was set and we don't skip crashing_cpu, then
the crashing cpu will enter mce handler and trigger the synchronization issue.

Post by Borislav Petkov
If so, please add that information to the place in do_machine_check()
where we check crashing_cpu so that we know why we're doing this
temporary ignore of #MC.

Ok, will add, thanks for the feedback.

Regards,
Xunlei

Borislav Petkov

2017-02-21 08:41:50 UTC

Permalink

Post by Xunlei Pang
Not kdump kernel starts dumping, just during nmi_shootdown_cpus(), if some
MCE comes after crashing_cpu was set and we don't skip crashing_cpu, then
the crashing cpu will enter mce handler and trigger the synchronization issue.

Ok, I went and looked at __crash_kexec() - so we do nmi_shootdown_cpus()
as part of machine_crash_shutdown(). The next thing we do is kexec the
new kernel in machine_kexec(kexec_crash_image). The picture is clearer
now.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

Xunlei Pang

2017-02-21 01:26:29 UTC

Permalink

Post by Xunlei Pang

Post by Borislav Petkov

I don't like "cpus_shotdown" - it doesn't hint at all that this is
special-handling crash/kdump.
And more importantly, I want it to be obvious that we do let the
crashing CPU into the MCE handler.

Ok, I will export crashing_cpu and use it directly in mce handler.

Forget to mention, one reason I introduced cpus_shotdown() is that "crashing_cpu"
is defined with CONFIG_SMP=y, so we have to export it unconditionally if we don't want
to add the conditional code(i.e. with #ifdef CONFIG_SMP quoted) in mce.c.

Regards,
Xunlei

Post by Xunlei Pang

My thought is that it doesn't matter after kdump boots as new mce handler
will be installed. If we get a fatal MCE during kdumping, the new handler will
handle the cpus running kdump kernel correctly.
There is a small window between crash and kdump kernel boot, so if a SRAO comes
within this window it will also cause the mce synchronization problem on the crashing
cpu if we don't bail out the crashing cpu.
Regards,
Xunlei

Xunlei Pang

2017-02-21 12:37:21 UTC

Permalink

Post by Borislav Petkov

I don't like "cpus_shotdown" - it doesn't hint at all that this is
special-handling crash/kdump.
And more importantly, I want it to be obvious that we do let the
crashing CPU into the MCE handler.

Hi Boris,

I made some improvements, what do you think the following one?
If you think it is fine, I can send out v3. Thanks for your time!

---
arch/x86/include/asm/reboot.h | 1 +
arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++--
arch/x86/kernel/reboot.c | 5 +++--
3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 2cb1cc2..fc62ba8 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -15,6 +15,7 @@ struct machine_ops {
};

extern struct machine_ops machine_ops;
+extern int crashing_cpu;

void native_machine_crash_shutdown(struct pt_regs *regs);
void native_machine_shutdown(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 8e9725c..7f53145 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -49,6 +49,7 @@
#include <asm/tlbflush.h>
#include <asm/mce.h>
#include <asm/msr.h>
+#include <asm/reboot.h>

#include "mce-internal.h"

@@ -1128,8 +1129,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
*/
int lmce = 1;

- /* If this CPU is offline, just bail out. */
- if (cpu_is_offline(smp_processor_id())) {
+ /*
+ * Cases to bail out to avoid rendezvous process timeout:
+ * 1)If crashing_cpu was set, e.g. entering kdump,
+ * we need to skip cpus remaining in 1st kernel.
+ * 2)If this CPU is offline.
+ */
+ if (crashing_cpu != -1 ||
+ cpu_is_offline(smp_processor_id())) {
u64 mcgstatus;

mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index e244c19..92ecf4b 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -749,10 +749,11 @@ void machine_crash_shutdown(struct pt_regs *regs)
#endif

+/* This keeps a track of which one is crashing cpu. */
+int crashing_cpu = -1;
+
#if defined(CONFIG_SMP)

-/* This keeps a track of which one is crashing cpu. */
-static int crashing_cpu;
static nmi_shootdown_cb shootdown_callback;

static atomic_t waiting_for_crash_ipi;

--
1.8.3.1

Borislav Petkov

2017-02-21 17:27:03 UTC

Permalink

Post by Xunlei Pang
- /* If this CPU is offline, just bail out. */
- if (cpu_is_offline(smp_processor_id())) {
+ /*
+ * 1)If crashing_cpu was set, e.g. entering kdump,
+ * we need to skip cpus remaining in 1st kernel.
+ * 2)If this CPU is offline.
+ */
+ if (crashing_cpu != -1 ||
+ cpu_is_offline(smp_processor_id())) {

You're still not letting the crashing_cpu enter the #MC handler. You
need to handle the MCE no matter how short the window is.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.