Discussion:
kexec regression since 4.9 caused by efi
Dave Young
2017-03-09 06:38:06 UTC
Permalink
Add efi/kexec list.
Hi, everyone,
[ 0.001000] general protection fault: 0000 [#1] SMP
[ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1 #53
[ 0.001000] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 09/30/2016
[ 0.001000] task: ffffffff81e0e4c0 task.stack: ffffffff81e00000
[ 0.001000] RIP: 0010:virt_efi_set_variable+0x85/0x1a0
[ 0.001000] RSP: 0000:ffffffff81e03e18 EFLAGS: 00010202
[ 0.001000] RAX: afafafafafafafaf RBX: ffffffff81e3a4e0 RCX: 0000000000000007
[ 0.001000] RDX: ffffffff81e03e70 RSI: ffffffff81e3a4e0 RDI: ffff88407f8c2de0
[ 0.001000] RBP: ffffffff81e03e60 R08: 0000000000000000 R09: 0000000000000000
[ 0.001000] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81e03e70
[ 0.001000] R13: 0000000000000007 R14: 0000000000000000 R15: 0000000000000000
[ 0.001000] FS: 0000000000000000(0000) GS:ffff881fff600000(0000) knlGS:0000000000000000
[ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.001000] CR2: ffff88407f30f000 CR3: 0000001fff102000 CR4: 00000000000406b0
[ 0.001000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.001000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.001000] efi_delete_dummy_variable+0x7a/0x80
[ 0.001000] efi_enter_virtual_mode+0x3e2/0x494
[ 0.001000] start_kernel+0x392/0x418
[ 0.001000] ? set_init_arg+0x55/0x55
[ 0.001000] x86_64_start_reservations+0x2a/0x2c
[ 0.001000] x86_64_start_kernel+0xea/0xed
[ 0.001000] start_cpu+0x14/0x14
[ 0.001000] Code: 42 25 8d ff 80 3d 43 77 95 00 00 75 68 9c 8f 04 24 48 8b 05 3e 7d 7e 00 48 89 de 4d 89 f9 4d 89 f0 44 89 e9 4c 89 e2 48 8b 40 58 <48> 8b 78 58 31 c0 e8 90 e4 92 ff 48 8b 3c 24 48 c7 c6 2b 0a ca
[ 0.001000] RIP: virt_efi_set_variable+0x85/0x1a0 RSP: ffffffff81e03e18
[ 0.001000] ---[ end trace 0bd213e540e9b19f ]---
[ 0.001000] Kernel panic - not syncing: Fatal exception
[ 0.001000] ---[ end Kernel panic - not syncing: Fatal exception
Booting normally (i.e., not kexec) still works.
0: 42 25 8d ff 80 3d rex.X and $0x3d80ff8d,%eax
6: 43 77 95 rex.XB ja 0xffffffffffffff9e
9: 00 00 add %al,(%rax)
b: 75 68 jne 0x75
d: 9c pushfq
e: 8f 04 24 popq (%rsp)
11: 48 8b 05 3e 7d 7e 00 mov 0x7e7d3e(%rip),%rax # 0x7e7d56
18: 48 89 de mov %rbx,%rsi
1b: 4d 89 f9 mov %r15,%r9
1e: 4d 89 f0 mov %r14,%r8
21: 44 89 e9 mov %r13d,%ecx
24: 4c 89 e2 mov %r12,%rdx
27: 48 8b 40 58 mov 0x58(%rax),%rax
2b:* 48 8b 78 58 mov 0x58(%rax),%rdi <-- trapping instruction
2f: 31 c0 xor %eax,%eax
31: e8 90 e4 92 ff callq 0xffffffffff92e4c6
36: 48 8b 3c 24 mov (%rsp),%rdi
3a: 48 rex.W
3b: c7 .byte 0xc7
3c: c6 (bad)
3d: 2b 0a sub (%rdx),%ecx
3f: ca .byte 0xca
If I'm reading this correctly, efi.systab->runtime == 0xafafafafafafafaf,
I have no more clue yet from your provided log, but the runtime value is
odd to me. It is set in below code:

arch/x86/platform/efi/efi.c: efi_systab_init()
efi_systab.runtime = data ?
(void *)(unsigned long)data->runtime :
(void *)(unsigne long)systab64->runtime;

Here data is the setup_data passed by kexec-tools from normal kernel to
kexec kernel, efi_setup_data structure is like below:
struct efi_setup_data {
u64 fw_vendor;
u64 runtime;
u64 tables;
u64 smbios;
u64 reserved[8];
};

kexec-tools get the runtime address from /sys/firmware/efi/runtime

So can you do some debuggin on your side, eg. see the sysfs runtime
value is correct or not. And add some printk in efi init path etc.
and we're crashing when we try to dereference that.
nsole=ttyS1,57600 efi=debug
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[ 0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000009ffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000750bdfff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000750be000-0x0000000075ddbfff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000075ddc000-0x0000000075e32fff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x0000000075e33000-0x000000007642dfff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x000000007642e000-0x000000007bcd9fff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000007bcda000-0x000000007bcdafff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007bcdb000-0x000000007bd60fff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000007bd61000-0x000000007bffffff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007c000000-0x000000008fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000407fffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] reserve setup_data: [mem 0x0000000000000100-0x000000000009ffff] usable
[ 0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000000010006f] usable
[ 0.000000] reserve setup_data: [mem 0x0000000000100070-0x00000000750bdfff] usable
[ 0.000000] reserve setup_data: [mem 0x00000000750be000-0x0000000075ddbfff] reserved
[ 0.000000] reserve setup_data: [mem 0x0000000075ddc000-0x0000000075e32fff] ACPI data
[ 0.000000] reserve setup_data: [mem 0x0000000075e33000-0x000000007642dfff] ACPI NVS
[ 0.000000] reserve setup_data: [mem 0x000000007642e000-0x000000007bcd9fff] reserved
[ 0.000000] reserve setup_data: [mem 0x000000007bcda000-0x000000007bcdafff] usable
[ 0.000000] reserve setup_data: [mem 0x000000007bcdb000-0x000000007bd60fff] reserved
[ 0.000000] reserve setup_data: [mem 0x000000007bd61000-0x000000007bffffff] usable
[ 0.000000] reserve setup_data: [mem 0x000000007c000000-0x000000008fffffff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[ 0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000407fffffff] usable
[ 0.000000] efi: EFI v2.40 by American Megatrends
[ 0.000000] efi: ACPI=0x75f5c000 ACPI 2.0=0x75f5c000 ESRT=0x7bc4d018 SMBIOS=0xf05e0 SMBIOS 3.0=0x7bb2f000 MPS=0xfc9e0
[ 0.000000] efi: mem00: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007642e000-0x000000007bc50fff] (88MB)
[ 0.000000] efi: mem01: [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007bc51000-0x000000007bcd9fff] (0MB)
[ 0.000000] efi: mem02: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007bcdb000-0x000000007bd60fff] (0MB)
[ 0.000000] efi: mem03: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x0000000080000000-0x000000008fffffff] (256MB)
[ 0.000000] efi: mem04: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x00000000fed1c000-0x00000000fed44fff] (0MB)
[ 0.000000] efi: mem05: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x00000000ff000000-0x00000000ffffffff] (16MB)
[ 0.000000] SMBIOS 3.0.0 present.
[ 0.000000] DMI: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 09/30/2016
[ 0.000000] e820: last_pfn = 0x4080000 max_arch_pfn = 0x400000000
[ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT
[ 0.000000] x2apic: enabled by BIOS, switching to x2apic ops
[ 0.000000] e820: last_pfn = 0x7c000 max_arch_pfn = 0x400000000
[ 0.000000] esrt: Reserving ESRT space from 0x000000007bc4d018 to 0x000000007bc4d050.
Reverting commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and
avoid a kmalloc()") on top of v4.11-rc1 fixes the problem. Bisecting
- Up until 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a
kmalloc()"), kexec worked.
- From 8e80632fb23f to 9d80448ac92b ("efi/arm64: Add debugfs node to
dump UEFI runtime page tables"), kexec just hung after the
"kexec_core: Starting new kernel" message.
- From 3dad6f7f6975 ("x86/efi: Defer efi_esrt_init until after
memblock_x86_fill") to 0a637ee61247 ("x86/efi: Allow invocation of
arbitrary boot services"), kexec hit the BUG_ON(!efi.systab) in
kexec_enter_virtual_mode().
- Finally, after 92dc33501bfb ("x86/efi: Round EFI memmap reservations
to EFI_PAGE_SIZE"), I get the panic described above.
Let me know if there is any more information I can provide.
Thanks!
Thanks
Dave
Omar Sandoval
2017-03-09 09:54:08 UTC
Permalink
Post by Dave Young
Add efi/kexec list.
[snip]
Post by Dave Young
I have no more clue yet from your provided log, but the runtime value is
arch/x86/platform/efi/efi.c: efi_systab_init()
efi_systab.runtime = data ?
(void *)(unsigne long)systab64->runtime;
Here data is the setup_data passed by kexec-tools from normal kernel to
struct efi_setup_data {
u64 fw_vendor;
u64 runtime;
u64 tables;
u64 smbios;
u64 reserved[8];
};
kexec-tools get the runtime address from /sys/firmware/efi/runtime
So can you do some debuggin on your side, eg. see the sysfs runtime
value is correct or not. And add some printk in efi init path etc.
The attached patch fixes this for me.
Ard Biesheuvel
2017-03-09 11:53:36 UTC
Permalink
Post by Omar Sandoval
Post by Dave Young
Add efi/kexec list.
[snip]
Post by Dave Young
I have no more clue yet from your provided log, but the runtime value is
arch/x86/platform/efi/efi.c: efi_systab_init()
efi_systab.runtime = data ?
(void *)(unsigne long)systab64->runtime;
Here data is the setup_data passed by kexec-tools from normal kernel to
struct efi_setup_data {
u64 fw_vendor;
u64 runtime;
u64 tables;
u64 smbios;
u64 reserved[8];
};
kexec-tools get the runtime address from /sys/firmware/efi/runtime
So can you do some debuggin on your side, eg. see the sysfs runtime
value is correct or not. And add some printk in efi init path etc.
The attached patch fixes this for me.
Hi Omar,

Thanks for tracking this down.

I wonder if this is an unintended side effect of the way we repurpose
the EFI_MEMORY_RUNTIME attribute in efi_arch_mem_reserve(). AFAIUI,
splitting memory map entries should only be necessary for regions that
are not runtime memory regions to begin with, and so whether their
virtual mapping address makes sense or not should be irrelevant.

Perhaps this only illustrates my lack of understanding of the x86 way
of doing this, so perhaps Matt can shed some light on this?

Thanks,
Ard.
Dave Young
2017-03-10 01:39:45 UTC
Permalink
Post by Ard Biesheuvel
Post by Omar Sandoval
Post by Dave Young
Add efi/kexec list.
[snip]
Post by Dave Young
I have no more clue yet from your provided log, but the runtime value is
arch/x86/platform/efi/efi.c: efi_systab_init()
efi_systab.runtime = data ?
(void *)(unsigne long)systab64->runtime;
Here data is the setup_data passed by kexec-tools from normal kernel to
struct efi_setup_data {
u64 fw_vendor;
u64 runtime;
u64 tables;
u64 smbios;
u64 reserved[8];
};
kexec-tools get the runtime address from /sys/firmware/efi/runtime
So can you do some debuggin on your side, eg. see the sysfs runtime
value is correct or not. And add some printk in efi init path etc.
The attached patch fixes this for me.
Hi Omar,
Thanks for tracking this down.
I wonder if this is an unintended side effect of the way we repurpose
the EFI_MEMORY_RUNTIME attribute in efi_arch_mem_reserve(). AFAIUI,
splitting memory map entries should only be necessary for regions that
are not runtime memory regions to begin with, and so whether their
virtual mapping address makes sense or not should be irrelevant.
In this case the esrt chunk are Runtime Data which is not necessary to
be reserved explicitly. I think efi_arch_mem_reserve are for boot areas.

Probably there could be esrt data which belongs to boot data? If we are
sure they are all runtime, the better fix may be just dropping the
efi_mem_reserve in esrt.c
Post by Ard Biesheuvel
Perhaps this only illustrates my lack of understanding of the x86 way
of doing this, so perhaps Matt can shed some light on this?
Thanks,
Ard.
Thanks
Dave
Matt Fleming
2017-03-16 12:15:32 UTC
Permalink
Post by Ard Biesheuvel
Hi Omar,
Thanks for tracking this down.
I wonder if this is an unintended side effect of the way we repurpose
the EFI_MEMORY_RUNTIME attribute in efi_arch_mem_reserve(). AFAIUI,
splitting memory map entries should only be necessary for regions that
are not runtime memory regions to begin with, and so whether their
virtual mapping address makes sense or not should be irrelevant.
Perhaps this only illustrates my lack of understanding of the x86 way
of doing this, so perhaps Matt can shed some light on this?
Sorry for the delay.

Yes, Ard is correct. It's not necessary to split/reserve memory
regions that already have the EFI_MEMORY_RUNTIME attribute.
Dave Young
2017-03-10 01:42:56 UTC
Permalink
Post by Omar Sandoval
Post by Dave Young
Add efi/kexec list.
[snip]
Post by Dave Young
I have no more clue yet from your provided log, but the runtime value is
arch/x86/platform/efi/efi.c: efi_systab_init()
efi_systab.runtime = data ?
(void *)(unsigne long)systab64->runtime;
Here data is the setup_data passed by kexec-tools from normal kernel to
struct efi_setup_data {
u64 fw_vendor;
u64 runtime;
u64 tables;
u64 smbios;
u64 reserved[8];
};
kexec-tools get the runtime address from /sys/firmware/efi/runtime
So can you do some debuggin on your side, eg. see the sysfs runtime
value is correct or not. And add some printk in efi init path etc.
The attached patch fixes this for me.
From 4b343f0b0b408469f28c973ea52877797a166313 Mon Sep 17 00:00:00 2001
Date: Thu, 9 Mar 2017 01:46:19 -0800
Subject: [PATCH] efi: adjust virt_addr when splitting descriptors in
efi_memmap_insert()
When we split efi memory descriptors, we adjust the physical address but
not the virtual address it maps to. This leads to bogus memory mappings
later when these virtual addresses are used.
This fixes a kexec boot regression since 8e80632fb23f ("efi/esrt: Use
efi_mem_reserve() and avoid a kmalloc()"), although the bug was only
exposed by that commit.
---
drivers/firmware/efi/memmap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
index 78686443cb37..ca614db76faf 100644
--- a/drivers/firmware/efi/memmap.c
+++ b/drivers/firmware/efi/memmap.c
@@ -298,6 +298,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
memcpy(new, old, old_memmap->desc_size);
md = new;
md->phys_addr = m_end + 1;
+ md->virt_addr += md->phys_addr - start;
md->num_pages = (end - md->phys_addr + 1) >>
EFI_PAGE_SHIFT;
}
@@ -312,6 +313,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
md = new;
md->attribute |= m_attr;
md->phys_addr = m_start;
+ md->virt_addr += md->phys_addr - start;
md->num_pages = (m_end - m_start + 1) >>
EFI_PAGE_SHIFT;
/* last part */
@@ -319,6 +321,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
memcpy(new, old, old_memmap->desc_size);
md = new;
md->phys_addr = m_end + 1;
+ md->virt_addr += md->phys_addr - start;
md->num_pages = (end - m_end) >>
EFI_PAGE_SHIFT;
}
@@ -333,6 +336,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
memcpy(new, old, old_memmap->desc_size);
md = new;
md->phys_addr = m_start;
+ md->virt_addr += md->phys_addr - start;
md->num_pages = (end - md->phys_addr + 1) >>
EFI_PAGE_SHIFT;
md->attribute |= m_attr;
--
2.12.0
Nice, thanks for the debugging, so the problem is clear now.

Just Runtime areas are not necessarily to be reserved, for boot areas no
need to update the virt address. But I'm not sure about the fakemem
usage of this.

So need comments from Matt..

Thanks
Dave
Dave Young
2017-03-13 07:37:48 UTC
Permalink
Post by Omar Sandoval
Post by Dave Young
Add efi/kexec list.
[snip]
Post by Dave Young
I have no more clue yet from your provided log, but the runtime value is
arch/x86/platform/efi/efi.c: efi_systab_init()
efi_systab.runtime = data ?
(void *)(unsigne long)systab64->runtime;
Here data is the setup_data passed by kexec-tools from normal kernel to
struct efi_setup_data {
u64 fw_vendor;
u64 runtime;
u64 tables;
u64 smbios;
u64 reserved[8];
};
kexec-tools get the runtime address from /sys/firmware/efi/runtime
So can you do some debuggin on your side, eg. see the sysfs runtime
value is correct or not. And add some printk in efi init path etc.
The attached patch fixes this for me.
Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
correct to be used in efi_arch_mem_reserve, if it passed your test, I
can rewrite patch log with more background and send it out:

for_each_efi_memory_desc(md) {
[snip]
if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
md->type != EFI_BOOT_SERVICES_DATA &&
md->type != EFI_RUNTIME_SERVICES_DATA) {
continue;
}

In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
data or runtime data, this is wrong for efi_mem_reserve, because we are
reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
running time. Just is happened to work and we did not capture the error.

Signed-off-by: Dave Young <***@redhat.com>
---
arch/x86/platform/efi/quirks.c | 4 +++-
drivers/firmware/efi/efi.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/efi.h | 1 +
3 files changed, 43 insertions(+), 1 deletion(-)

--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -191,7 +191,9 @@ void __init efi_arch_mem_reserve(phys_ad
int num_entries;
void *new;

- if (efi_mem_desc_lookup(addr, &md)) {
+ if (efi_md_lookup_boot_data(addr, &md)) {
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
return;
}
--- linux-x86.orig/drivers/firmware/efi/efi.c
+++ linux-x86/drivers/firmware/efi/efi.c
@@ -353,6 +353,45 @@ err_put:
subsys_initcall(efisubsys_init);

/*
+ * Find the efi memory descriptor for a given physical address.
+ * Given a physical address, if it exists within an EFI memory map
+ * entry of type EFI_BOOT_SERVICES_DATA and the entry has no attribute
+ * EFI_MEMORY_RUNTIME, then populate the supplied memory descriptor
+ * with the appropriate data.
+ */
+int __init efi_md_lookup_boot_data(u64 phys_addr,
+ efi_memory_desc_t *out_md)
+{
+ efi_memory_desc_t *md;
+
+ if (!efi_enabled(EFI_MEMMAP)) {
+ pr_err_once("EFI_MEMMAP is not enabled.\n");
+ return -EINVAL;
+ }
+
+ if (!out_md) {
+ pr_err_once("out_md is null.\n");
+ return -EINVAL;
+ }
+
+ for_each_efi_memory_desc(md) {
+ u64 size;
+ u64 end;
+
+ if (md->type != EFI_BOOT_SERVICES_DATA)
+ continue;
+
+ size = md->num_pages << EFI_PAGE_SHIFT;
+ end = md->phys_addr + size;
+ if (phys_addr >= md->phys_addr && phys_addr < end) {
+ memcpy(out_md, md, sizeof(*out_md));
+ return 0;
+ }
+ }
+ return -ENOENT;
+}
+
+/*
* Find the efi memory descriptor for a given physical address. Given a
* physical address, determine if it exists within an EFI Memory Map entry,
* and if so, populate the supplied memory descriptor with the appropriate
--- linux-x86.orig/include/linux/efi.h
+++ linux-x86/include/linux/efi.h
@@ -979,6 +979,7 @@ extern u64 efi_mem_attribute (unsigned l
extern int __init efi_uart_console_only (void);
extern u64 efi_mem_desc_end(efi_memory_desc_t *md);
extern int efi_mem_desc_lookup(u64 phys_addr, efi_memory_desc_t *out_md);
+extern int efi_md_lookup_boot_data(u64 phys_addr, efi_memory_desc_t *out_md);
extern void efi_mem_reserve(phys_addr_t addr, u64 size);
extern void efi_initialize_iomem_resources(struct resource *code_resource,
struct resource *data_resource, struct resource *bss_resource);
Matt Fleming
2017-03-16 12:41:32 UTC
Permalink
Post by Dave Young
Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
correct to be used in efi_arch_mem_reserve, if it passed your test, I
for_each_efi_memory_desc(md) {
[snip]
if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
md->type != EFI_BOOT_SERVICES_DATA &&
md->type != EFI_RUNTIME_SERVICES_DATA) {
continue;
}
In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
data or runtime data, this is wrong for efi_mem_reserve, because we are
reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
running time. Just is happened to work and we did not capture the error.
Wouldn't something like this be simpler?

---

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 30031d5293c4..cdfe8c628959 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
return;
}

+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);
Omar Sandoval
2017-03-16 17:50:48 UTC
Permalink
Post by Matt Fleming
Post by Dave Young
Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
correct to be used in efi_arch_mem_reserve, if it passed your test, I
for_each_efi_memory_desc(md) {
[snip]
if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
md->type != EFI_BOOT_SERVICES_DATA &&
md->type != EFI_RUNTIME_SERVICES_DATA) {
continue;
}
In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
data or runtime data, this is wrong for efi_mem_reserve, because we are
reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
running time. Just is happened to work and we did not capture the error.
Wouldn't something like this be simpler?
---
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 30031d5293c4..cdfe8c628959 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
return;
}
+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);
This works for me.

Reported-and-tested-by: Omar Sandoval <***@fb.com>
Omar Sandoval
2017-04-03 23:54:34 UTC
Permalink
Post by Omar Sandoval
Post by Matt Fleming
Post by Dave Young
Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
correct to be used in efi_arch_mem_reserve, if it passed your test, I
for_each_efi_memory_desc(md) {
[snip]
if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
md->type != EFI_BOOT_SERVICES_DATA &&
md->type != EFI_RUNTIME_SERVICES_DATA) {
continue;
}
In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
data or runtime data, this is wrong for efi_mem_reserve, because we are
reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
running time. Just is happened to work and we did not capture the error.
Wouldn't something like this be simpler?
---
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 30031d5293c4..cdfe8c628959 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
return;
}
+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);
This works for me.
Is this going to go in for 4.11?
Dave Young
2017-03-17 02:09:51 UTC
Permalink
Post by Matt Fleming
Post by Dave Young
Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
correct to be used in efi_arch_mem_reserve, if it passed your test, I
for_each_efi_memory_desc(md) {
[snip]
if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
md->type != EFI_BOOT_SERVICES_DATA &&
md->type != EFI_RUNTIME_SERVICES_DATA) {
continue;
}
In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
data or runtime data, this is wrong for efi_mem_reserve, because we are
reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
running time. Just is happened to work and we did not capture the error.
Wouldn't something like this be simpler?
---
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 30031d5293c4..cdfe8c628959 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
return;
}
+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
Matt, I think it should be fine although I think the md type checking in
efi_mem_desc_lookup() is causing confusion and not easy to understand..

How about move the if chunk early like below because it seems no need
to sanity check the addr + size any more if the md is still RUNTIME?

--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -196,6 +196,10 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}

+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
return;

Thanks
Dave
Ard Biesheuvel
2017-03-17 13:25:33 UTC
Permalink
Post by Dave Young
Post by Matt Fleming
Post by Dave Young
Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
correct to be used in efi_arch_mem_reserve, if it passed your test, I
for_each_efi_memory_desc(md) {
[snip]
if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
md->type != EFI_BOOT_SERVICES_DATA &&
md->type != EFI_RUNTIME_SERVICES_DATA) {
continue;
}
In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
data or runtime data, this is wrong for efi_mem_reserve, because we are
reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
running time. Just is happened to work and we did not capture the error.
Wouldn't something like this be simpler?
---
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 30031d5293c4..cdfe8c628959 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
return;
}
+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
Matt, I think it should be fine although I think the md type checking in
efi_mem_desc_lookup() is causing confusion and not easy to understand..
How about move the if chunk early like below because it seems no need
to sanity check the addr + size any more if the md is still RUNTIME?
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -196,6 +196,10 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}
+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
return;
This way, we suppress the error it the region spans multiple
descriptors, and only the first one has the runtime attribute. So the
two patches are not equivalent. I don't have a strong preference
either way, though.
Matt Fleming
2017-03-17 13:32:32 UTC
Permalink
Post by Dave Young
Matt, I think it should be fine although I think the md type checking in
efi_mem_desc_lookup() is causing confusion and not easy to understand..
Could you make that a separate patch if you think of improvements
there?
Post by Dave Young
How about move the if chunk early like below because it seems no need
to sanity check the addr + size any more if the md is still RUNTIME?
My original version did as you suggest, but I changed it because we
*really* want to know if someone tries to reserve a range that spans
regions. That would be totally unexpected and a warning about a
potential bug/issue.
Dave Young
2017-03-20 02:14:12 UTC
Permalink
Post by Matt Fleming
Post by Dave Young
Matt, I think it should be fine although I think the md type checking in
efi_mem_desc_lookup() is causing confusion and not easy to understand..
Could you make that a separate patch if you think of improvements
there?
Duplicate the lookup function is indeed a little ugly, will do it when I
have a better idea, we can leave it as is since it works.
Post by Matt Fleming
Post by Dave Young
How about move the if chunk early like below because it seems no need
to sanity check the addr + size any more if the md is still RUNTIME?
My original version did as you suggest, but I changed it because we
*really* want to know if someone tries to reserve a range that spans
regions. That would be totally unexpected and a warning about a
potential bug/issue.
Matt, I'm fine if you prefer to capture the range checking errors.
Would you like me to post it or just you send it out?

Thanks
Dave
Dave Young
2017-03-21 07:48:26 UTC
Permalink
Post by Dave Young
Post by Matt Fleming
Post by Dave Young
Matt, I think it should be fine although I think the md type checking in
efi_mem_desc_lookup() is causing confusion and not easy to understand..
Could you make that a separate patch if you think of improvements
there?
Duplicate the lookup function is indeed a little ugly, will do it when I
have a better idea, we can leave it as is since it works.
Matt, rethinking about this, how about doint something below, not
tested, just seeking for idea and opinons, in this way no need duplicate
a function, but there is an assumption that no overlapped mem ranges in
efi memmap.

Also the case Omar reported is the esrt memory range type is
RUNTIME_DATA, that is a little different with the mem attribute of
RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
attribute, like bgrt in kexec reboot. Should we distinguish the two
cases and give out some warnings or debug info?


---
arch/x86/platform/efi/quirks.c | 5 +++++
drivers/firmware/efi/efi.c | 6 ------
drivers/firmware/efi/esrt.c | 7 +++++++
3 files changed, 12 insertions(+), 6 deletions(-)

--- linux-x86.orig/drivers/firmware/efi/efi.c
+++ linux-x86/drivers/firmware/efi/efi.c
@@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
u64 size;
u64 end;

- if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
- md->type != EFI_BOOT_SERVICES_DATA &&
- md->type != EFI_RUNTIME_SERVICES_DATA) {
- continue;
- }
-
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;
if (phys_addr >= md->phys_addr && phys_addr < end) {
--- linux-x86.orig/drivers/firmware/efi/esrt.c
+++ linux-x86/drivers/firmware/efi/esrt.c
@@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
return;
}

+ if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
+ md->type != EFI_BOOT_SERVICES_DATA &&
+ md->type != EFI_RUNTIME_SERVICES_DATA) {
+ pr_err("ESRT header memory map type is invalid\n");
+ return;
+ }
+
max = efi_mem_desc_end(&md);
if (max < efi.esrt) {
pr_err("EFI memory descriptor is invalid. (esrt: %p max: %p)\n",
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}

+ if (md->attribute & EFI_MEMORY_RUNTIME ||
+ md->type != EFI_BOOT_SERVICES_DATA) {
+ return;
+ }
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);
Post by Dave Young
Post by Matt Fleming
Post by Dave Young
How about move the if chunk early like below because it seems no need
to sanity check the addr + size any more if the md is still RUNTIME?
My original version did as you suggest, but I changed it because we
*really* want to know if someone tries to reserve a range that spans
regions. That would be totally unexpected and a warning about a
potential bug/issue.
Matt, I'm fine if you prefer to capture the range checking errors.
Would you like me to post it or just you send it out?
Thanks
Dave
Thanks
Dave
Ard Biesheuvel
2017-03-22 16:10:51 UTC
Permalink
Post by Dave Young
Post by Dave Young
Post by Matt Fleming
Post by Dave Young
Matt, I think it should be fine although I think the md type checking in
efi_mem_desc_lookup() is causing confusion and not easy to understand..
Could you make that a separate patch if you think of improvements
there?
Duplicate the lookup function is indeed a little ugly, will do it when I
have a better idea, we can leave it as is since it works.
Matt, rethinking about this, how about doint something below, not
tested, just seeking for idea and opinons, in this way no need duplicate
a function, but there is an assumption that no overlapped mem ranges in
efi memmap.
Also the case Omar reported is the esrt memory range type is
RUNTIME_DATA, that is a little different with the mem attribute of
RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
attribute, like bgrt in kexec reboot. Should we distinguish the two
cases and give out some warnings or debug info?
---
arch/x86/platform/efi/quirks.c | 5 +++++
drivers/firmware/efi/efi.c | 6 ------
drivers/firmware/efi/esrt.c | 7 +++++++
3 files changed, 12 insertions(+), 6 deletions(-)
--- linux-x86.orig/drivers/firmware/efi/efi.c
+++ linux-x86/drivers/firmware/efi/efi.c
@@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
u64 size;
u64 end;
- if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
- md->type != EFI_BOOT_SERVICES_DATA &&
- md->type != EFI_RUNTIME_SERVICES_DATA) {
- continue;
- }
-
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;
if (phys_addr >= md->phys_addr && phys_addr < end) {
--- linux-x86.orig/drivers/firmware/efi/esrt.c
+++ linux-x86/drivers/firmware/efi/esrt.c
@@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
return;
}
+ if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
+ md->type != EFI_BOOT_SERVICES_DATA &&
+ md->type != EFI_RUNTIME_SERVICES_DATA) {
+ pr_err("ESRT header memory map type is invalid\n");
+ return;
+ }
+
This looks wrong to me. While the meanings get convoluted in practice,
the EFI_MEMORY_RUNTIME attribute only means that the firmware requests
a virtual mapping for the region. It is perfectly legal for a
EFI_RUNTIME_SERVICES_DATA region not to have the EFI_MEMORY_RUNTIME
attribute, if the region is never accessed by the runtime services
themselves, and this is not entirely unlikely for tables that the
firmware exposes to the OS
Post by Dave Young
max = efi_mem_desc_end(&md);
if (max < efi.esrt) {
pr_err("EFI memory descriptor is invalid. (esrt: %p max: %p)\n",
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}
+ if (md->attribute & EFI_MEMORY_RUNTIME ||
+ md->type != EFI_BOOT_SERVICES_DATA) {
+ return;
+ }
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);
Post by Dave Young
Post by Matt Fleming
Post by Dave Young
How about move the if chunk early like below because it seems no need
to sanity check the addr + size any more if the md is still RUNTIME?
My original version did as you suggest, but I changed it because we
*really* want to know if someone tries to reserve a range that spans
regions. That would be totally unexpected and a warning about a
potential bug/issue.
Matt, I'm fine if you prefer to capture the range checking errors.
Would you like me to post it or just you send it out?
Thanks
Dave
Thanks
Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Dave Young
2017-03-23 02:43:27 UTC
Permalink
Post by Ard Biesheuvel
Post by Dave Young
Post by Dave Young
Post by Matt Fleming
Post by Dave Young
Matt, I think it should be fine although I think the md type checking in
efi_mem_desc_lookup() is causing confusion and not easy to understand..
Could you make that a separate patch if you think of improvements
there?
Duplicate the lookup function is indeed a little ugly, will do it when I
have a better idea, we can leave it as is since it works.
Matt, rethinking about this, how about doint something below, not
tested, just seeking for idea and opinons, in this way no need duplicate
a function, but there is an assumption that no overlapped mem ranges in
efi memmap.
Also the case Omar reported is the esrt memory range type is
RUNTIME_DATA, that is a little different with the mem attribute of
RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
attribute, like bgrt in kexec reboot. Should we distinguish the two
cases and give out some warnings or debug info?
---
arch/x86/platform/efi/quirks.c | 5 +++++
drivers/firmware/efi/efi.c | 6 ------
drivers/firmware/efi/esrt.c | 7 +++++++
3 files changed, 12 insertions(+), 6 deletions(-)
--- linux-x86.orig/drivers/firmware/efi/efi.c
+++ linux-x86/drivers/firmware/efi/efi.c
@@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
u64 size;
u64 end;
- if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
- md->type != EFI_BOOT_SERVICES_DATA &&
- md->type != EFI_RUNTIME_SERVICES_DATA) {
- continue;
- }
-
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;
if (phys_addr >= md->phys_addr && phys_addr < end) {
--- linux-x86.orig/drivers/firmware/efi/esrt.c
+++ linux-x86/drivers/firmware/efi/esrt.c
@@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
return;
}
+ if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
+ md->type != EFI_BOOT_SERVICES_DATA &&
+ md->type != EFI_RUNTIME_SERVICES_DATA) {
+ pr_err("ESRT header memory map type is invalid\n");
+ return;
+ }
+
This looks wrong to me. While the meanings get convoluted in practice,
the EFI_MEMORY_RUNTIME attribute only means that the firmware requests
a virtual mapping for the region. It is perfectly legal for a
EFI_RUNTIME_SERVICES_DATA region not to have the EFI_MEMORY_RUNTIME
attribute, if the region is never accessed by the runtime services
themselves, and this is not entirely unlikely for tables that the
firmware exposes to the OS
Thanks for the comment, if so "!(md->attribute & EFI_MEMORY_RUNTIME) &&"
should be dropped.

BTW, md->type should be md.type, bgrt reserving works fine with this
change but I have no esrt machine to test it. I would like to wait for
Matt's opinions about this first before an update.

Also cc Peter about the esrt piece.
Post by Ard Biesheuvel
Post by Dave Young
max = efi_mem_desc_end(&md);
if (max < efi.esrt) {
pr_err("EFI memory descriptor is invalid. (esrt: %p max: %p)\n",
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}
+ if (md->attribute & EFI_MEMORY_RUNTIME ||
+ md->type != EFI_BOOT_SERVICES_DATA) {
+ return;
+ }
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);
Post by Dave Young
Post by Matt Fleming
Post by Dave Young
How about move the if chunk early like below because it seems no need
to sanity check the addr + size any more if the md is still RUNTIME?
My original version did as you suggest, but I changed it because we
*really* want to know if someone tries to reserve a range that spans
regions. That would be totally unexpected and a warning about a
potential bug/issue.
Matt, I'm fine if you prefer to capture the range checking errors.
Would you like me to post it or just you send it out?
Thanks
Dave
Thanks
Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-efi" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Matt Fleming
2017-04-04 13:37:19 UTC
Permalink
Post by Dave Young
Matt, I'm fine if you prefer to capture the range checking errors.
Would you like me to post it or just you send it out?
Can you please send out the patch with the minimal change to
efi_arch_mem_reserve() and we'll get it into urgent ASAP.
Dave Young
2017-04-05 01:23:41 UTC
Permalink
Post by Matt Fleming
Post by Dave Young
Matt, I'm fine if you prefer to capture the range checking errors.
Would you like me to post it or just you send it out?
Can you please send out the patch with the minimal change to
efi_arch_mem_reserve() and we'll get it into urgent ASAP.
Omar has sent it out, for the lookup function issue I think I can do it
after this one later.

Thanks
Dave

Loading...