Discussion:
[PATCH v29 0/9] arm64: add kdump support
AKASHI Takahiro
2016-12-28 04:33:47 UTC
Permalink
This patch series adds kdump support on arm64.

To load a crash-dump kernel to the systems, a series of patches to
kexec-tools[1] are also needed. Please use the latest one, v4 [2].
For your convenience, you can pick them up from:
https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump
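
For reference, a typical kdump invocation with a patched kexec-tools looks
roughly like the following (the image paths and the 512M reservation size
are made-up examples, not taken from this series):

```shell
# 1. Boot the primary kernel with a reservation, e.g. crashkernel=512M
#    on its command line.
# 2. Load the dump-capture kernel into that reservation (-p = panic kernel):
kexec -p /boot/vmlinux-capture \
      --initrd=/boot/initrd-capture \
      --append="1 maxcpus=1 reset_devices"
# 3. After a panic, the capture kernel boots and exposes /proc/vmcore,
#    which can then be saved, e.g. with cp or makedumpfile.
```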

To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
- crash utility (v7.1.7 or later) [3]
(As of today, the very latest tree cannot handle a core image correctly
due to the commit 7fd8329ba502 ("taint/module: Clean up global and module
taint flags handling").)


The previous version, v27, was also:
Tested-by: Pratyush Anand <***@redhat.com> (mustang and seattle)
Tested-by: James Morse <***@arm.com> (Juno)


Changes for v29 (Dec 28, 2016)
o rebased to Linux-v4.10-rc1
o change asm constraints in crash_setup_regs() per Catalin

Changes for v28 (Nov 22, 2016)
o rebased to Linux-v4.9-rc6
o revamp patch #1 and merge memblock_cap_memory_range() with
memblock_mem_limit_remove_map()

Changes for v27 (Nov 1, 2016)
o rebased to Linux-v4.9-rc3
o revert v26 change, i.e. revive "linux,usable-memory-range" property
(patch #2/#3, updating patch #9)
o minor fixes per review comments (patch #3/#4/#6/#8)
o re-order patches and improve commit messages for readability

Changes for v26 (Sep 7, 2016):
o Use /reserved-memory instead of "linux,usable-memory-range" property
(dropping v25's patch#2 and #3, updating ex-patch#9.)

Changes for v25 (Aug 29, 2016):
o Rebase to Linux-4.8-rc4
o Use memremap() instead of ioremap_cache() [patch#5]

Changes for v24 (Aug 9, 2016):
o Rebase to Linux-4.8-rc1
o Update descriptions of the newly added DT properties

Changes for v23 (July 26, 2016):

o Move memblock_reserve() to a single place in reserve_crashkernel()
o Use cpu_park_loop() in ipi_cpu_crash_stop()
o Always enforce ARCH_LOW_ADDRESS_LIMIT to the memory range of crash kernel
o Re-implement fdt_enforce_memory_region() to remove non-reserve regions
(for ACPI) from usable memory at crash kernel

Changes for v22 (July 12, 2016):

o Export "crashkernel-base" and "crashkernel-size" via device-tree,
and add some descriptions about them in chosen.txt
o Rename "usable-memory" to "usable-memory-range" to avoid inconsistency
with powerpc's "usable-memory"
o Make cosmetic changes regarding "ifdef" usage
o Correct some wordings in kdump.txt

Changes for v21 (July 6, 2016):

o Remove kexec patches.
o Rebase to arm64's for-next/core (Linux-4.7-rc4 based).
o Clarify the description about kvm in kdump.txt.

See the link [4] for older changes.


[1] https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
[2] http://lists.infradead.org/pipermail/kexec/2016-November/017555.html
[3] https://github.com/crash-utility/crash.git
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-June/438780.html

AKASHI Takahiro (8):
memblock: add memblock_cap_memory_range()
arm64: limit memory regions based on DT property, usable-memory-range
arm64: kdump: reserve memory for crash dump kernel
arm64: kdump: implement machine_crash_shutdown()
arm64: kdump: add VMCOREINFO's for user-space tools
arm64: kdump: provide /proc/vmcore file
arm64: kdump: enable kdump in defconfig
Documentation: kdump: describe arm64 port

James Morse (1):
Documentation: dt: chosen properties for arm64 kdump

Documentation/devicetree/bindings/chosen.txt | 50 +++++++
Documentation/kdump/kdump.txt | 16 ++-
arch/arm64/Kconfig | 11 ++
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 +++++-
arch/arm64/include/asm/smp.h | 2 +
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 ++++++++++
arch/arm64/kernel/machine_kexec.c | 67 ++++++++-
arch/arm64/kernel/setup.c | 7 +-
arch/arm64/kernel/smp.c | 63 +++++++++
arch/arm64/mm/init.c | 199 +++++++++++++++++++++++++++
include/linux/memblock.h | 1 +
mm/memblock.c | 44 ++++--
15 files changed, 555 insertions(+), 22 deletions(-)
create mode 100644 arch/arm64/kernel/crash_dump.c
--
2.11.0
AKASHI Takahiro
2016-12-28 04:35:25 UTC
Permalink
Add memblock_cap_memory_range() which will remove all the memblock regions
except the memory range specified in the arguments. In addition, rework is
done on memblock_mem_limit_remove_map() to re-implement it using
memblock_cap_memory_range().

This function, like memblock_mem_limit_remove_map(), will not remove
memblocks with MEMMAP_NOMAP attribute as they may be mapped and accessed
later as "device memory."
See the commit a571d4eb55d8 ("mm/memblock.c: add new infrastructure to
address the mem limit issue").

This function is used, in a succeeding patch of the arm64 kdump support
series, to limit the range of usable memory, or System RAM, on the crash
dump kernel.
(Please note that the "mem=" parameter is of little use for this purpose.)
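
The intended capping semantics can be sketched as a small standalone C
model (illustrative userspace code, not the kernel implementation; it
omits the NOMAP exception and the reserved-type truncation, and the
array-based region list is an assumption for the sketch):

```c
#include <assert.h>
#include <stddef.h>

/* A memory region as a half-open physical range [start, end). */
struct region { unsigned long long start, end; };

/*
 * Keep only the parts of the regions that overlap [base, base + size):
 * regions entirely outside the window are dropped, and a region
 * straddling a boundary is clipped. Returns the new region count.
 */
static size_t cap_range(struct region *r, size_t n,
			unsigned long long base, unsigned long long size)
{
	unsigned long long limit = base + size;
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long long s = r[i].start > base ? r[i].start : base;
		unsigned long long e = r[i].end < limit ? r[i].end : limit;

		if (s < e)		/* keep only the overlapping part */
			r[out++] = (struct region){ s, e };
	}
	return out;
}
```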

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Will Deacon <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
Cc: linux-***@kvack.org
Cc: Andrew Morton <***@linux-foundation.org>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 44 +++++++++++++++++++++++++++++---------------
2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5b759c9acf97..fbfcacc50c29 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -333,6 +333,7 @@ phys_addr_t memblock_mem_size(unsigned long limit_pfn);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
void memblock_enforce_memory_limit(phys_addr_t memory_limit);
+void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
void memblock_mem_limit_remove_map(phys_addr_t limit);
bool memblock_is_memory(phys_addr_t addr);
int memblock_is_map_memory(phys_addr_t addr);
diff --git a/mm/memblock.c b/mm/memblock.c
index 7608bc305936..fea1688fef60 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1514,11 +1514,37 @@ void __init memblock_enforce_memory_limit(phys_addr_t limit)
(phys_addr_t)ULLONG_MAX);
}

+void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size)
+{
+ int start_rgn, end_rgn;
+ int i, ret;
+
+ if (!size)
+ return;
+
+ ret = memblock_isolate_range(&memblock.memory, base, size,
+ &start_rgn, &end_rgn);
+ if (ret)
+ return;
+
+ /* remove all the MAP regions */
+ for (i = memblock.memory.cnt - 1; i >= end_rgn; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ for (i = start_rgn - 1; i >= 0; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ /* truncate the reserved regions */
+ memblock_remove_range(&memblock.reserved, 0, base);
+ memblock_remove_range(&memblock.reserved,
+ base + size, (phys_addr_t)ULLONG_MAX);
+}
+
void __init memblock_mem_limit_remove_map(phys_addr_t limit)
{
- struct memblock_type *type = &memblock.memory;
phys_addr_t max_addr;
- int i, ret, start_rgn, end_rgn;

if (!limit)
return;
@@ -1529,19 +1555,7 @@ void __init memblock_mem_limit_remove_map(phys_addr_t limit)
if (max_addr == (phys_addr_t)ULLONG_MAX)
return;

- ret = memblock_isolate_range(type, max_addr, (phys_addr_t)ULLONG_MAX,
- &start_rgn, &end_rgn);
- if (ret)
- return;
-
- /* remove all the MAP regions above the limit */
- for (i = end_rgn - 1; i >= start_rgn; i--) {
- if (!memblock_is_nomap(&type->regions[i]))
- memblock_remove_region(type, i);
- }
- /* truncate the reserved regions */
- memblock_remove_range(&memblock.reserved, max_addr,
- (phys_addr_t)ULLONG_MAX);
+ memblock_cap_memory_range(0, max_addr);
}

static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
--
2.11.0
Will Deacon
2017-01-10 11:16:18 UTC
Permalink
[adding some more folks to cc]
Post by AKASHI Takahiro
Add memblock_cap_memory_range() which will remove all the memblock regions
except the memory range specified in the arguments. In addition, rework is
done on memblock_mem_limit_remove_map() to re-implement it using
memblock_cap_memory_range().
This function, like memblock_mem_limit_remove_map(), will not remove
memblocks with MEMMAP_NOMAP attribute as they may be mapped and accessed
later as "device memory."
See the commit a571d4eb55d8 ("mm/memblock.c: add new infrastructure to
address the mem limit issue").
This function is used, in a succeeding patch of the arm64 kdump
support series, to limit the range of usable memory, or System RAM, on crash dump
kernel.
(Please note that "mem=" parameter is of little use for this purpose.)
---
include/linux/memblock.h | 1 +
mm/memblock.c | 44 +++++++++++++++++++++++++++++---------------
2 files changed, 30 insertions(+), 15 deletions(-)
Whilst this patch looks fine to me, it would be nice if Dennis (author of
memblock_mem_limit_remove_map) or one of the mm chaps could provide an ack
before I merge it via arm64.

Thanks,

Will
Post by AKASHI Takahiro
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5b759c9acf97..fbfcacc50c29 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -333,6 +333,7 @@ phys_addr_t memblock_mem_size(unsigned long limit_pfn);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
void memblock_enforce_memory_limit(phys_addr_t memory_limit);
+void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
void memblock_mem_limit_remove_map(phys_addr_t limit);
bool memblock_is_memory(phys_addr_t addr);
int memblock_is_map_memory(phys_addr_t addr);
diff --git a/mm/memblock.c b/mm/memblock.c
index 7608bc305936..fea1688fef60 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1514,11 +1514,37 @@ void __init memblock_enforce_memory_limit(phys_addr_t limit)
(phys_addr_t)ULLONG_MAX);
}
+void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size)
+{
+ int start_rgn, end_rgn;
+ int i, ret;
+
+ if (!size)
+ return;
+
+ ret = memblock_isolate_range(&memblock.memory, base, size,
+ &start_rgn, &end_rgn);
+ if (ret)
+ return;
+
+ /* remove all the MAP regions */
+ for (i = memblock.memory.cnt - 1; i >= end_rgn; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ for (i = start_rgn - 1; i >= 0; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ /* truncate the reserved regions */
+ memblock_remove_range(&memblock.reserved, 0, base);
+ memblock_remove_range(&memblock.reserved,
+ base + size, (phys_addr_t)ULLONG_MAX);
+}
+
void __init memblock_mem_limit_remove_map(phys_addr_t limit)
{
- struct memblock_type *type = &memblock.memory;
phys_addr_t max_addr;
- int i, ret, start_rgn, end_rgn;
if (!limit)
return;
@@ -1529,19 +1555,7 @@ void __init memblock_mem_limit_remove_map(phys_addr_t limit)
if (max_addr == (phys_addr_t)ULLONG_MAX)
return;
- ret = memblock_isolate_range(type, max_addr, (phys_addr_t)ULLONG_MAX,
- &start_rgn, &end_rgn);
- if (ret)
- return;
-
- /* remove all the MAP regions above the limit */
- for (i = end_rgn - 1; i >= start_rgn; i--) {
- if (!memblock_is_nomap(&type->regions[i]))
- memblock_remove_region(type, i);
- }
- /* truncate the reserved regions */
- memblock_remove_range(&memblock.reserved, max_addr,
- (phys_addr_t)ULLONG_MAX);
+ memblock_cap_memory_range(0, max_addr);
}
static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
--
2.11.0
Dennis Chen
2017-01-11 01:57:39 UTC
Permalink
Post by Will Deacon
[adding some more folks to cc]
Post by AKASHI Takahiro
Add memblock_cap_memory_range() which will remove all the memblock regions
except the memory range specified in the arguments. In addition, rework is
done on memblock_mem_limit_remove_map() to re-implement it using
memblock_cap_memory_range().
This function, like memblock_mem_limit_remove_map(), will not remove
memblocks with MEMMAP_NOMAP attribute as they may be mapped and accessed
later as "device memory."
See the commit a571d4eb55d8 ("mm/memblock.c: add new infrastructure to
address the mem limit issue").
This function is used, in a succeeding patch of the arm64 kdump
support series, to limit the range of usable memory, or System RAM, on crash dump
kernel.
(Please note that "mem=" parameter is of little use for this purpose.)
---
include/linux/memblock.h | 1 +
mm/memblock.c | 44 +++++++++++++++++++++++++++++---------------
2 files changed, 30 insertions(+), 15 deletions(-)
Whilst this patch looks fine to me, it would be nice if Dennis (author of
memblock_mem_limit_remove_map) or one of the mm chaps could provide an ack
before I merge it via arm64.
Will, feel free to add
Acked-by: Dennis Chen <***@arm.com>

Thanks,
Dennis
Post by Will Deacon
Thanks,
Will
Post by AKASHI Takahiro
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5b759c9acf97..fbfcacc50c29 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -333,6 +333,7 @@ phys_addr_t memblock_mem_size(unsigned long limit_pfn);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
void memblock_enforce_memory_limit(phys_addr_t memory_limit);
+void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
void memblock_mem_limit_remove_map(phys_addr_t limit);
bool memblock_is_memory(phys_addr_t addr);
int memblock_is_map_memory(phys_addr_t addr);
diff --git a/mm/memblock.c b/mm/memblock.c
index 7608bc305936..fea1688fef60 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1514,11 +1514,37 @@ void __init memblock_enforce_memory_limit(phys_addr_t limit)
(phys_addr_t)ULLONG_MAX);
}
+void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size)
+{
+ int start_rgn, end_rgn;
+ int i, ret;
+
+ if (!size)
+ return;
+
+ ret = memblock_isolate_range(&memblock.memory, base, size,
+ &start_rgn, &end_rgn);
+ if (ret)
+ return;
+
+ /* remove all the MAP regions */
+ for (i = memblock.memory.cnt - 1; i >= end_rgn; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ for (i = start_rgn - 1; i >= 0; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ /* truncate the reserved regions */
+ memblock_remove_range(&memblock.reserved, 0, base);
+ memblock_remove_range(&memblock.reserved,
+ base + size, (phys_addr_t)ULLONG_MAX);
+}
+
void __init memblock_mem_limit_remove_map(phys_addr_t limit)
{
- struct memblock_type *type = &memblock.memory;
phys_addr_t max_addr;
- int i, ret, start_rgn, end_rgn;
if (!limit)
return;
@@ -1529,19 +1555,7 @@ void __init memblock_mem_limit_remove_map(phys_addr_t limit)
if (max_addr == (phys_addr_t)ULLONG_MAX)
return;
- ret = memblock_isolate_range(type, max_addr, (phys_addr_t)ULLONG_MAX,
- &start_rgn, &end_rgn);
- if (ret)
- return;
-
- /* remove all the MAP regions above the limit */
- for (i = end_rgn - 1; i >= start_rgn; i--) {
- if (!memblock_is_nomap(&type->regions[i]))
- memblock_remove_region(type, i);
- }
- /* truncate the reserved regions */
- memblock_remove_range(&memblock.reserved, max_addr,
- (phys_addr_t)ULLONG_MAX);
+ memblock_cap_memory_range(0, max_addr);
}
static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
--
2.11.0
AKASHI Takahiro
2016-12-28 04:35:59 UTC
Permalink
A crash dump kernel utilizes only a subset of the available memory as System
RAM. On arm64, this memory range is advertised to the crash dump kernel via
a device-tree property under /chosen:
linux,usable-memory-range = <BASE SIZE>

The crash dump kernel reads this property at boot time and calls
memblock_cap_memory_range() to limit the usable memory ranges, which are
otherwise described by entries in the UEFI memory map table or "memory"
nodes in a device tree blob.
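
For illustration, a /chosen node carrying this property might look like the
following (the addresses are made-up example values; two address cells and
two size cells are assumed, describing 512MB at 0xa0000000):

```dts
/ {
	chosen {
		/* usable System RAM for the crash dump kernel */
		linux,usable-memory-range = <0x0 0xa0000000 0x0 0x20000000>;
	};
};
```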

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Geoff Levand <***@infradead.org>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/mm/init.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 212c4d1e2f26..65f1241c372c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -187,10 +187,45 @@ static int __init early_mem(char *p)
}
early_param("mem", early_mem);

+static int __init early_init_dt_scan_usablemem(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ struct memblock_region *usablemem = (struct memblock_region *)data;
+ const __be32 *reg;
+ int len;
+
+ usablemem->size = 0;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,usable-memory-range", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+static void __init fdt_enforce_memory_region(void)
+{
+ struct memblock_region reg;
+
+ of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
+
+ if (reg.size)
+ memblock_cap_memory_range(reg.base, reg.size);
+}
+
void __init arm64_memblock_init(void)
{
const s64 linear_region_size = -(s64)PAGE_OFFSET;

+ /* Handle linux,usable-memory-range property */
+ fdt_enforce_memory_region();
+
/*
* Ensure that the linear region takes up exactly half of the kernel
* virtual address space. This way, we can distinguish a linear address
--
2.11.0
AKASHI Takahiro
2016-12-28 04:36:00 UTC
Permalink
"crashkernel=" kernel parameter specifies the size (and optionally
the start address) of the system ram used by crash dump kernel.
reserve_crashkernel() will allocate and reserve the memory at the startup
of primary kernel.

This memory range will be exported to userspace via:
- an entry named "Crash kernel" in /proc/iomem, and
- "linux,crashkernel-base" and "linux,crashkernel-size" under
/sys/firmware/devicetree/base/chosen
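
As a rough illustration of the /proc/iomem channel, userspace can recover
the reserved range from a "Crash kernel" line like so (the sample line
below is made up; on a real system you would read /proc/iomem itself):

```shell
line="  a0000000-b7ffffff : Crash kernel"   # sample /proc/iomem line
range=${line%% :*}                          # keep "  a0000000-b7ffffff"
range=${range// /}                          # drop the leading padding
start=$(( 0x${range%-*} ))                  # first hex field
end=$(( 0x${range#*-} ))                    # second hex field
size=$(( end - start + 1 ))                 # resource end is inclusive
printf 'crashkernel: base=%#x size=%#x\n' "$start" "$size"
```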

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Signed-off-by: Mark Salter <***@redhat.com>
Signed-off-by: Pratyush Anand <***@redhat.com>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/kernel/setup.c | 7 ++-
arch/arm64/mm/init.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index b051367e2149..4083069057b5 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -31,7 +31,6 @@
#include <linux/screen_info.h>
#include <linux/init.h>
#include <linux/kexec.h>
-#include <linux/crash_dump.h>
#include <linux/root_dev.h>
#include <linux/cpu.h>
#include <linux/interrupt.h>
@@ -225,6 +224,12 @@ static void __init request_standard_resources(void)
kernel_data.end <= res->end)
request_resource(res, &kernel_data);
}
+
+#ifdef CONFIG_KEXEC_CORE
+ /* User space tools will find "Crash kernel" region in /proc/iomem. */
+ if (crashk_res.end)
+ insert_resource(&iomem_resource, &crashk_res);
+#endif
}

u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 65f1241c372c..1d62bf71b531 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -30,12 +30,14 @@
#include <linux/gfp.h>
#include <linux/memblock.h>
#include <linux/sort.h>
+#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/dma-mapping.h>
#include <linux/dma-contiguous.h>
#include <linux/efi.h>
#include <linux/swiotlb.h>
#include <linux/vmalloc.h>
+#include <linux/kexec.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -76,6 +78,111 @@ static int __init early_initrd(char *p)
early_param("initrd", early_initrd);
#endif

+#ifdef CONFIG_KEXEC_CORE
+static unsigned long long crash_size, crash_base;
+static struct property crash_base_prop = {
+ .name = "linux,crashkernel-base",
+ .length = sizeof(u64),
+ .value = &crash_base
+};
+static struct property crash_size_prop = {
+ .name = "linux,crashkernel-size",
+ .length = sizeof(u64),
+ .value = &crash_size,
+};
+
+static int __init export_crashkernel(void)
+{
+ struct device_node *node;
+ int ret;
+
+ if (!crash_size)
+ return 0;
+
+ /* Add /chosen/linux,crashkernel-* properties */
+ node = of_find_node_by_path("/chosen");
+ if (!node)
+ return -ENOENT;
+
+ /*
+ * There might be existing crash kernel properties, but we can't
+ * be sure what's in them, so remove them.
+ */
+ of_remove_property(node, of_find_property(node,
+ "linux,crashkernel-base", NULL));
+ of_remove_property(node, of_find_property(node,
+ "linux,crashkernel-size", NULL));
+
+ ret = of_add_property(node, &crash_base_prop);
+ if (ret)
+ goto ret_err;
+
+ ret = of_add_property(node, &crash_size_prop);
+ if (ret)
+ goto ret_err;
+
+ return 0;
+
+ret_err:
+ pr_warn("Exporting crashkernel region to device tree failed\n");
+ return ret;
+}
+late_initcall(export_crashkernel);
+
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+{
+ int ret;
+
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base);
+ /* no crashkernel= or invalid value specified */
+ if (ret || !crash_size)
+ return;
+
+ if (crash_base == 0) {
+ /* Current arm64 boot protocol requires 2MB alignment */
+ crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+ crash_size, SZ_2M);
+ if (crash_base == 0) {
+ pr_warn("Unable to allocate crashkernel (size:%llx)\n",
+ crash_size);
+ return;
+ }
+ } else {
+ /* User specifies base address explicitly. */
+ if (!memblock_is_region_memory(crash_base, crash_size) ||
+ memblock_is_region_reserved(crash_base, crash_size)) {
+ pr_warn("crashkernel has wrong address or size\n");
+ return;
+ }
+
+ if (!IS_ALIGNED(crash_base, SZ_2M)) {
+ pr_warn("crashkernel base address is not 2MB aligned\n");
+ return;
+ }
+ }
+ memblock_reserve(crash_base, crash_size);
+
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
+
+ crashk_res.start = crash_base;
+ crashk_res.end = crash_base + crash_size - 1;
+}
+#else
+static void __init reserve_crashkernel(void)
+{
+ ;
+}
+#endif /* CONFIG_KEXEC_CORE */
+
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -331,6 +438,9 @@ void __init arm64_memblock_init(void)
arm64_dma_phys_limit = max_zone_dma_phys();
else
arm64_dma_phys_limit = PHYS_MASK + 1;
+
+ reserve_crashkernel();
+
dma_contiguous_reserve(arm64_dma_phys_limit);

memblock_allow_resize();
--
2.11.0
Mark Rutland
2017-01-12 15:09:26 UTC
Permalink
Hi,

As a general note, I must apologise for my minimal review of the series
until this point. I'm very concerned with the way the DT parts are
organised, and clearly I did not communicate my concerns and suggestions
effectively in prior rounds of review.
Post by AKASHI Takahiro
"crashkernel=" kernel parameter specifies the size (and optionally
the start address) of the system ram used by crash dump kernel.
reserve_crashkernel() will allocate and reserve the memory at the startup
of primary kernel.
- an entry named "Crash kernel" in /proc/iomem, and
- "linux,crashkernel-base" and "linux,crashkernel-size" under
/sys/firmware/devicetree/base/chosen
+#ifdef CONFIG_KEXEC_CORE
+static unsigned long long crash_size, crash_base;
+static struct property crash_base_prop = {
+ .name = "linux,crashkernel-base",
+ .length = sizeof(u64),
+ .value = &crash_base
+};
+static struct property crash_size_prop = {
+ .name = "linux,crashkernel-size",
+ .length = sizeof(u64),
+ .value = &crash_size,
+};
+
+static int __init export_crashkernel(void)
+{
+ struct device_node *node;
+ int ret;
+
+ if (!crash_size)
+ return 0;
+
+ /* Add /chosen/linux,crashkernel-* properties */
+ node = of_find_node_by_path("/chosen");
+ if (!node)
+ return -ENOENT;
+
+ /*
+ * There might be existing crash kernel properties, but we can't
+ * be sure what's in them, so remove them.
+ */
+ of_remove_property(node, of_find_property(node,
+ "linux,crashkernel-base", NULL));
+ of_remove_property(node, of_find_property(node,
+ "linux,crashkernel-size", NULL));
+
+ ret = of_add_property(node, &crash_base_prop);
+ if (ret)
+ goto ret_err;
+
+ ret = of_add_property(node, &crash_size_prop);
+ if (ret)
+ goto ret_err;
+
+ return 0;
+
+ret_err:
+ pr_warn("Exporting crashkernel region to device tree failed\n");
+ return ret;
+}
+late_initcall(export_crashkernel);
I very much do not like this.

I don't think we should be modifying the DT exposed to userspace in this
manner, in the usual boot path, especially given that the kernel itself
does not appear to be a consumer of this property. I do not think that
it is right to use the DT exposed to userspace as a communication
channel solely between the kernel and userspace.

So I think we should drop the above, and for arm64 have userspace
consistently use /proc/iomem (or perhaps a new kexec-specific file) to
determine the region reserved for the crash kernel, if it needs to know
this.

I'll have further comments on this front in the binding patch.
Post by AKASHI Takahiro
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+{
+ int ret;
+
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base);
+ /* no crashkernel= or invalid value specified */
+ if (ret || !crash_size)
+ return;
+
+ if (crash_base == 0) {
+ /* Current arm64 boot protocol requires 2MB alignment */
+ crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+ crash_size, SZ_2M);
+ if (crash_base == 0) {
+ pr_warn("Unable to allocate crashkernel (size:%llx)\n",
+ crash_size);
+ return;
+ }
+ } else {
+ /* User specifies base address explicitly. */
+ if (!memblock_is_region_memory(crash_base, crash_size) ||
+ memblock_is_region_reserved(crash_base, crash_size)) {
+ pr_warn("crashkernel has wrong address or size\n");
+ return;
+ }
+
+ if (!IS_ALIGNED(crash_base, SZ_2M)) {
+ pr_warn("crashkernel base address is not 2MB aligned\n");
+ return;
+ }
+ }
+ memblock_reserve(crash_base, crash_size);
This will mean that the crash kernel will have a permanent alias in the linear
map which is vulnerable to being clobbered. There could also be issues
with mismatched attributes in future.

We're probably ok for now, but in future we'll likely want to fix this
up to remove the region (or mark it nomap), and only map it temporarily
when loading things into the region.
Post by AKASHI Takahiro
+
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
+
+ crashk_res.start = crash_base;
+ crashk_res.end = crash_base + crash_size - 1;
+}
+#else
+static void __init reserve_crashkernel(void)
+{
+ ;
Nit: the ';' line can go.
Post by AKASHI Takahiro
+}
+#endif /* CONFIG_KEXEC_CORE */
+
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -331,6 +438,9 @@ void __init arm64_memblock_init(void)
arm64_dma_phys_limit = max_zone_dma_phys();
else
arm64_dma_phys_limit = PHYS_MASK + 1;
+
+ reserve_crashkernel();
+
dma_contiguous_reserve(arm64_dma_phys_limit);
memblock_allow_resize();
--
2.11.0
Other than my comments regarding the DT usage above, this looks fine to
me.

Thanks,
Mark.
AKASHI Takahiro
2017-01-13 08:16:18 UTC
Permalink
Hi Mark,
Post by Mark Rutland
Hi,
As a general note, I must apologise for my minimial review of the series
until this point. Judging by the way the DT parts are organised. I'm
very concerned with the way the DT parts are organised, and clearly I
did not communicate my concerns and suggestions effectively in prior
rounds of review.
Post by AKASHI Takahiro
"crashkernel=" kernel parameter specifies the size (and optionally
the start address) of the system ram used by crash dump kernel.
reserve_crashkernel() will allocate and reserve the memory at the startup
of primary kernel.
- an entry named "Crash kernel" in /proc/iomem, and
- "linux,crashkernel-base" and "linux,crashkernel-size" under
/sys/firmware/devicetree/base/chosen
+#ifdef CONFIG_KEXEC_CORE
+static unsigned long long crash_size, crash_base;
+static struct property crash_base_prop = {
+ .name = "linux,crashkernel-base",
+ .length = sizeof(u64),
+ .value = &crash_base
+};
+static struct property crash_size_prop = {
+ .name = "linux,crashkernel-size",
+ .length = sizeof(u64),
+ .value = &crash_size,
+};
+
+static int __init export_crashkernel(void)
+{
+ struct device_node *node;
+ int ret;
+
+ if (!crash_size)
+ return 0;
+
+ /* Add /chosen/linux,crashkernel-* properties */
+ node = of_find_node_by_path("/chosen");
+ if (!node)
+ return -ENOENT;
+
+ /*
+ * There might be existing crash kernel properties, but we can't
+ * be sure what's in them, so remove them.
+ */
+ of_remove_property(node, of_find_property(node,
+ "linux,crashkernel-base", NULL));
+ of_remove_property(node, of_find_property(node,
+ "linux,crashkernel-size", NULL));
+
+ ret = of_add_property(node, &crash_base_prop);
+ if (ret)
+ goto ret_err;
+
+ ret = of_add_property(node, &crash_size_prop);
+ if (ret)
+ goto ret_err;
+
+ return 0;
+
+ret_err:
+ pr_warn("Exporting crashkernel region to device tree failed\n");
+ return ret;
+}
+late_initcall(export_crashkernel);
I very much do not like this.
I don't think we should be modifying the DT exposed to userspace in this
manner, in the usual boot path, especially given that the kernel itself
does not appear to be a consumer of this property. I do not think that
it is right to use the DT exposed to userspace as a communication
channel solely between the kernel and userspace.
As you mentioned in your comments against my patch#9, this property
originates from the PPC implementation.
I added it solely out of sympathy for DT-based architectures.
Post by Mark Rutland
So I think we should drop the above, and for arm64 have userspace
consistently use /proc/iomem (or perhaps a new kexec-specific file) to
determine the region reserved for the crash kernel, if it needs to know
this.
As a matter of fact, my port of kexec-tools doesn't check this property
and dropping it won't cause any problem.
Post by Mark Rutland
I'll have further comments on this front in the binding patch.
Post by AKASHI Takahiro
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+{
+ int ret;
+
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base);
+ /* no crashkernel= or invalid value specified */
+ if (ret || !crash_size)
+ return;
+
+ if (crash_base == 0) {
+ /* Current arm64 boot protocol requires 2MB alignment */
+ crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+ crash_size, SZ_2M);
+ if (crash_base == 0) {
+ pr_warn("Unable to allocate crashkernel (size:%llx)\n",
+ crash_size);
+ return;
+ }
+ } else {
+ /* User specifies base address explicitly. */
+ if (!memblock_is_region_memory(crash_base, crash_size) ||
+ memblock_is_region_reserved(crash_base, crash_size)) {
+ pr_warn("crashkernel has wrong address or size\n");
+ return;
+ }
+
+ if (!IS_ALIGNED(crash_base, SZ_2M)) {
+ pr_warn("crashkernel base address is not 2MB aligned\n");
+ return;
+ }
+ }
+ memblock_reserve(crash_base, crash_size);
This will mean that the crash kernel will have a permanent alias in the linear
map which is vulnerable to being clobbered. There could also be issues
with mismatched attributes in future.
Good point. I'd never thought of that beyond making the memblock
region "reserved".
Post by Mark Rutland
We're probably ok for now, but in future we'll likely want to fix this
up to remove the region (or mark it nomap), and only map it temporarily
when loading things into the region.
Well, I found that the following commit is already in:

commit 9b492cf58077
Author: Xunlei Pang <***@redhat.com>
Date: Mon May 23 16:24:10 2016 -0700

    kexec: introduce a protection mechanism for the crashkernel
    reserved memory
To make best use of this framework, I'd like to re-use set_memory_ro/rx()
instead of removing the region from the linear mapping. But to do so,
we need to:
* make memblock_isolate_range() global,
* allow set_memory_ro/rx() to be applied to regions in the linear mapping,
  since set_memory_ro/rx() works only on page-level mappings.

What do you think?
(See my tentative solution below.)
Post by Mark Rutland
Post by AKASHI Takahiro
+
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
+
+ crashk_res.start = crash_base;
+ crashk_res.end = crash_base + crash_size - 1;
+}
+#else
+static void __init reserve_crashkernel(void)
+{
+ ;
Nit: the ';' line can go.
OK.

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Post by AKASHI Takahiro
+}
+#endif /* CONFIG_KEXEC_CORE */
+
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -331,6 +438,9 @@ void __init arm64_memblock_init(void)
arm64_dma_phys_limit = max_zone_dma_phys();
else
arm64_dma_phys_limit = PHYS_MASK + 1;
+
+ reserve_crashkernel();
+
dma_contiguous_reserve(arm64_dma_phys_limit);
memblock_allow_resize();
--
2.11.0
Other than my comments regarding the DT usage above, this looks fine to
me.
Thanks,
Mark.
===8<===
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index c0fc3d458195..bb21c0473b8e 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -211,6 +211,44 @@ void machine_kexec(struct kimage *kimage)
BUG(); /* Should never get here. */
}

+static int kexec_mark_range(unsigned long start, unsigned long end,
+ bool protect)
+{
+ unsigned int nr_pages;
+
+ if (!end || start >= end)
+ return 0;
+
+ nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
+
+ if (protect)
+ return set_memory_ro(__phys_to_virt(start), nr_pages);
+ else
+ return set_memory_rw(__phys_to_virt(start), nr_pages);
+}
+
+static void kexec_mark_crashkres(bool protect)
+{
+ unsigned long control;
+
+ /* Don't touch the control code page used in crash_kexec().*/
+ control = page_to_phys(kexec_crash_image->control_code_page);
+ kexec_mark_range(crashk_res.start, control - 1, protect);
+
+ control += KEXEC_CONTROL_PAGE_SIZE;
+ kexec_mark_range(control, crashk_res.end, protect);
+}
+
+void arch_kexec_protect_crashkres(void)
+{
+ kexec_mark_crashkres(true);
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+ kexec_mark_crashkres(false);
+}
+
static void machine_kexec_mask_interrupts(void)
{
unsigned int i;
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 569ec3325bc8..764ec89c4f76 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -90,6 +90,7 @@ early_param("initrd", early_initrd);
static void __init reserve_crashkernel(void)
{
unsigned long long crash_size, crash_base;
+ int start_rgn, end_rgn;
int ret;

ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
@@ -121,6 +122,9 @@ static void __init reserve_crashkernel(void)
}
}
memblock_reserve(crash_base, crash_size);
+ memblock_isolate_range(&memblock.memory, crash_base, crash_size,
+ &start_rgn, &end_rgn);
+

pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
crash_size >> 20, crash_base >> 20);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..0f60f19c287b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
@@ -362,6 +363,17 @@ static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end
unsigned long kernel_start = __pa(_text);
unsigned long kernel_end = __pa(__init_begin);

+#ifdef CONFIG_KEXEC_CORE
+ if (crashk_res.end && start >= crashk_res.start &&
+ end <= (crashk_res.end + 1)) {
+ __create_pgd_mapping(pgd, start, __phys_to_virt(start),
+ end - start, PAGE_KERNEL,
+ early_pgtable_alloc,
+ true);
+ return;
+ }
+#endif
+
/*
* Take care not to create a writable alias for the
* read-only text and rodata sections of the kernel image.
===>8===
Mark Rutland
2017-01-13 11:39:15 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+static int __init export_crashkernel(void)
+ /* Add /chosen/linux,crashkernel-* properties */
+ of_remove_property(node, of_find_property(node,
+ "linux,crashkernel-base", NULL));
+ of_remove_property(node, of_find_property(node,
+ "linux,crashkernel-size", NULL));
+
+ ret = of_add_property(node, &crash_base_prop);
+ if (ret)
+ goto ret_err;
+
+ ret = of_add_property(node, &crash_size_prop);
+ if (ret)
+ goto ret_err;
I very much do not like this.
I don't think we should be modifying the DT exposed to userspace in this
manner, in the usual boot path, especially given that the kernel itself
does not appear to be a consumer of this property. I do not think that
it is right to use the DT exposed to userspace as a communication
channel solely between the kernel and userspace.
As you mentioned in your comments against my patch#9, this property
originates from the PPC implementation.
I added it solely out of sympathy for DT-based architectures.
Post by Mark Rutland
So I think we should drop the above, and for arm64 have userspace
consistently use /proc/iomem (or perhaps a new kexec-specific file) to
determine the region reserved for the crash kernel, if it needs to know
this.
As a matter of fact, my port of kexec-tools doesn't check this property
and dropping it won't cause any problem.
Ok. It sounds like we're both happy for this to go, then.

While it's unfortunate that architectures differ, I think we have
legitimate reasons to differ, and it's preferable to do so. We have a
different set of constraints (e.g. supporting EFI memory maps), and
following the PPC approach creates longer-term issues for us, making it
harder to do the right thing consistently.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+ memblock_reserve(crash_base, crash_size);
This will mean that the crash kernel will have a permanent alias in the linear
map which is vulnerable to being clobbered. There could also be issues
with mismatched attributes in future.
Good point, I've never thought of that except making the memblock
region "reserved."
Post by Mark Rutland
We're probably ok for now, but in future we'll likely want to fix this
up to remove the region (or mark it nomap), and only map it temporarily
when loading things into the region.
commit 9b492cf58077
Date: Mon May 23 16:24:10 2016 -0700
kexec: introduce a protection mechanism for the crashkernel
reserved memory
To make best use of this framework, I'd like to re-use set_memory_ro/rx()
instead of removing the region from linear mapping. But to do so,
we need to
* make memblock_isolate_range() global,
* allow set_memory_ro/rx() to be applied to regions in linear mapping
since set_memory_ro/rx() works only on page-level mappings.
What do you think?
(See my tentative solution below.)
Great! I think it would be better to follow the approach of
mark_rodata_ro(), rather than opening up set_memory_*(), but otherwise,
it looks like it should work.

Either way, this still leaves us with an RO alias on crashed cores (and
potential cache-attribute mismatches in future). Do we need to read from
the region later, or could we unmap it entirely?

Thanks,
Mark.
AKASHI Takahiro
2017-01-17 08:20:44 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+ memblock_reserve(crash_base, crash_size);
This will mean that the crash kernel will have a permanent alias in the linear
map which is vulnerable to being clobbered. There could also be issues
with mismatched attributes in future.
Good point, I've never thought of that except making the memblock
region "reserved."
Post by Mark Rutland
We're probably ok for now, but in future we'll likely want to fix this
up to remove the region (or mark it nomap), and only map it temporarily
when loading things into the region.
commit 9b492cf58077
Date: Mon May 23 16:24:10 2016 -0700
kexec: introduce a protection mechanism for the crashkernel
reserved memory
To make best use of this framework, I'd like to re-use set_memory_ro/rx()
instead of removing the region from linear mapping. But to do so,
we need to
* make memblock_isolate_range() global,
* allow set_memory_ro/rx() to be applied to regions in linear mapping
since set_memory_ro/rx() works only on page-level mappings.
What do you think?
(See my tentative solution below.)
Great! I think it would be better to follow the approach of
mark_rodata_ro(), rather than opening up set_memory_*(), but otherwise,
it looks like it should work.
I'm not quite sure what the approach of mark_rodata_ro() means, but
I found that using create_mapping_late() may cause two problems:

1) It fails when the PTE_CONT bits mismatch between an old and a new mmu
   entry. This can happen, say, if the memory range for the crash dump
   kernel starts in the middle of _contiguous_ pages.

2) The control code page, of one-page size, is still written out in
   machine_kexec(), which is called at a crash, and this means that
   the range must be writable even after kexec_load(). But
   create_mapping_late() does not handle the case of changing attributes
   for a single page that sits in a _section_ mapping.
   We cannot create a single-page mapping for the control page since the
   address of that page is not determined at boot time.

As for (1), we need to call memblock_isolate_range() to make the region
an independent one.
Post by Mark Rutland
Either way, this still leaves us with an RO alias on crashed cores (and
potential cache attribute mismatches in future). Do we need to read from
the region later,
I believe not, but the region must be _writable_, as I mentioned in (2) above.
To avoid this issue, we have to move the copying of the control code page
into machine_kexec_prepare(), which is called from kexec_load(), while
the region is still writable anyway.
I want Geoff to confirm that this change is safe.

(See my second solution below.)
Post by Mark Rutland
or could we unmap it entirely?
Given the change above, I think we can.
Is there any existing code to re-use, especially for unmapping?
Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
===8<===
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index c0fc3d458195..80a52e9aaf73 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -26,8 +26,6 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;

-static unsigned long kimage_start;
-
/**
* kexec_image_info - For debugging output.
*/
@@ -68,7 +66,7 @@ void machine_kexec_cleanup(struct kimage *kimage)
*/
int machine_kexec_prepare(struct kimage *kimage)
{
- kimage_start = kimage->start;
+ void *reboot_code_buffer;

kexec_image_info(kimage);

@@ -77,6 +75,21 @@ int machine_kexec_prepare(struct kimage *kimage)
return -EBUSY;
}

+ reboot_code_buffer =
+ phys_to_virt(page_to_phys(kimage->control_code_page));
+
+ /*
+ * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
+ * after the kernel is shut down.
+ */
+ memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
+ arm64_relocate_new_kernel_size);
+
+ /* Flush the reboot_code_buffer in preparation for its execution. */
+ __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
+ flush_icache_range((uintptr_t)reboot_code_buffer,
+ arm64_relocate_new_kernel_size);
+
return 0;
}

@@ -147,7 +160,6 @@ static void kexec_segment_flush(const struct kimage *kimage)
void machine_kexec(struct kimage *kimage)
{
phys_addr_t reboot_code_buffer_phys;
- void *reboot_code_buffer;

/*
* New cpus may have become stuck_in_kernel after we loaded the image.
@@ -156,7 +168,6 @@ void machine_kexec(struct kimage *kimage)
!WARN_ON(kimage == kexec_crash_image));

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
- reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);

kexec_image_info(kimage);

@@ -164,26 +175,12 @@ void machine_kexec(struct kimage *kimage)
kimage->control_code_page);
pr_debug("%s:%d: reboot_code_buffer_phys: %pa\n", __func__, __LINE__,
&reboot_code_buffer_phys);
- pr_debug("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
- reboot_code_buffer);
pr_debug("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
arm64_relocate_new_kernel);
pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
__func__, __LINE__, arm64_relocate_new_kernel_size,
arm64_relocate_new_kernel_size);

- /*
- * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
- * after the kernel is shut down.
- */
- memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
- arm64_relocate_new_kernel_size);
-
- /* Flush the reboot_code_buffer in preparation for its execution. */
- __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
- flush_icache_range((uintptr_t)reboot_code_buffer,
- arm64_relocate_new_kernel_size);
-
/* Flush the kimage list and its buffers. */
kexec_list_flush(kimage);

@@ -206,7 +203,7 @@ void machine_kexec(struct kimage *kimage)
*/

cpu_soft_restart(kimage != kexec_crash_image,
- reboot_code_buffer_phys, kimage->head, kimage_start, 0);
+ reboot_code_buffer_phys, kimage->head, kimage->start, 0);

BUG(); /* Should never get here. */
}
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 569ec3325bc8..e4cc170edc0c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -90,6 +90,7 @@ early_param("initrd", early_initrd);
static void __init reserve_crashkernel(void)
{
unsigned long long crash_size, crash_base;
+ int start_rgn, end_rgn;
int ret;

ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
@@ -120,6 +121,8 @@ static void __init reserve_crashkernel(void)
return;
}
}
+ memblock_isolate_range(&memblock.memory, crash_base, crash_size,
+ &start_rgn, &end_rgn);
memblock_reserve(crash_base, crash_size);

pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..b7c75845407a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,8 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/ioport.h>
+#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
@@ -817,3 +819,27 @@ int pmd_clear_huge(pmd_t *pmd)
pmd_clear(pmd);
return 1;
}
+
+#ifdef CONFIG_KEXEC_CORE
+void arch_kexec_protect_crashkres(void)
+{
+ flush_tlb_all();
+
+ create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL_RO);
+
+ /* flush the TLBs after updating live kernel mappings */
+ flush_tlb_all();
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+ flush_tlb_all();
+
+ create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL);
+
+ /* flush the TLBs after updating live kernel mappings */
+ flush_tlb_all();
+}
+#endif
===>8===
Mark Rutland
2017-01-17 11:54:42 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
Great! I think it would be better to follow the approach of
mark_rodata_ro(), rather than opening up set_memory_*(), but otherwise,
it looks like it should work.
I'm not quite sure what the approach of mark_rodata_ro() means, but
1) it fails when PTE_CONT bits mismatch between an old and new mmu entry.
This can happen, say, if the memory range for crash dump kernel
starts in the mid of _continuous_ pages.
That should only happen if we try to remap a segment different to what
we originally mapped.

I was intending that we'd explicitly map the reserved region separately
in the boot path, like we do for the kernel segments in map_kernel(). We
would allow sections and/or CONT entries.

Then, in __map_memblock() we'd skip that range, as we do for the
linear-map alias of the kernel image.

That way, we can later use create_mapping_late() for that same region, and
it should handle sections and/or CONT entries in exactly the same way as
it does for the kernel image segments in mark_rodata_ro().
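Concretely, the approach described above might look something like the following untested sketch (the __create_pgd_mapping() argument list is assumed from the hunks quoted elsewhere in this thread, and passing false for the last parameter is an assumption meaning "sections/CONT allowed"):

```c
/*
 * In map_mem(): map the crash kernel region explicitly, as map_kernel()
 * does for the kernel segments, so that create_mapping_late() can later
 * change its permissions the same way mark_rodata_ro() does.
 */
#ifdef CONFIG_KEXEC_CORE
	if (crashk_res.end)
		__create_pgd_mapping(pgd, crashk_res.start,
				     __phys_to_virt(crashk_res.start),
				     resource_size(&crashk_res),
				     PAGE_KERNEL, early_pgtable_alloc,
				     false /* allow sections/CONT */);
#endif

/*
 * ... and in __map_memblock(): skip that range, as is already done for
 * the linear-map alias of the kernel image.
 */
#ifdef CONFIG_KEXEC_CORE
	if (crashk_res.end && start >= crashk_res.start &&
	    end <= crashk_res.end + 1)
		return;
#endif
```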
Post by AKASHI Takahiro
2) The control code page, of one-page size, is still written out in
machine_kexec() which is called at a crash, and this means that
the range must be writable even after kexec_load(), but
create_mapping_late() does not handle a case of changing attributes
for a single page which is in _section_ mapping.
We cannot make single-page mapping for the control page since the address
of that page is not determined at the boot time.
That is a problem. I'm not sure I follow how set_memory_*() helps here
though?
Post by AKASHI Takahiro
As for (1), we need to call memblock_isolate_range() to make the region
an independent one.
Post by Mark Rutland
Either way, this still leaves us with an RO alias on crashed cores (and
potential cache attribute mismatches in future). Do we need to read from
the region later,
I believe not, but the region must be _writable_ as I mentioned in (2) above.
To avoid this issue, we have to move copying the control code page
to machine_kexec_prepare() which is called in kexec_load() and so
the region is writable anyway then.
I want Geoff to affirm that this change is safe.
(See my second solution below.)
From a quick scan that looks ok.
Post by AKASHI Takahiro
Post by Mark Rutland
or could we unmap it entirely?
given the change above, I think we can.
Great!
Post by AKASHI Takahiro
Is there any code to re-use especially for unmapping?
I don't think we have much code useful for unmapping. We could re-use
create_mapping_late() for this, passing a set of prot bits that marks the
entries invalid (e.g. a PAGE_KERNEL_INVALID).

We'd have to perform the TLB invalidation ourselves, but that shouldn't
be too painful.
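For illustration, such an unmap could be sketched as below (untested; PAGE_KERNEL_INVALID is assumed to be a pgprot value with the valid bit clear, approximated here by __pgprot(0), and create_mapping_late() is the mmu.c-internal helper used in the patches above):

```c
void arch_kexec_protect_crashkres(void)
{
	/*
	 * Remove the linear-map alias of the crash kernel region by
	 * rewriting its entries with an invalid prot value.
	 */
	create_mapping_late(crashk_res.start,
			    __phys_to_virt(crashk_res.start),
			    resource_size(&crashk_res),
			    __pgprot(0));	/* i.e. "PAGE_KERNEL_INVALID" */

	/* We must invalidate stale translations ourselves. */
	flush_tlb_all();
}
```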

Thanks,
Mark.
AKASHI Takahiro
2017-01-19 09:49:42 UTC
Permalink
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Great! I think it would be better to follow the approach of
mark_rodata_ro(), rather than opening up set_memory_*(), but otherwise,
it looks like it should work.
I'm not quite sure what the approach of mark_rodata_ro() means, but
1) it fails when PTE_CONT bits mismatch between an old and new mmu entry.
This can happen, say, if the memory range for crash dump kernel
starts in the mid of _continuous_ pages.
That should only happen if we try to remap a segment different to what
we originally mapped.
I was intending that we'd explicitly map the reserved region separately
in the boot path, like we do for kernel segments in map_kernel(). We
would allow sections and/or CONT entries.
Then, in __map_memblock() we'd then skip that range as we do for the
linear map alias of the kernel image.
That way, we can later use create_mapping_late for that same region, and
it should handle sections and/or CONT entries in the exact same way as
it does for the kernel image segments in mark_rodata_ro().
I see.
Which one do you prefer, yours above or my (second) solution?
Either way, they do almost the same thing in terms of mapping.
Post by Mark Rutland
Post by AKASHI Takahiro
2) The control code page, of one-page size, is still written out in
machine_kexec() which is called at a crash, and this means that
the range must be writable even after kexec_load(), but
create_mapping_late() does not handle a case of changing attributes
for a single page which is in _section_ mapping.
We cannot make single-page mapping for the control page since the address
of that page is not determined at the boot time.
That is a problem. I'm not sure I follow how set_memory_*() helps here
though?
Post by AKASHI Takahiro
As for (1), we need to call memblock_isolate_range() to make the region
an independent one.
Post by Mark Rutland
Either way, this still leaves us with an RO alias on crashed cores (and
potential cache attribute mismatches in future). Do we need to read from
the region later,
I believe not, but the region must be _writable_ as I mentioned in (2) above.
To avoid this issue, we have to move copying the control code page
to machine_kexec_prepare() which is called in kexec_load() and so
the region is writable anyway then.
I want Geoff to affirm that this change is safe.
(See my second solution below.)
From a quick scan that looks ok.
Post by AKASHI Takahiro
Post by Mark Rutland
or could we unmap it entirely?
given the change above, I think we can.
I'm now asking Geoff ...
Post by Mark Rutland
Great!
Post by AKASHI Takahiro
Is there any code to re-use especially for unmapping?
I don't think we have much code useful for unmapping. We could re-use
create_mapping_late for this, passing a set of prot bits that means the
entries are invalid (e.g. have a PAGE_KERNEL_INVALID).
Do you really think that we should totally invalidate the MMU entries?
I guess that, given proper cache & TLB flush operations, an RO attribute is
good enough for memory consistency, no?
(Nothing accesses the region, as I said, except in the case of re-loading
the crash dump kernel.)
Post by Mark Rutland
We'd have to perform the TLB invalidation ourselves, but that shouldn't
be too painful.
Do we need to invalidate TLBs not only before but also after changing
permission attributes as mark_rodata_ro() does?

-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Post by AKASHI Takahiro
===8<===
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index c0fc3d458195..80a52e9aaf73 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -26,8 +26,6 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;
-static unsigned long kimage_start;
-
/**
* kexec_image_info - For debugging output.
*/
@@ -68,7 +66,7 @@ void machine_kexec_cleanup(struct kimage *kimage)
*/
int machine_kexec_prepare(struct kimage *kimage)
{
- kimage_start = kimage->start;
+ void *reboot_code_buffer;
kexec_image_info(kimage);
@@ -77,6 +75,21 @@ int machine_kexec_prepare(struct kimage *kimage)
return -EBUSY;
}
+ reboot_code_buffer =
+ phys_to_virt(page_to_phys(kimage->control_code_page));
+
+ /*
+ * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
+ * after the kernel is shut down.
+ */
+ memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
+ arm64_relocate_new_kernel_size);
+
+ /* Flush the reboot_code_buffer in preparation for its execution. */
+ __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
+ flush_icache_range((uintptr_t)reboot_code_buffer,
+ arm64_relocate_new_kernel_size);
+
return 0;
}
@@ -147,7 +160,6 @@ static void kexec_segment_flush(const struct kimage *kimage)
void machine_kexec(struct kimage *kimage)
{
phys_addr_t reboot_code_buffer_phys;
- void *reboot_code_buffer;
/*
* New cpus may have become stuck_in_kernel after we loaded the image.
@@ -156,7 +168,6 @@ void machine_kexec(struct kimage *kimage)
!WARN_ON(kimage == kexec_crash_image));
reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
- reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
kexec_image_info(kimage);
@@ -164,26 +175,12 @@ void machine_kexec(struct kimage *kimage)
kimage->control_code_page);
pr_debug("%s:%d: reboot_code_buffer_phys: %pa\n", __func__, __LINE__,
&reboot_code_buffer_phys);
- pr_debug("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
- reboot_code_buffer);
pr_debug("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
arm64_relocate_new_kernel);
pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
__func__, __LINE__, arm64_relocate_new_kernel_size,
arm64_relocate_new_kernel_size);
- /*
- * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
- * after the kernel is shut down.
- */
- memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
- arm64_relocate_new_kernel_size);
-
- /* Flush the reboot_code_buffer in preparation for its execution. */
- __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
- flush_icache_range((uintptr_t)reboot_code_buffer,
- arm64_relocate_new_kernel_size);
-
/* Flush the kimage list and its buffers. */
kexec_list_flush(kimage);
@@ -206,7 +203,7 @@ void machine_kexec(struct kimage *kimage)
*/
cpu_soft_restart(kimage != kexec_crash_image,
- reboot_code_buffer_phys, kimage->head, kimage_start, 0);
+ reboot_code_buffer_phys, kimage->head, kimage->start, 0);
BUG(); /* Should never get here. */
}
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 569ec3325bc8..e4cc170edc0c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -90,6 +90,7 @@ early_param("initrd", early_initrd);
static void __init reserve_crashkernel(void)
{
unsigned long long crash_size, crash_base;
+ int start_rgn, end_rgn;
int ret;
ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
@@ -120,6 +121,8 @@ static void __init reserve_crashkernel(void)
return;
}
}
+ memblock_isolate_range(&memblock.memory, crash_base, crash_size,
+ &start_rgn, &end_rgn);
memblock_reserve(crash_base, crash_size);
pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..b7c75845407a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,8 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/ioport.h>
+#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
@@ -817,3 +819,27 @@ int pmd_clear_huge(pmd_t *pmd)
pmd_clear(pmd);
return 1;
}
+
+#ifdef CONFIG_KEXEC_CORE
+void arch_kexec_protect_crashkres(void)
+{
+ flush_tlb_all();
+
+ create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL_RO);
+
+ /* flush the TLBs after updating live kernel mappings */
+ flush_tlb_all();
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+ flush_tlb_all();
+
+ create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL);
+
+ /* flush the TLBs after updating live kernel mappings */
+ flush_tlb_all();
+}
+#endif
===>8===
Mark Rutland
2017-01-19 11:28:50 UTC
Post by Will Deacon
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Great! I think it would be better to follow the approach of
mark_rodata_ro(), rather than opening up set_memory_*(), but otherwise,
it looks like it should work.
I'm not quite sure what the approach of mark_rodata_ro() means, but
1) it fails when PTE_CONT bits mismatch between an old and new mmu entry.
This can happen, say, if the memory range for crash dump kernel
starts in the middle of _contiguous_ pages.
That should only happen if we try to remap a segment different to what
we originally mapped.
I was intending that we'd explicitly map the reserved region separately
in the boot path, like we do for kernel segments in map_kernel(). We
would allow sections and/or CONT entries.
Then, in __map_memblock() we'd then skip that range as we do for the
linear map alias of the kernel image.
That way, we can later use create_mapping_late for that same region, and
it should handle sections and/or CONT entries in the exact same way as
it does for the kernel image segments in mark_rodata_ro().
I see.
Which one do you prefer, yours above or my (second) solution?
Either way, they do almost the same thing in terms of mapping.
While both should work, I'd prefer to match the existing map_kernel()
logic, (i.e. my suggestion above), for consistency.
Post by Will Deacon
Post by Mark Rutland
I don't think we have much code useful for unmapping. We could re-use
create_mapping_late for this, passing a set of prot bits that means the
entries are invalid (e.g. have a PAGE_KERNEL_INVALID).
Do you really think that we should totally invalidate mmu entries?
I guess that, given proper cache & TLB flush operations, RO attribute is
good enough for memory consistency, no?
(Nothing accesses the region, as I said, except in the case of re-loading
crash dump kernel though.)
My worry is that the first kernel and kdump kernel may map (portions of)
the region with potentially conflicting memory attributes. So it would
be necessary to completely unmap the region.

You raise a good point that this would also mean we need to perform some
cache maintenance, which makes that a little more painful. We'd need a
sequence like:

* Unmap the region
* TLB invalidation
* Remap the region with non-cacheable attributes
* Cache maintenance
* Unmap the region
* TLB invalidation
Post by Will Deacon
Post by Mark Rutland
We'd have to perform the TLB invalidation ourselves, but that shouldn't
be too painful.
Do we need to invalidate TLBs not only before but also after changing
permission attributes as mark_rodata_ro() does?
I believe we'd only have to perform the TLB invalidation after the
change of attributes.

Thanks,
Mark.
AKASHI Takahiro
2017-01-23 09:51:46 UTC
Mark,
Post by Mark Rutland
Post by Will Deacon
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Great! I think it would be better to follow the approach of
mark_rodata_ro(), rather than opening up set_memory_*(), but otherwise,
it looks like it should work.
I'm not quite sure what the approach of mark_rodata_ro() means, but
1) it fails when PTE_CONT bits mismatch between an old and new mmu entry.
This can happen, say, if the memory range for crash dump kernel
starts in the middle of _contiguous_ pages.
That should only happen if we try to remap a segment different to what
we originally mapped.
I was intending that we'd explicitly map the reserved region separately
in the boot path, like we do for kernel segments in map_kernel(). We
would allow sections and/or CONT entries.
Then, in __map_memblock() we'd then skip that range as we do for the
linear map alias of the kernel image.
That way, we can later use create_mapping_late for that same region, and
it should handle sections and/or CONT entries in the exact same way as
it does for the kernel image segments in mark_rodata_ro().
I see.
Which one do you prefer, yours above or my (second) solution?
Either way, they do almost the same thing in terms of mapping.
While both should work, I'd prefer to match the existing map_kernel()
logic, (i.e. my suggestion above), for consistency.
OK
Post by Mark Rutland
Post by Will Deacon
Post by Mark Rutland
I don't think we have much code useful for unmapping. We could re-use
create_mapping_late for this, passing a set of prot bits that means the
entries are invalid (e.g. have a PAGE_KERNEL_INVALID).
Do you really think that we should totally invalidate mmu entries?
I guess that, given proper cache & TLB flush operations, RO attribute is
good enough for memory consistency, no?
(Nothing accesses the region, as I said, except in the case of re-loading
crash dump kernel though.)
My worry is that the first kernel and kdump kernel may map (portions of)
the region with potentially conflicting memory attributes. So it would
be necessary to completely unmap the region.
I think that this can happen only if the second kernel boots up,
leaving non-crashed cpus still running for some reason.
Post by Mark Rutland
You raise a good point that this would also mean we need to perform some
cache maintenance, which makes that a little more painful. We'd need a
* Unmap the region
* TLB invalidation
* Remap the region with non-cacheable attributes
* Cache maintenance
* Unmap the region
* TLB invalidation
I don't get why we need to remap the region and do cache
maintenance here. Please elaborate a bit more?
My current implementation of arch_kexec_protect_crashkres() is:

kexec_segment_flush(kexec_crash_image);
create_mapping_late(crashk_res.start, ..., __pgprot(0));
or PAGE_KERNEL_INVALID
flush_tlb_all();

kexec_segment_flush() will eventually do dcache-flush for all the modified
data in crash dump kernel memory.
Post by Mark Rutland
Post by Will Deacon
Post by Mark Rutland
We'd have to perform the TLB invalidation ourselves, but that shouldn't
be too painful.
Do we need to invalidate TLBs not only before but also after changing
permission attributes as mark_rodata_ro() does?
I believe we'd only have to perform the TLB invalidation after the
change of attributes.
OK

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Mark Rutland
2017-01-23 10:23:15 UTC
Post by AKASHI Takahiro
Mark,
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
I don't think we have much code useful for unmapping. We could re-use
create_mapping_late for this, passing a set of prot bits that means the
entries are invalid (e.g. have a PAGE_KERNEL_INVALID).
Do you really think that we should totally invalidate mmu entries?
I guess that, given proper cache & TLB flush operations, RO attribute is
good enough for memory consistency, no?
(Nothing accesses the region, as I said, except in the case of re-loading
crash dump kernel though.)
My worry is that the first kernel and kdump kernel may map (portions of)
the region with potentially conflicting memory attributes. So it would
be necessary to completely unmap the region.
I think that this can happen only if the second kernel boots up,
leaving non-crashed cpus still running for some reason.
Yes. I was considering a kdump case where a secondary was stuck in the
first kernel.
Post by AKASHI Takahiro
Post by Mark Rutland
You raise a good point that this would also mean we need to perform some
cache maintenance, which makes that a little more painful. We'd need a
* Unmap the region
* TLB invalidation
* Remap the region with non-cacheable attributes
* Cache maintenance
* Unmap the region
* TLB invalidation
I don't get why we need to remap the region and do cache
maintenance here. Please elaborate a bit more?
I think I was wrong, and we don't need to. Sorry about that.

My thought was that to ensure that there aren't stale lines with
differing attributes, we'd need to do a clean+invalidate while the
caches are guaranteed not to allocate anything further. Hence, we'd need
to use a non-cacheable mapping to perform the clean+invalidate.

However, I now think that so long as we unmap the range, this shouldn't
matter. The new kernel can perform the maintenance if it wishes to use
different attributes, similarly to what the first kernel must do per the
boot protocol.
Post by AKASHI Takahiro
kexec_segment_flush(kexec_crash_image);
create_mapping_late(crashk_res.start, ..., __pgprot(0));
or PAGE_KERNEL_INVALID
flush_tlb_all();
kexec_segment_flush() will eventually do dcache-flush for all the modified
data in crash dump kernel memory.
I now think this should be fine, per the above.

Thanks,
Mark.
AKASHI Takahiro
2017-01-24 07:55:00 UTC
Post by Mark Rutland
Post by AKASHI Takahiro
Mark,
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
I don't think we have much code useful for unmapping. We could re-use
create_mapping_late for this, passing a set of prot bits that means the
entries are invalid (e.g. have a PAGE_KERNEL_INVALID).
Do you really think that we should totally invalidate mmu entries?
I guess that, given proper cache & TLB flush operations, RO attribute is
good enough for memory consistency, no?
(Nothing accesses the region, as I said, except in the case of re-loading
crash dump kernel though.)
My worry is that the first kernel and kdump kernel may map (portions of)
the region with potentially conflicting memory attributes. So it would
be necessary to completely unmap the region.
I think that this can happen only if the second kernel boots up,
leaving non-crashed cpus still running for some reason.
Yes. I was considering a kdump case where a secondary was stuck in the
first kernel.
Post by AKASHI Takahiro
Post by Mark Rutland
You raise a good point that this would also mean we need to perform some
cache maintenance, which makes that a little more painful. We'd need a
* Unmap the region
* TLB invalidation
* Remap the region with non-cacheable attributes
* Cache maintenance
* Unmap the region
* TLB invalidation
I don't get why we need to remap the region and do cache
maintenance here. Please elaborate a bit more?
I think I was wrong, and we don't need to. Sorry about that.
My thought was that to ensure that there aren't stale lines with
differing attributes, we'd need to do a clean+invalidate while the
caches are guaranteed not to allocate anything further. Hence, we'd need
to use a non-cacheable mapping to perform the clean+invalidate.
However, I now think that so long as we unmap the range, this shouldn't
matter. The new kernel can perform the maintenance if it wishes to use
different attributes, similarly to what the first kernel must do per the
boot protocol.
Post by AKASHI Takahiro
kexec_segment_flush(kexec_crash_image);
create_mapping_late(crashk_res.start, ..., __pgprot(0));
or PAGE_KERNEL_INVALID
flush_tlb_all();
kexec_segment_flush() will eventually do dcache-flush for all the modified
data in crash dump kernel memory.
I now think this should be fine, per the above.
OK.
I think I can now see the light at the end of the tunnel :)

-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
AKASHI Takahiro
2016-12-28 04:36:01 UTC
Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
and save registers' status in per-cpu ELF notes before starting crash
dump kernel. See kernel_kexec().
Even if not all secondary cpus have shut down, we do kdump anyway.

As we don't have to make non-boot(crashed) cpus offline (to preserve
correct status of cpus at crash dump) before shutting down, this patch
also adds a variant of smp_send_stop().

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 +++++++++++++++++++++++++-
arch/arm64/include/asm/smp.h | 2 ++
arch/arm64/kernel/machine_kexec.c | 56 ++++++++++++++++++++++++++++++++--
arch/arm64/kernel/smp.c | 63 +++++++++++++++++++++++++++++++++++++++
5 files changed, 160 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/hardirq.h b/arch/arm64/include/asm/hardirq.h
index 8740297dac77..1473fc2f7ab7 100644
--- a/arch/arm64/include/asm/hardirq.h
+++ b/arch/arm64/include/asm/hardirq.h
@@ -20,7 +20,7 @@
#include <linux/threads.h>
#include <asm/irq.h>

-#define NR_IPI 6
+#define NR_IPI 7

typedef struct {
unsigned int __softirq_pending;
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 04744dc5fb61..f40ace1fa21a 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -40,7 +40,47 @@
static inline void crash_setup_regs(struct pt_regs *newregs,
struct pt_regs *oldregs)
{
- /* Empty routine needed to avoid build errors. */
+ if (oldregs) {
+ memcpy(newregs, oldregs, sizeof(*newregs));
+ } else {
+ u64 tmp1, tmp2;
+
+ __asm__ __volatile__ (
+ "stp x0, x1, [%2, #16 * 0]\n"
+ "stp x2, x3, [%2, #16 * 1]\n"
+ "stp x4, x5, [%2, #16 * 2]\n"
+ "stp x6, x7, [%2, #16 * 3]\n"
+ "stp x8, x9, [%2, #16 * 4]\n"
+ "stp x10, x11, [%2, #16 * 5]\n"
+ "stp x12, x13, [%2, #16 * 6]\n"
+ "stp x14, x15, [%2, #16 * 7]\n"
+ "stp x16, x17, [%2, #16 * 8]\n"
+ "stp x18, x19, [%2, #16 * 9]\n"
+ "stp x20, x21, [%2, #16 * 10]\n"
+ "stp x22, x23, [%2, #16 * 11]\n"
+ "stp x24, x25, [%2, #16 * 12]\n"
+ "stp x26, x27, [%2, #16 * 13]\n"
+ "stp x28, x29, [%2, #16 * 14]\n"
+ "mov %0, sp\n"
+ "stp x30, %0, [%2, #16 * 15]\n"
+
+ "/* faked current PSTATE */\n"
+ "mrs %0, CurrentEL\n"
+ "mrs %1, SPSEL\n"
+ "orr %0, %0, %1\n"
+ "mrs %1, DAIF\n"
+ "orr %0, %0, %1\n"
+ "mrs %1, NZCV\n"
+ "orr %0, %0, %1\n"
+ /* pc */
+ "adr %1, 1f\n"
+ "1:\n"
+ "stp %1, %0, [%2, #16 * 16]\n"
+ : "=&r" (tmp1), "=&r" (tmp2)
+ : "r" (newregs)
+ : "memory"
+ );
+ }
}

#endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index d050d720a1b4..cea009f2657d 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -148,6 +148,8 @@ static inline void cpu_panic_kernel(void)
*/
bool cpus_are_stuck_in_kernel(void);

+extern void smp_send_crash_stop(void);
+
#endif /* ifndef __ASSEMBLY__ */

#endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..c60346d33bb1 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -9,6 +9,9 @@
* published by the Free Software Foundation.
*/

+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
#include <linux/kexec.h>
#include <linux/smp.h>

@@ -22,6 +25,7 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;

+static bool in_crash_kexec;
static unsigned long kimage_start;

/**
@@ -148,7 +152,8 @@ void machine_kexec(struct kimage *kimage)
/*
* New cpus may have become stuck_in_kernel after we loaded the image.
*/
- BUG_ON(cpus_are_stuck_in_kernel() || (num_online_cpus() > 1));
+ BUG_ON((cpus_are_stuck_in_kernel() || (num_online_cpus() > 1)) &&
+ !WARN_ON(in_crash_kexec));

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
@@ -200,13 +205,58 @@ void machine_kexec(struct kimage *kimage)
* relocation is complete.
*/

- cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
+ cpu_soft_restart(!in_crash_kexec, reboot_code_buffer_phys, kimage->head,
kimage_start, 0);

BUG(); /* Should never get here. */
}

+static void machine_kexec_mask_interrupts(void)
+{
+ unsigned int i;
+ struct irq_desc *desc;
+
+ for_each_irq_desc(i, desc) {
+ struct irq_chip *chip;
+ int ret;
+
+ chip = irq_desc_get_chip(desc);
+ if (!chip)
+ continue;
+
+ /*
+ * First try to remove the active state. If this
+ * fails, try to EOI the interrupt.
+ */
+ ret = irq_set_irqchip_state(i, IRQCHIP_STATE_ACTIVE, false);
+
+ if (ret && irqd_irq_inprogress(&desc->irq_data) &&
+ chip->irq_eoi)
+ chip->irq_eoi(&desc->irq_data);
+
+ if (chip->irq_mask)
+ chip->irq_mask(&desc->irq_data);
+
+ if (chip->irq_disable && !irqd_irq_disabled(&desc->irq_data))
+ chip->irq_disable(&desc->irq_data);
+ }
+}
+
+/**
+ * machine_crash_shutdown - shutdown non-crashing cpus and save registers
+ */
void machine_crash_shutdown(struct pt_regs *regs)
{
- /* Empty routine needed to avoid build errors. */
+ local_irq_disable();
+
+ in_crash_kexec = true;
+
+ /* shutdown non-crashing cpus */
+ smp_send_crash_stop();
+
+ /* for crashing cpu */
+ crash_save_cpu(regs, smp_processor_id());
+ machine_kexec_mask_interrupts();
+
+ pr_info("Starting crashdump kernel...\n");
}
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index cb87234cfcf2..446c6d48f8ec 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
#include <linux/completion.h>
#include <linux/of.h>
#include <linux/irq_work.h>
+#include <linux/kexec.h>

#include <asm/alternative.h>
#include <asm/atomic.h>
@@ -74,6 +75,7 @@ enum ipi_msg_type {
IPI_RESCHEDULE,
IPI_CALL_FUNC,
IPI_CPU_STOP,
+ IPI_CPU_CRASH_STOP,
IPI_TIMER,
IPI_IRQ_WORK,
IPI_WAKEUP
@@ -753,6 +755,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
S(IPI_RESCHEDULE, "Rescheduling interrupts"),
S(IPI_CALL_FUNC, "Function call interrupts"),
S(IPI_CPU_STOP, "CPU stop interrupts"),
+ S(IPI_CPU_CRASH_STOP, "CPU stop (for crash dump) interrupts"),
S(IPI_TIMER, "Timer broadcast interrupts"),
S(IPI_IRQ_WORK, "IRQ work interrupts"),
S(IPI_WAKEUP, "CPU wake-up interrupts"),
@@ -827,6 +830,29 @@ static void ipi_cpu_stop(unsigned int cpu)
cpu_relax();
}

+#ifdef CONFIG_KEXEC_CORE
+static atomic_t waiting_for_crash_ipi;
+#endif
+
+static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
+{
+#ifdef CONFIG_KEXEC_CORE
+ crash_save_cpu(regs, cpu);
+
+ atomic_dec(&waiting_for_crash_ipi);
+
+ local_irq_disable();
+
+#ifdef CONFIG_HOTPLUG_CPU
+ if (cpu_ops[cpu]->cpu_die)
+ cpu_ops[cpu]->cpu_die(cpu);
+#endif
+
+ /* just in case */
+ cpu_park_loop();
+#endif
+}
+
/*
* Main handler for inter-processor interrupts
*/
@@ -857,6 +883,15 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
irq_exit();
break;

+ case IPI_CPU_CRASH_STOP:
+ if (IS_ENABLED(CONFIG_KEXEC_CORE)) {
+ irq_enter();
+ ipi_cpu_crash_stop(cpu, regs);
+
+ unreachable();
+ }
+ break;
+
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
case IPI_TIMER:
irq_enter();
@@ -929,6 +964,34 @@ void smp_send_stop(void)
cpumask_pr_args(cpu_online_mask));
}

+#ifdef CONFIG_KEXEC_CORE
+void smp_send_crash_stop(void)
+{
+ cpumask_t mask;
+ unsigned long timeout;
+
+ if (num_online_cpus() == 1)
+ return;
+
+ cpumask_copy(&mask, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &mask);
+
+ atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
+
+ pr_crit("SMP: stopping secondary CPUs\n");
+ smp_cross_call(&mask, IPI_CPU_CRASH_STOP);
+
+ /* Wait up to one second for other CPUs to stop */
+ timeout = USEC_PER_SEC;
+ while ((atomic_read(&waiting_for_crash_ipi) > 0) && timeout--)
+ udelay(1);
+
+ if (atomic_read(&waiting_for_crash_ipi) > 0)
+ pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
+ cpumask_pr_args(cpu_online_mask));
+}
+#endif
+
/*
* not supported here
*/
--
2.11.0
Will Deacon
2017-01-10 11:32:48 UTC
Post by AKASHI Takahiro
Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
and save registers' status in per-cpu ELF notes before starting crash
dump kernel. See kernel_kexec().
Even if not all secondary cpus have shut down, we do kdump anyway.
As we don't have to make non-boot(crashed) cpus offline (to preserve
correct status of cpus at crash dump) before shutting down, this patch
also adds a variant of smp_send_stop().
---
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 +++++++++++++++++++++++++-
arch/arm64/include/asm/smp.h | 2 ++
arch/arm64/kernel/machine_kexec.c | 56 ++++++++++++++++++++++++++++++++--
arch/arm64/kernel/smp.c | 63 +++++++++++++++++++++++++++++++++++++++
5 files changed, 160 insertions(+), 5 deletions(-)
[...]
Post by AKASHI Takahiro
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index d050d720a1b4..cea009f2657d 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -148,6 +148,8 @@ static inline void cpu_panic_kernel(void)
*/
bool cpus_are_stuck_in_kernel(void);
+extern void smp_send_crash_stop(void);
+
#endif /* ifndef __ASSEMBLY__ */
#endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..c60346d33bb1 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -9,6 +9,9 @@
* published by the Free Software Foundation.
*/
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
#include <linux/kexec.h>
#include <linux/smp.h>
@@ -22,6 +25,7 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;
+static bool in_crash_kexec;
Do you actually need this bool? Why not call kexec_crash_loaded() instead?

Will
AKASHI Takahiro
2017-01-11 06:36:28 UTC
Hi Will,
Post by Will Deacon
Post by AKASHI Takahiro
Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
and save registers' status in per-cpu ELF notes before starting crash
dump kernel. See kernel_kexec().
Even if not all secondary cpus have shut down, we do kdump anyway.
As we don't have to make non-boot(crashed) cpus offline (to preserve
correct status of cpus at crash dump) before shutting down, this patch
also adds a variant of smp_send_stop().
---
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 +++++++++++++++++++++++++-
arch/arm64/include/asm/smp.h | 2 ++
arch/arm64/kernel/machine_kexec.c | 56 ++++++++++++++++++++++++++++++++--
arch/arm64/kernel/smp.c | 63 +++++++++++++++++++++++++++++++++++++++
5 files changed, 160 insertions(+), 5 deletions(-)
[...]
Post by AKASHI Takahiro
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index d050d720a1b4..cea009f2657d 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -148,6 +148,8 @@ static inline void cpu_panic_kernel(void)
*/
bool cpus_are_stuck_in_kernel(void);
+extern void smp_send_crash_stop(void);
+
#endif /* ifndef __ASSEMBLY__ */
#endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..c60346d33bb1 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -9,6 +9,9 @@
* published by the Free Software Foundation.
*/
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
#include <linux/kexec.h>
#include <linux/smp.h>
@@ -22,6 +25,7 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;
+static bool in_crash_kexec;
Do you actually need this bool? Why not call kexec_crash_loaded() instead?
The two have different meanings:
"in_crash_kexec" indicates that kdump is taking place, while
kexec_crash_loaded() tells us only whether crash dump kernel has been
loaded or not.

It is crucial to distinguish them especially for machine_kexec()
which can be called on normal kexec even if kdump has been set up.

Thanks,
-Takahiro AKASHI
Post by Will Deacon
Will
Will Deacon
2017-01-11 10:54:05 UTC
Post by AKASHI Takahiro
Post by Will Deacon
Post by AKASHI Takahiro
@@ -22,6 +25,7 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;
+static bool in_crash_kexec;
Do you actually need this bool? Why not call kexec_crash_loaded() instead?
"in_crash_kexec" indicates that kdump is taking place, while
kexec_crash_loaded() tells us only whether crash dump kernel has been
loaded or not.
It is crucial to distinguish them especially for machine_kexec()
which can be called on normal kexec even if kdump has been set up.
Ah, I see. So how about just doing:

if (kimage == kexec_crash_image)

in machine_kexec?

Will
AKASHI Takahiro
2017-01-12 04:21:44 UTC
Post by Will Deacon
Post by AKASHI Takahiro
Post by Will Deacon
Post by AKASHI Takahiro
@@ -22,6 +25,7 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;
+static bool in_crash_kexec;
Do you actually need this bool? Why not call kexec_crash_loaded() instead?
"in_crash_kexec" indicates that kdump is taking place, while
kexec_crash_loaded() tells us only whether crash dump kernel has been
loaded or not.
It is crucial to distinguish them especially for machine_kexec()
which can be called on normal kexec even if kdump has been set up.
if (kimage == kexec_crash_image)
in machine_kexec?
Yeah, it should work.
Do you want to merge the following hunk,
or expect that I will re-send the whole patch series
(with other changes if any)?

-Takahiro AKASHI
Post by Will Deacon
Will
===8<==
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 994fe0bc5cc0..c0fc3d458195 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -26,7 +26,6 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;

-static bool in_crash_kexec;
static unsigned long kimage_start;

/**
@@ -154,7 +153,7 @@ void machine_kexec(struct kimage *kimage)
* New cpus may have become stuck_in_kernel after we loaded the image.
*/
BUG_ON((cpus_are_stuck_in_kernel() || (num_online_cpus() > 1)) &&
- !WARN_ON(in_crash_kexec));
+ !WARN_ON(kimage == kexec_crash_image));

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
@@ -206,8 +205,8 @@ void machine_kexec(struct kimage *kimage)
* relocation is complete.
*/

- cpu_soft_restart(!in_crash_kexec, reboot_code_buffer_phys, kimage->head,
- kimage_start, 0);
+ cpu_soft_restart(kimage != kexec_crash_image,
+ reboot_code_buffer_phys, kimage->head, kimage_start, 0);

BUG(); /* Should never get here. */
}
@@ -250,8 +249,6 @@ void machine_crash_shutdown(struct pt_regs *regs)
{
local_irq_disable();

- in_crash_kexec = true;
-
/* shutdown non-crashing cpus */
smp_send_crash_stop();
Will Deacon
2017-01-12 12:01:11 UTC
Post by AKASHI Takahiro
Post by Will Deacon
Post by AKASHI Takahiro
Post by Will Deacon
Post by AKASHI Takahiro
@@ -22,6 +25,7 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;
+static bool in_crash_kexec;
Do you actually need this bool? Why not call kexec_crash_loaded() instead?
"in_crash_kexec" indicates that kdump is taking place, while
kexec_crash_loaded() tells us only whether crash dump kernel has been
loaded or not.
It is crucial to distinguish them especially for machine_kexec()
which can be called on normal kexec even if kdump has been set up.
if (kimage == kexec_crash_image)
in machine_kexec?
Yeah, it should work.
Do you want to merge the following hunk,
or expect that I will re-send the whole patch series
(with other changes if any)?
Thanks, I'll fold it in and shout if I run into any problems. My plan is
to queue this for 4.11.

Will
Will Deacon
2017-01-23 17:46:34 UTC
Permalink
Post by Will Deacon
Post by AKASHI Takahiro
Post by Will Deacon
Post by AKASHI Takahiro
Post by Will Deacon
Post by AKASHI Takahiro
@@ -22,6 +25,7 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;
+static bool in_crash_kexec;
Do you actually need this bool? Why not call kexec_crash_loaded() instead?
"in_crash_kexec" indicates that kdump is taking place, while
kexec_crash_loaded() tells us only whether crash dump kernel has been
loaded or not.
It is crucial to distinguish them especially for machine_kexec()
which can be called on normal kexec even if kdump has been set up.
if (kimage == kexec_crash_image)
in machine_kexec?
Yeah, it should work.
Do you want to merge the following hunk,
or expect that I will re-send the whole patch series
(with other changes if any)?
Thanks, I'll fold it in and shout if I run into any problems. My plan is
to queue this for 4.11.
Given the DT discussion with Mark, I assume you'll post a new version with
this rolled in.

Will
AKASHI Takahiro
2017-01-24 07:52:16 UTC
Permalink
Will,
Post by Will Deacon
Post by Will Deacon
Post by AKASHI Takahiro
Post by Will Deacon
Post by AKASHI Takahiro
Post by Will Deacon
Post by AKASHI Takahiro
@@ -22,6 +25,7 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;
+static bool in_crash_kexec;
Do you actually need this bool? Why not call kexec_crash_loaded() instead?
"in_crash_kexec" indicates that kdump is taking place, while
kexec_crash_loaded() tells us only whether crash dump kernel has been
loaded or not.
It is crucial to distinguish them especially for machine_kexec()
which can be called on normal kexec even if kdump has been set up.
if (kimage == kexec_crash_image)
in machine_kexec?
Yeah, it should work.
Do you want to merge the following hunk,
or expect that I will re-send the whole patch series
(with other changes if any)?
Thanks, I'll fold it in and shout if I run into any problems. My plan is
to queue this for 4.11.
Given the DT discussion with Mark, I assume you'll post a new version with
this rolled in.
Yes, I will!

-Takahiro AKASHI
Post by Will Deacon
Will
AKASHI Takahiro
2016-12-28 04:36:02 UTC
Permalink
In addition to the common VMCOREINFO entries defined in
crash_save_vmcoreinfo_init(), the crash utility needs to know
- kimage_voffset
- PHYS_OFFSET
to examine the contents of a dump file (/proc/vmcore) correctly
due to the introduction of KASLR (CONFIG_RANDOMIZE_BASE) in v4.6.

- VA_BITS
is also required by the makedumpfile command.

arch_crash_save_vmcoreinfo() appends them to the dump file.
More VMCOREINFO entries may be added later.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/kernel/machine_kexec.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index c60346d33bb1..994fe0bc5cc0 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -17,6 +17,7 @@

#include <asm/cacheflush.h>
#include <asm/cpu_ops.h>
+#include <asm/memory.h>
#include <asm/mmu_context.h>

#include "cpu-reset.h"
@@ -260,3 +261,13 @@ void machine_crash_shutdown(struct pt_regs *regs)

pr_info("Starting crashdump kernel...\n");
}
+
+void arch_crash_save_vmcoreinfo(void)
+{
+ VMCOREINFO_NUMBER(VA_BITS);
+ /* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
+ vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
+ kimage_voffset);
+ vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
+ PHYS_OFFSET);
+}
--
2.11.0
AKASHI Takahiro
2016-12-28 04:36:03 UTC
Permalink
Add arch-specific functions to provide a dump file, /proc/vmcore.

This file is in ELF format and its ELF header needs to be prepared by
userspace tools, like kexec-tools, in advance. The primary kernel is
responsible for allocating the region with reserve_elfcorehdr() at boot
time and advertising its location to the crash dump kernel via a new
device-tree property, "linux,elfcorehdr".

Then the crash dump kernel accesses the primary kernel's memory with
copy_oldmem_page(), which feeds the data page-by-page by memremap'ing it,
since that memory does not reside in the linear mapping of the crash
dump kernel.

We also need our own elfcorehdr_read() here since the header is placed
within the crash dump kernel's usable memory.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/Kconfig | 11 +++++++
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 ++++++++++++++++++++++++++++++++++++++++++
arch/arm64/mm/init.c | 54 ++++++++++++++++++++++++++++++++
4 files changed, 137 insertions(+)
create mode 100644 arch/arm64/kernel/crash_dump.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 111742126897..2bd6a1a062b9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -693,6 +693,17 @@ config KEXEC
but it is independent of the system firmware. And like a reboot
you can start any kernel with it, not just Linux.

+config CRASH_DUMP
+ bool "Build kdump crash kernel"
+ help
+	  Generate a crash dump after being started by kexec. This should
+	  normally only be set in special crash dump kernels which are
+ loaded in the main kernel with kexec-tools into a specially
+ reserved region and then later executed after a crash by
+ kdump/kexec.
+
+ For more details see Documentation/kdump/kdump.txt
+
config XEN_DOM0
def_bool y
depends on XEN
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 7d66bbaafc0c..6a7384eee08d 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -50,6 +50,7 @@ arm64-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
arm64-obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
arm64-obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o \
cpu-reset.o
+arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o

obj-y += $(arm64-obj-y) vdso/ probes/
obj-m += $(arm64-obj-m)
diff --git a/arch/arm64/kernel/crash_dump.c b/arch/arm64/kernel/crash_dump.c
new file mode 100644
index 000000000000..c3d5a21c081e
--- /dev/null
+++ b/arch/arm64/kernel/crash_dump.c
@@ -0,0 +1,71 @@
+/*
+ * Routines for doing kexec-based kdump
+ *
+ * Copyright (C) 2014 Linaro Limited
+ * Author: AKASHI Takahiro <***@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/crash_dump.h>
+#include <linux/errno.h>
+#include <linux/io.h>
+#include <linux/memblock.h>
+#include <linux/uaccess.h>
+#include <asm/memory.h>
+
+/**
+ * copy_oldmem_page() - copy one page from old kernel memory
+ * @pfn: page frame number to be copied
+ * @buf: buffer where the copied page is placed
+ * @csize: number of bytes to copy
+ * @offset: offset in bytes into the page
+ * @userbuf: if set, @buf is in a user address space
+ *
+ * This function copies one page from old kernel memory into the buffer
+ * pointed to by @buf. If @buf is in userspace, set @userbuf to %1. Returns
+ * the number of bytes copied or a negative error in case of failure.
+ */
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
+ size_t csize, unsigned long offset,
+ int userbuf)
+{
+ void *vaddr;
+
+ if (!csize)
+ return 0;
+
+ vaddr = memremap(__pfn_to_phys(pfn), PAGE_SIZE, MEMREMAP_WB);
+ if (!vaddr)
+ return -ENOMEM;
+
+ if (userbuf) {
+ if (copy_to_user((char __user *)buf, vaddr + offset, csize)) {
+ memunmap(vaddr);
+ return -EFAULT;
+ }
+ } else {
+ memcpy(buf, vaddr + offset, csize);
+ }
+
+ memunmap(vaddr);
+
+ return csize;
+}
+
+/**
+ * elfcorehdr_read - read from ELF core header
+ * @buf: buffer where the data is placed
+ * @count: number of bytes to read
+ * @ppos: physical address to read from
+ *
+ * This function reads @count bytes from the ELF core header, which resides
+ * in the crash dump kernel's memory.
+ */
+ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+ memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
+ return count;
+}
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1d62bf71b531..ef8adfde32e7 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -38,6 +38,7 @@
#include <linux/swiotlb.h>
#include <linux/vmalloc.h>
#include <linux/kexec.h>
+#include <linux/crash_dump.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -183,6 +184,57 @@ static void __init reserve_crashkernel(void)
}
#endif /* CONFIG_KEXEC_CORE */

+#ifdef CONFIG_CRASH_DUMP
+static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ const __be32 *reg;
+ int len;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,elfcorehdr", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ elfcorehdr_addr = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ elfcorehdr_size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves the memory occupied by the ELF core header
+ * described by the "linux,elfcorehdr" device-tree property. This region
+ * contains all the information about the primary kernel's core image and
+ * is used by a dump-capture kernel to access the system memory of the
+ * primary kernel.
+ */
+static void __init reserve_elfcorehdr(void)
+{
+ of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);
+
+ if (!elfcorehdr_size)
+ return;
+
+ if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
+ pr_warn("elfcorehdr is overlapped\n");
+ return;
+ }
+
+ memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
+
+ pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
+ elfcorehdr_size >> 10, elfcorehdr_addr);
+}
+#else
+static void __init reserve_elfcorehdr(void)
+{
+ ;
+}
+#endif /* CONFIG_CRASH_DUMP */
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -441,6 +493,8 @@ void __init arm64_memblock_init(void)

reserve_crashkernel();

+ reserve_elfcorehdr();
+
dma_contiguous_reserve(arm64_dma_phys_limit);

memblock_allow_resize();
--
2.11.0
AKASHI Takahiro
2016-12-28 04:36:04 UTC
Permalink
Kdump is enabled by default as kexec is.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/configs/defconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 869dded0f09f..5ce7cd00f4ea 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -79,6 +79,7 @@ CONFIG_CMA=y
CONFIG_SECCOMP=y
CONFIG_XEN=y
CONFIG_KEXEC=y
+CONFIG_CRASH_DUMP=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_COMPAT=y
CONFIG_CPU_IDLE=y
--
2.11.0
AKASHI Takahiro
2016-12-28 04:36:05 UTC
Permalink
Add arch-specific descriptions about kdump usage on arm64 to kdump.txt.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Baoquan He <***@redhat.com>
Acked-by: Dave Young <***@redhat.com>
Acked-by: Catalin Marinas <***@arm.com>
---
Documentation/kdump/kdump.txt | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index b0eb27b956d9..615434d81108 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
a remote system.

Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
-s390x and arm architectures.
+s390x, arm and arm64 architectures.

When the system kernel boots, it reserves a small section of memory for
the dump-capture kernel. This ensures that ongoing Direct Memory Access
@@ -249,6 +249,13 @@ Dump-capture kernel config options (Arch Dependent, arm)

AUTO_ZRELADDR=y

+Dump-capture kernel config options (Arch Dependent, arm64)
+----------------------------------------------------------
+
+- Please note that KVM in the dump-capture kernel will not be enabled
+ on non-VHE systems even if it is configured. This is because the CPU
+ will not be reset to EL2 on panic.
+
Extended crashkernel syntax
===========================

@@ -305,6 +312,8 @@ Boot into System Kernel
kernel will automatically locate the crash kernel image within the
first 512MB of RAM if X is not given.

+ On arm64, use "crashkernel=Y[@X]". Note that the start address of
+ the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).

Load the Dump-capture Kernel
============================
@@ -327,6 +336,8 @@ For s390x:
- Use image or bzImage
For arm:
- Use zImage
+For arm64:
+ - Use vmlinux or Image

If you are using an uncompressed vmlinux image then use the following
command to load the dump-capture kernel.
@@ -370,6 +381,9 @@ For s390x:
For arm:
"1 maxcpus=1 reset_devices"

+For arm64:
+ "1 maxcpus=1 reset_devices"
+
Notes on loading the dump-capture kernel:

* By default, the ELF headers are stored in ELF64 format to support
--
2.11.0
AKASHI Takahiro
2016-12-28 04:37:34 UTC
Permalink
From: James Morse <***@arm.com>

Add documentation for
linux,crashkernel-base and crashkernel-size,
linux,usable-memory-range
linux,elfcorehdr
used by arm64 kdump to describe the kdump reserved area, and
the elfcorehdr's location within it.

Signed-off-by: James Morse <***@arm.com>
[***@linaro.org: added "linux,crashkernel-base" and "-size" ]
Signed-off-by: AKASHI Takahiro <***@linaro.org>
Cc: ***@vger.kernel.org
Cc: Rob Herring <robh+***@kernel.org>
Cc: Mark Rutland <***@arm.com>
---
Documentation/devicetree/bindings/chosen.txt | 50 ++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)

diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt
index 6ae9d82d4c37..7b115165e9ec 100644
--- a/Documentation/devicetree/bindings/chosen.txt
+++ b/Documentation/devicetree/bindings/chosen.txt
@@ -52,3 +52,53 @@ This property is set (currently only on PowerPC, and only needed on
book3e) by some versions of kexec-tools to tell the new kernel that it
is being booted by kexec, as the booting environment may differ (e.g.
a different secondary CPU release mechanism)
+
+linux,crashkernel-base
+linux,crashkernel-size
+----------------------
+
+These properties (currently used on PowerPC and arm64) indicate
+the base address and the size, respectively, of the reserved memory
+range for the crash dump kernel.
+e.g.
+
+/ {
+ chosen {
+ linux,crashkernel-base = <0x9 0xf0000000>;
+ linux,crashkernel-size = <0x0 0x10000000>;
+ };
+};
+
+linux,usable-memory-range
+-------------------------
+
+This property (currently used only on arm64) holds the memory range,
+the base address and the size, which can be used as system ram on
+the *current* kernel. Note that, if this property is present, any memory
+regions under "memory" nodes in DT blob or ones marked as "conventional
+memory" in EFI memory map should be ignored.
+e.g.
+
+/ {
+ chosen {
+ linux,usable-memory-range = <0x9 0xf0000000 0x0 0x10000000>;
+ };
+};
+
+The main usage is for the crash dump kernel to identify its own usable
+memory and exclude, at its boot time, any other memory areas that are
+part of the panicked kernel's memory.
+
+linux,elfcorehdr
+----------------
+
+This property (currently used only on arm64) holds the memory range,
+the address and the size, of the elf core header which mainly describes
+the panicked kernel's memory layout as PT_LOAD segments of elf format.
+e.g.
+
+/ {
+ chosen {
+ linux,elfcorehdr = <0x9 0xfffff000 0x0 0x800>;
+ };
+};
--
2.11.0
Will Deacon
2017-01-10 11:10:49 UTC
Permalink
Hi James, Akashi,
Post by AKASHI Takahiro
Add documentation for
linux,crashkernel-base and crashkernel-size,
linux,usable-memory-range
linux,elfcorehdr
used by arm64 kdump to decribe the kdump reserved area, and
the elfcorehdr's location within it.
---
Documentation/devicetree/bindings/chosen.txt | 50 ++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
I need an ack from Rob or Mark before I can merge this.

Will
Mark Rutland
2017-01-12 15:39:45 UTC
Permalink
Hi,
Post by AKASHI Takahiro
Add documentation for
linux,crashkernel-base and crashkernel-size,
linux,usable-memory-range
linux,elfcorehdr
used by arm64 kdump to decribe the kdump reserved area, and
the elfcorehdr's location within it.
---
Documentation/devicetree/bindings/chosen.txt | 50 ++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt
index 6ae9d82d4c37..7b115165e9ec 100644
--- a/Documentation/devicetree/bindings/chosen.txt
+++ b/Documentation/devicetree/bindings/chosen.txt
@@ -52,3 +52,53 @@ This property is set (currently only on PowerPC, and only needed on
book3e) by some versions of kexec-tools to tell the new kernel that it
is being booted by kexec, as the booting environment may differ (e.g.
a different secondary CPU release mechanism)
+
+linux,crashkernel-base
+linux,crashkernel-size
+----------------------
+
+These properties (currently used on PowerPC and arm64) indicates
+the base address and the size, respectively, of the reserved memory
+range for crash dump kernel.
From this description, it's not clear to me what the (expected)
consumers of this property are, nor what is expected to provide it.

In previous rounds of review, I had assumed that this was used to
describe a preference to the first kernel as to what region of memory
should be used for a subsequent kdump kernel. Looking around, I'm not
sure if I was correct in that assessment.

I see that arch/powerpc seems to consume this property to configure
crashk_res, but it also rewrites it based on crashk_res, presumably for
the benefit of userspace. It's not clear to me how on powerpc the kdump
kernel knows its memory range -- is more DT modification done in the
kernel and/or userspace?

I disagree with modifying this property to expose it to userspace. For
arm64 we should either ensure that /proc/iomem is consistently usable
(and have userspace consistently use it), or we should expose a new file
specifically to expose this information.

Further, I do not think we need this property. It makes more sense to me
for the preference of a region to be described to the *first* kernel
using the command line consistently.

So I think we should drop this property, and not use it on arm64. Please
document this as powerpc only.
Post by AKASHI Takahiro
+e.g.
+
+/ {
+ chosen {
+ linux,crashkernel-base = <0x9 0xf0000000>;
+ linux,crashkernel-size = <0x0 0x10000000>;
+ };
+};
+
+linux,usable-memory-range
+-------------------------
+
+This property (currently used only on arm64) holds the memory range,
+the base address and the size, which can be used as system ram on
+the *current* kernel. Note that, if this property is present, any memory
+regions under "memory" nodes in DT blob or ones marked as "conventional
+memory" in EFI memory map should be ignored.
Could you please replace this with:

This property (arm64 only) holds a base address and size, describing a
limited region in which memory may be considered available for use by
the kernel. Memory outside of this range is not available for use.

This property describes a limitation: memory within this range is only
valid when also described through another mechanism that the kernel
would otherwise use to determine available memory (e.g. memory nodes
or the EFI memory map). Valid memory may be sparse within the range.

To clarify why we need this, given the above comments w.r.t. the
linux,crashkernel-* properties:

* It preserves all the original memory map information (e.g. memory
nodes and/or EFI memory map)

* It works consistently, regardless of how the kdump kernel would
otherwise determine which memory to use (memory nodes, EFI, etc).

* It will be simple and reliable for an in-kernel purgatory to insert,
if we need a kexec_file_load()-based kdump (e.g. without requiring
memory map rewrites, and avoiding clashes with command line
parameters). For a first kernel, this is not as big a concern.
Post by AKASHI Takahiro
+linux,elfcorehdr
+----------------
+
+This property (currently used only on arm64) holds the memory range,
+the address and the size, of the elf core header which mainly describes
+the panicked kernel's memory layout as PT_LOAD segments of elf format.
+e.g.
+
+/ {
+ chosen {
+ linux,elfcorehdr = <0x9 0xfffff000 0x0 0x800>;
+ };
+};
This property looks fine to me.

Thanks,
Mark.
AKASHI Takahiro
2017-01-13 09:13:49 UTC
Permalink
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
Add documentation for
linux,crashkernel-base and crashkernel-size,
linux,usable-memory-range
linux,elfcorehdr
used by arm64 kdump to decribe the kdump reserved area, and
the elfcorehdr's location within it.
---
Documentation/devicetree/bindings/chosen.txt | 50 ++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt
index 6ae9d82d4c37..7b115165e9ec 100644
--- a/Documentation/devicetree/bindings/chosen.txt
+++ b/Documentation/devicetree/bindings/chosen.txt
@@ -52,3 +52,53 @@ This property is set (currently only on PowerPC, and only needed on
book3e) by some versions of kexec-tools to tell the new kernel that it
is being booted by kexec, as the booting environment may differ (e.g.
a different secondary CPU release mechanism)
+
+linux,crashkernel-base
+linux,crashkernel-size
+----------------------
+
+These properties (currently used on PowerPC and arm64) indicates
+the base address and the size, respectively, of the reserved memory
+range for crash dump kernel.
From this description, it's not clear to me what the (expected)
consumers of this property are, nor what is expected to provide it.
In previous rounds of review, I had assumed that this was used to
describe a preference to the first kernel as to what region of memory
should be used for a subsequent kdump kernel. Looking around, I'm not
sure if I was correct in that assessment.
I see that arch/powerpc seems to consume this property to configure
crashk_res, but it also rewrites it based on crashk_res, presumably for
the benefit of userspace. It's not clear to me how on powerpc the kdump
kernel knows its memory range -- is more DT modification done in the
kernel and/or userspace?
I don't believe that powerpc rewrites the property in any way.
As far as I know from *the source code*, powerpc kernel retrieves
the memory range for crash dump kernel from a kernel command line, i.e.
crashkernel=, and then exposes it through DT to userspace (assuming
kexec-tools).
Post by Mark Rutland
I disagree with modifying this property to expose it to userspace. For
Apart from the context of discussions, is this a shared consensus?
Post by Mark Rutland
arm64 we should either ensure that /proc/iomem is consistently usable
(and have userspace consistently use it), or we should expose a new file
specifically to expose this information.
The thing that I had in my mind when adding this property is that
/proc/iomem would be obsolete in the future, then we should have
an alternative in hand.
Post by Mark Rutland
Further, I do not think we need this property. It makes more sense to me
for the preference of a a region to be described to the *first* kernel
using the command line consistently.
So I think we should drop this property, and not use it on arm64. Please
document this as powerpc only.
OK, but if we drop the property from arm64 code, we have no reason
to leave its description in this patch.
(In fact, there are a few more (undocumented) properties that only ppc
uses for kdump.)
Post by Mark Rutland
Post by AKASHI Takahiro
+e.g.
+
+/ {
+ chosen {
+ linux,crashkernel-base = <0x9 0xf0000000>;
+ linux,crashkernel-size = <0x0 0x10000000>;
+ };
+};
+
+linux,usable-memory-range
+-------------------------
+
+This property (currently used only on arm64) holds the memory range,
+the base address and the size, which can be used as system ram on
+the *current* kernel. Note that, if this property is present, any memory
+regions under "memory" nodes in DT blob or ones marked as "conventional
+memory" in EFI memory map should be ignored.
This property (arm64 only) holds a base address and size, describing a
limited region in which memory may be considered available for use by
the kernel. Memory outside of this range is not available for use.
This property describes a limitation: memory within this range is only
valid when also described through another mechanism that the kernel
would otherwise use to determine available memory (e.g. memory nodes
or the EFI memory map). Valid memory may be sparse within the range.
Sure.

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
To clarify why we need this, given by above comments w.r.r. the
* It preserves all the original memory map information (e.g. memory
nodes and/or EFI memory map)
* It works consistently, regardless of how the kdump kernel would
otherwise determine which memory to use (memory nodes, EFI, etc).
* It will be simply and reliable for an in-kernel purgatory to insert,
if we need a kexec_file_load()-based kdump (e.g. without requiring
memory map rewrites, and avoiding clashes with command line
parameters). For a first kernel, this is not as big a concern.
Post by AKASHI Takahiro
+linux,elfcorehdr
+----------------
+
+This property (currently used only on arm64) holds the memory range,
+the address and the size, of the elf core header which mainly describes
+the panicked kernel's memory layout as PT_LOAD segments of elf format.
+e.g.
+
+/ {
+ chosen {
+ linux,elfcorehdr = <0x9 0xfffff000 0x0 0x800>;
+ };
+};
This property looks fine to me.
Thanks,
Mark.
Mark Rutland
2017-01-13 11:17:56 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+linux,crashkernel-base
+linux,crashkernel-size
+----------------------
+
+These properties (currently used on PowerPC and arm64) indicates
+the base address and the size, respectively, of the reserved memory
+range for crash dump kernel.
From this description, it's not clear to me what the (expected)
consumers of this property are, nor what is expected to provide it.
In previous rounds of review, I had assumed that this was used to
describe a preference to the first kernel as to what region of memory
should be used for a subsequent kdump kernel. Looking around, I'm not
sure if I was correct in that assessment.
I see that arch/powerpc seems to consume this property to configure
crashk_res, but it also rewrites it based on crashk_res, presumably for
the benefit of userspace. It's not clear to me how on powerpc the kdump
kernel knows its memory range -- is more DT modification done in the
kernel and/or userspace?
I don't believe that powerpc will rewrite the property any way.
As far as I know from *the source code*, powerpc kernel retrieves
the memory range for crash dump kernel from a kernel command line, i.e.
crashkernel=, and then exposes it through DT to userspace (assuming
kexec-tools).
The rewriting I describe is in export_crashk_values() in
arch/powerpc/kernel/machine_kexec.c, where the code deletes the existing
properties and adds new ones to the DT exposed to userspace.

So I think we're just quibbling over the definition of "rewrite".
Post by AKASHI Takahiro
Post by Mark Rutland
arm64 we should either ensure that /proc/iomem is consistently usable
(and have userspace consistently use it), or we should expose a new file
specifically to expose this information.
The thing that I had in my mind when adding this property is that
/proc/iomem would be obsolete in the future, then we should have
an alternative in hand.
Ok.

My disagreement is with using the DT as a channel to convey information
from the kernel to userspace.

I'm more than happy for a new file or other mechanism to express this
information. For example, we could add
/sys/kernel/kexec_crash_{base,size} or similar.
Post by AKASHI Takahiro
Post by Mark Rutland
Further, I do not think we need this property. It makes more sense to me
for the preference of a a region to be described to the *first* kernel
using the command line consistently.
So I think we should drop this property, and not use it on arm64. Please
document this as powerpc only.
OK, but if we drop the property from arm64 code, we have no reason
to leave its description in this patch.
(In fact, there are a few more (undocumented) properties that only ppc
uses for kdump.)
I'm happy to drop it, then.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+linux,usable-memory-range
+-------------------------
+
+This property (currently used only on arm64) holds the memory range,
+the base address and the size, which can be used as system ram on
+the *current* kernel. Note that, if this property is present, any memory
+regions under "memory" nodes in DT blob or ones marked as "conventional
+memory" in EFI memory map should be ignored.
This property (arm64 only) holds a base address and size, describing a
limited region in which memory may be considered available for use by
the kernel. Memory outside of this range is not available for use.
This property describes a limitation: memory within this range is only
valid when also described through another mechanism that the kernel
would otherwise use to determine available memory (e.g. memory nodes
or the EFI memory map). Valid memory may be sparse within the range.
Sure.
Cheers!

Thanks,
Mark.
AKASHI Takahiro
2017-01-16 08:25:07 UTC
Permalink
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+linux,crashkernel-base
+linux,crashkernel-size
+----------------------
+
+These properties (currently used on PowerPC and arm64) indicates
+the base address and the size, respectively, of the reserved memory
+range for crash dump kernel.
From this description, it's not clear to me what the (expected)
consumers of this property are, nor what is expected to provide it.
In previous rounds of review, I had assumed that this was used to
describe a preference to the first kernel as to what region of memory
should be used for a subsequent kdump kernel. Looking around, I'm not
sure if I was correct in that assessment.
I see that arch/powerpc seems to consume this property to configure
crashk_res, but it also rewrites it based on crashk_res, presumably for
the benefit of userspace. It's not clear to me how on powerpc the kdump
kernel knows its memory range -- is more DT modification done in the
kernel and/or userspace?
I don't believe that powerpc will rewrite the property any way.
As far as I know from *the source code*, powerpc kernel retrieves
the memory range for crash dump kernel from a kernel command line, i.e.
crashkernel=, and then exposes it through DT to userspace (assuming
kexec-tools).
The rewriting I describe is in export_crashk_values() in
arch/powerpc/kernel/machine_kexec.c, where the code deletes existing the
properties, and adds new ones, to the DT exposed to userspace.
So I think we're just quibbling over the definition of "rewrite".
Gotcha
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
arm64 we should either ensure that /proc/iomem is consistently usable
(and have userspace consistently use it), or we should expose a new file
specifically to expose this information.
The thing that I had in my mind when adding this property is that
/proc/iomem would be obsolete in the future, then we should have
an alternative in hand.
Ok.
My disagreement is with using the DT as a channel to convey information
from the kernel to userspace.
I'm more than happy for a new file or other mechanism to express this
information. For example, we could add
/sys/kernel/kexec_crash_{base,size} or similar.
It may make sense because /sys/kernel/kexec_crash_size already exists,
so why not kexec_crash_base?
My concern, however, is that this kind of interface might prevent us from
allowing multiple regions to be reserved for the crash dump kernel in the
future. (There is an assumption that we have only one region, at least on
arm64.)
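To make the interface under discussion concrete, here is a minimal sketch of how a userspace consumer might probe it. Note that /sys/kernel/kexec_crash_size exists in mainline, while kexec_crash_base is only the suggestion made above; the helper name and fallback values are hypothetical:

```shell
# Sketch only: kexec_crash_base is a *proposed* file, not a mainline one.
read_sysfs_or_default() {
    # $1: sysfs file, $2: fallback value used when the file is absent
    if [ -r "$1" ]; then cat "$1"; else echo "$2"; fi
}

size=$(read_sysfs_or_default /sys/kernel/kexec_crash_size 0)
base=$(read_sysfs_or_default /sys/kernel/kexec_crash_base unknown)
echo "crash kernel: base=${base} size=${size}"
```

A consumer written this way degrades gracefully on kernels that only provide the size file.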

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Further, I do not think we need this property. It makes more sense to me
for the preference of a region to be described to the *first* kernel
using the command line consistently.
So I think we should drop this property, and not use it on arm64. Please
document this as powerpc only.
OK, but if we drop the property from arm64 code, we have no reason
to leave its description in this patch.
(In fact, there are a few more (undocumented) properties that only ppc
uses for kdump.)
I'm happy to drop it, then.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+linux,usable-memory-range
+-------------------------
+
+This property (currently used only on arm64) holds the memory range,
+the base address and the size, which can be used as system ram on
+the *current* kernel. Note that, if this property is present, any memory
+regions under "memory" nodes in DT blob or ones marked as "conventional
+memory" in EFI memory map should be ignored.
This property (arm64 only) holds a base address and size, describing a
limited region in which memory may be considered available for use by
the kernel. Memory outside of this range is not available for use.
This property describes a limitation: memory within this range is only
valid when also described through another mechanism that the kernel
would otherwise use to determine available memory (e.g. memory nodes
or the EFI memory map). Valid memory may be sparse within the range.
Sure.
Cheers!
Thanks,
Mark.
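For reference, the linux,usable-memory-range property discussed above would sit in the /chosen node of the DT passed to the crash dump kernel, along the lines of the following sketch (the addresses are invented, assuming #address-cells = <2> and #size-cells = <2>):

```dts
/ {
	chosen {
		/* Illustrative values only: 512MiB of usable memory at 0x60000000 */
		linux,usable-memory-range = <0x0 0x60000000 0x0 0x20000000>;
	};
};
```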
Dave Young
2017-01-17 08:26:29 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+linux,crashkernel-base
+linux,crashkernel-size
+----------------------
+
+These properties (currently used on PowerPC and arm64) indicate
+the base address and the size, respectively, of the reserved memory
+range for the crash dump kernel.
From this description, it's not clear to me what the (expected)
consumers of this property are, nor what is expected to provide it.
In previous rounds of review, I had assumed that this was used to
describe a preference to the first kernel as to what region of memory
should be used for a subsequent kdump kernel. Looking around, I'm not
sure if I was correct in that assessment.
I see that arch/powerpc seems to consume this property to configure
crashk_res, but it also rewrites it based on crashk_res, presumably for
the benefit of userspace. It's not clear to me how on powerpc the kdump
kernel knows its memory range -- is more DT modification done in the
kernel and/or userspace?
I don't believe that powerpc will rewrite the property in any way.
As far as I know from *the source code*, the powerpc kernel retrieves
the memory range for the crash dump kernel from the kernel command line,
i.e. crashkernel=, and then exposes it through DT to userspace (assuming
kexec-tools).
The rewriting I describe is in export_crashk_values() in
arch/powerpc/kernel/machine_kexec.c, where the code deletes the existing
properties and adds new ones in the DT exposed to userspace.
So I think we're just quibbling over the definition of "rewrite".
Gotcha
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
arm64 we should either ensure that /proc/iomem is consistently usable
(and have userspace consistently use it), or we should expose a new file
specifically to expose this information.
The thing that I had in mind when adding this property is that
/proc/iomem might become obsolete in the future, so we should have
an alternative in hand.
Ok.
My disagreement is with using the DT as a channel to convey information
from the kernel to userspace.
I'm more than happy for a new file or other mechanism to express this
information. For example, we could add
/sys/kernel/kexec_crash_{base,size} or similar.
It may make sense because /sys/kernel/kexec_crash_size already exists,
so why not kexec_crash_base?
My concern, however, is that this kind of interface might prevent us from
allowing multiple regions to be reserved for the crash dump kernel in the
future. (There is an assumption that we have only one region, at least on
arm64.)
On x86 there can be two ranges, one for swiotlb below 4G and another
above 4G, but kexec_crash_size only shows the size of the above-4G
range.

It is better to use /proc/iomem; most arches use it. Do you have any
reason to think it will become obsolete? At least for the time being it
is fine.
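For comparison, the /proc/iomem route can be scraped with a one-line filter. A sketch, where the sample input line is made up but follows the "start-end : Crash kernel" format that kexec-tools parses; on a live system you would pipe /proc/iomem in instead:

```shell
# Extract the crash kernel reservation from an iomem-style listing.
parse_crash_range() {
    sed -n 's/^ *\([0-9a-f]*\)-\([0-9a-f]*\) : Crash kernel$/0x\1 0x\2/p'
}

printf '  60000000-7fffffff : Crash kernel\n' | parse_crash_range
# prints: 0x60000000 0x7fffffff
```

Because the filter emits one line per match, this approach would also cope naturally with multiple reserved ranges.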
Post by AKASHI Takahiro
Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Further, I do not think we need this property. It makes more sense to me
for the preference of a region to be described to the *first* kernel
using the command line consistently.
So I think we should drop this property, and not use it on arm64. Please
document this as powerpc only.
OK, but if we drop the property from arm64 code, we have no reason
to leave its description in this patch.
(In fact, there are a few more (undocumented) properties that only ppc
uses for kdump.)
I'm happy to drop it, then.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+linux,usable-memory-range
+-------------------------
+
+This property (currently used only on arm64) holds the memory range,
+the base address and the size, which can be used as system ram on
+the *current* kernel. Note that, if this property is present, any memory
+regions under "memory" nodes in DT blob or ones marked as "conventional
+memory" in EFI memory map should be ignored.
This property (arm64 only) holds a base address and size, describing a
limited region in which memory may be considered available for use by
the kernel. Memory outside of this range is not available for use.
This property describes a limitation: memory within this range is only
valid when also described through another mechanism that the kernel
would otherwise use to determine available memory (e.g. memory nodes
or the EFI memory map). Valid memory may be sparse within the range.
Sure.
Cheers!
Thanks,
Mark.
_______________________________________________
kexec mailing list
http://lists.infradead.org/mailman/listinfo/kexec
AKASHI Takahiro
2017-01-19 09:01:11 UTC
Permalink
Post by Dave Young
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+linux,crashkernel-base
+linux,crashkernel-size
+----------------------
+
+These properties (currently used on PowerPC and arm64) indicate
+the base address and the size, respectively, of the reserved memory
+range for the crash dump kernel.
From this description, it's not clear to me what the (expected)
consumers of this property are, nor what is expected to provide it.
In previous rounds of review, I had assumed that this was used to
describe a preference to the first kernel as to what region of memory
should be used for a subsequent kdump kernel. Looking around, I'm not
sure if I was correct in that assessment.
I see that arch/powerpc seems to consume this property to configure
crashk_res, but it also rewrites it based on crashk_res, presumably for
the benefit of userspace. It's not clear to me how on powerpc the kdump
kernel knows its memory range -- is more DT modification done in the
kernel and/or userspace?
I don't believe that powerpc will rewrite the property in any way.
As far as I know from *the source code*, the powerpc kernel retrieves
the memory range for the crash dump kernel from the kernel command line,
i.e. crashkernel=, and then exposes it through DT to userspace (assuming
kexec-tools).
The rewriting I describe is in export_crashk_values() in
arch/powerpc/kernel/machine_kexec.c, where the code deletes the existing
properties and adds new ones in the DT exposed to userspace.
So I think we're just quibbling over the definition of "rewrite".
Gotcha
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
arm64 we should either ensure that /proc/iomem is consistently usable
(and have userspace consistently use it), or we should expose a new file
specifically to expose this information.
The thing that I had in mind when adding this property is that
/proc/iomem might become obsolete in the future, so we should have
an alternative in hand.
Ok.
My disagreement is with using the DT as a channel to convey information
from the kernel to userspace.
I'm more than happy for a new file or other mechanism to express this
information. For example, we could add
/sys/kernel/kexec_crash_{base,size} or similar.
It may make sense because /sys/kernel/kexec_crash_size already exists,
so why not kexec_crash_base?
My concern, however, is that this kind of interface might prevent us from
allowing multiple regions to be reserved for the crash dump kernel in the
future. (There is an assumption that we have only one region, at least on
arm64.)
On x86 there can be two ranges, one for swiotlb below 4G and another
above 4G, but kexec_crash_size only shows the size of the above-4G
range.
It is better to use /proc/iomem; most arches use it. Do you have any
reason to think it will become obsolete? At least for the time being it
is fine.
I don't know.
I just think I might have seen someone say so somewhere,
and that a more _powerful_ (structured) tool could supersede it :)

-Takahiro AKASHI
Post by Dave Young
Post by AKASHI Takahiro
Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Further, I do not think we need this property. It makes more sense to me
for the preference of a region to be described to the *first* kernel
using the command line consistently.
So I think we should drop this property, and not use it on arm64. Please
document this as powerpc only.
OK, but if we drop the property from arm64 code, we have no reason
to leave its description in this patch.
(In fact, there are a few more (undocumented) properties that only ppc
uses for kdump.)
I'm happy to drop it, then.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+linux,usable-memory-range
+-------------------------
+
+This property (currently used only on arm64) holds the memory range,
+the base address and the size, which can be used as system ram on
+the *current* kernel. Note that, if this property is present, any memory
+regions under "memory" nodes in DT blob or ones marked as "conventional
+memory" in EFI memory map should be ignored.
This property (arm64 only) holds a base address and size, describing a
limited region in which memory may be considered available for use by
the kernel. Memory outside of this range is not available for use.
This property describes a limitation: memory within this range is only
valid when also described through another mechanism that the kernel
would otherwise use to determine available memory (e.g. memory nodes
or the EFI memory map). Valid memory may be sparse within the range.
Sure.
Cheers!
Thanks,
Mark.
Mark Rutland
2017-01-17 11:13:24 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
arm64 we should either ensure that /proc/iomem is consistently usable
(and have userspace consistently use it), or we should expose a new file
specifically to expose this information.
The thing that I had in mind when adding this property is that
/proc/iomem might become obsolete in the future, so we should have
an alternative in hand.
Ok.
My disagreement is with using the DT as a channel to convey information
from the kernel to userspace.
I'm more than happy for a new file or other mechanism to express this
information. For example, we could add
/sys/kernel/kexec_crash_{base,size} or similar.
It may make sense because /sys/kernel/kexec_crash_size already exists,
so why not kexec_crash_base?
My concern, however, is that this kind of interface might prevent us from
allowing multiple regions to be reserved for the crash dump kernel in the
future. (There is an assumption that we have only one region, at least on
arm64.)
Ok.

If we need to handle that, we should also update the description of
linux,usable-memory-range to allow multiple entries (and probably
s/range/ranges).

Thanks,
Mark.
Pratyush Anand
2017-01-06 03:26:30 UTC
Permalink
Post by AKASHI Takahiro
This patch series adds kdump support on arm64.
To load a crash-dump kernel to the systems, a series of patches to
kexec-tools[1] are also needed. Please use the latest one, v4 [2].
https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump
To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
- crash utility (v7.1.7 or later) [3]
(As of today, the very latest tree cannot handle a core image correctly
due to the commit 7fd8329ba502 ("taint/module: Clean up global and module
taint flags handling")
You have my "Tested-by" for this version as well :-). Patches work
great after applying them over v4.10-rc2.

I used crash-utility from below, where there are a few patches queued
for crash-7.1.8:
https://github.com/crash-utility/crash

~Pratyush
AKASHI Takahiro
2017-01-06 04:34:27 UTC
Permalink
Post by AKASHI Takahiro
This patch series adds kdump support on arm64.
To load a crash-dump kernel to the systems, a series of patches to
kexec-tools[1] are also needed. Please use the latest one, v4 [2].
https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump
To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
- crash utility (v7.1.7 or later) [3]
(As of today, the very latest tree cannot handle a core image correctly
due to the commit 7fd8329ba502 ("taint/module: Clean up global and module
taint flags handling")
You have my "Tested-by" for this version as well :-). Patches work great
after applying them over v4.10-rc2.
Great thanks, again.
I used crash-utility from below, where there are a few patches queued
for crash-7.1.8:
https://github.com/crash-utility/crash
Yes, we need some patches for the v4.10 kernel.

-Takahiro AKASHI
~Pratyush