Discussion:
[PATCH v31 00/12] add kdump support
AKASHI Takahiro
2017-02-01 12:42:18 UTC
This patch series adds kdump support on arm64.

To load a crash-dump kernel to the systems, a series of patches to
kexec-tools[1] are also needed. Please use the latest one, v5 [2].
For your convinience, you can pick them up from:
https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump

To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
- crash utility (v7.1.8 or later, i.e. master for now) [3]

I tested this patchset on the fast model and HiKey
(under various configurations, including 4KB-page/39- and 48-bit VA and
64KB-page/42-bit VA with hibernate).

The previous versions were also:
Tested-by: Pratyush Anand <***@redhat.com> (v29, mustang and seattle)
Tested-by: James Morse <***@arm.com> (v27, Juno)

Changes for v31 (Feb 1, 2017)
o add/use remove_pgd_mapping() instead of modifying (__)create_pgd_mapping()
to protect crash dump kernel memory (patches #4 and #5)
o fix an issue with the isolation of crash dump kernel memory in
map_mem()/__map_memblock() by adding map_crashkernel() (patch #5)
o preserve the contents of crash dump kernel memory around hibernation
(patch #6)

Changes for v30 (Jan 24, 2017)
o rebased to Linux-v4.10-rc5
o remove "linux,crashkernel-base/size" from exported device tree
o protect memory region for crash-dump kernel (adding patches #4 and #5)
o remove "in_crash_kexec" variable
o and other trivial changes

Changes for v29 (Dec 28, 2016)
o rebased to Linux-v4.10-rc1
o change asm constraints in crash_setup_regs() per Catalin

Changes for v28 (Nov 22, 2016)
o rebased to Linux-v4.9-rc6
o revamp patch #1 and merge memblock_cap_memory_range() with
memblock_mem_limit_remove_map()

Changes for v27 (Nov 1, 2016)
o rebased to Linux-v4.9-rc3
o revert v26 change, i.e. revive "linux,usable-memory-range" property
(patch #2/#3, updating patch #9)
o minor fixes per review comments (patch #3/#4/#6/#8)
o re-order patches and improve commit messages for readability

Changes for v26 (Sep 7, 2016):
o Use /reserved-memory instead of "linux,usable-memory-range" property
(dropping v25's patches #2 and #3, updating ex-patch #9.)

Changes for v25 (Aug 29, 2016):
o Rebase to Linux-4.8-rc4
o Use memremap() instead of ioremap_cache() [patch#5]

Changes for v24 (Aug 9, 2016):
o Rebase to Linux-4.8-rc1
o Update descriptions about newly added DT properties

Changes for v23 (July 26, 2016):

o Move memblock_reserve() to a single place in reserve_crashkernel()
o Use cpu_park_loop() in ipi_cpu_crash_stop()
o Always enforce ARCH_LOW_ADDRESS_LIMIT on the memory range of the crash kernel
o Re-implement fdt_enforce_memory_region() to remove non-reserved regions
(for ACPI) from the usable memory of the crash kernel

Changes for v22 (July 12, 2016):

o Export "crashkernel-base" and "crashkernel-size" via device-tree,
and add some descriptions about them in chosen.txt
o Rename "usable-memory" to "usable-memory-range" to avoid inconsistency
with powerpc's "usable-memory"
o Make cosmetic changes regarding "ifdef" usage
o Correct some wordings in kdump.txt

Changes for v21 (July 6, 2016):

o Remove kexec patches.
o Rebase to arm64's for-next/core (Linux-4.7-rc4 based).
o Clarify the description about kvm in kdump.txt.

See the link [4] for older changes.


[1] https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
[2] http://lists.infradead.org/pipermail/kexec/2017-January/018002.html
[3] https://github.com/crash-utility/crash.git
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-June/438780.html

AKASHI Takahiro (11):
memblock: add memblock_cap_memory_range()
arm64: limit memory regions based on DT property, usable-memory-range
arm64: kdump: reserve memory for crash dump kernel
arm64: mm: allow for unmapping part of kernel mapping
arm64: kdump: protect crash dump kernel memory
arm64: hibernate: preserve kdump image around hibernation
arm64: kdump: implement machine_crash_shutdown()
arm64: kdump: add VMCOREINFO's for user-space tools
arm64: kdump: provide /proc/vmcore file
arm64: kdump: enable kdump in defconfig
Documentation: kdump: describe arm64 port

James Morse (1):
Documentation: dt: chosen properties for arm64 kdump

Documentation/devicetree/bindings/chosen.txt | 37 ++++++
Documentation/kdump/kdump.txt | 16 ++-
arch/arm64/Kconfig | 11 ++
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 ++++++-
arch/arm64/include/asm/mmu.h | 2 +
arch/arm64/include/asm/smp.h | 2 +
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 ++++++++++++
arch/arm64/kernel/hibernate.c | 13 +++
arch/arm64/kernel/machine_kexec.c | 129 +++++++++++++++++----
arch/arm64/kernel/setup.c | 7 +-
arch/arm64/kernel/smp.c | 63 ++++++++++
arch/arm64/mm/init.c | 150 ++++++++++++++++++++++++
arch/arm64/mm/mmu.c | 165 ++++++++++++++++++++++++++-
include/linux/memblock.h | 1 +
mm/memblock.c | 44 ++++---
18 files changed, 709 insertions(+), 48 deletions(-)
create mode 100644 arch/arm64/kernel/crash_dump.c
--
2.11.0
AKASHI Takahiro
2017-02-01 12:45:39 UTC
Add memblock_cap_memory_range() which will remove all the memblock regions
except the memory range specified in the arguments. In addition, rework is
done on memblock_mem_limit_remove_map() to re-implement it using
memblock_cap_memory_range().

This function, like memblock_mem_limit_remove_map(), will not remove
memblocks with the MEMBLOCK_NOMAP attribute as they may be mapped and
accessed later as "device memory."
See the commit a571d4eb55d8 ("mm/memblock.c: add new infrastructure to
address the mem limit issue").

This function is used, in a succeeding patch of this arm64 kdump support
series, to limit the range of usable memory, or System RAM, on the crash
dump kernel.
(Please note that the "mem=" parameter is of little use for this purpose.)
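
As a rough illustration (the address and size below are purely
hypothetical), a caller that wants to restrict System RAM to a single
window would do something like:

	/* keep only a 512MB window at 0x90000000 as System RAM;
	 * NOMAP regions are left in place */
	memblock_cap_memory_range(0x90000000ULL, SZ_512M);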

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Will Deacon <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
Acked-by: Dennis Chen <***@arm.com>
Cc: linux-***@kvack.org
Cc: Andrew Morton <***@linux-foundation.org>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 44 +++++++++++++++++++++++++++++---------------
2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5b759c9acf97..fbfcacc50c29 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -333,6 +333,7 @@ phys_addr_t memblock_mem_size(unsigned long limit_pfn);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
void memblock_enforce_memory_limit(phys_addr_t memory_limit);
+void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
void memblock_mem_limit_remove_map(phys_addr_t limit);
bool memblock_is_memory(phys_addr_t addr);
int memblock_is_map_memory(phys_addr_t addr);
diff --git a/mm/memblock.c b/mm/memblock.c
index 7608bc305936..fea1688fef60 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1514,11 +1514,37 @@ void __init memblock_enforce_memory_limit(phys_addr_t limit)
(phys_addr_t)ULLONG_MAX);
}

+void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size)
+{
+ int start_rgn, end_rgn;
+ int i, ret;
+
+ if (!size)
+ return;
+
+ ret = memblock_isolate_range(&memblock.memory, base, size,
+ &start_rgn, &end_rgn);
+ if (ret)
+ return;
+
+ /* remove all the MAP regions */
+ for (i = memblock.memory.cnt - 1; i >= end_rgn; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ for (i = start_rgn - 1; i >= 0; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ /* truncate the reserved regions */
+ memblock_remove_range(&memblock.reserved, 0, base);
+ memblock_remove_range(&memblock.reserved,
+ base + size, (phys_addr_t)ULLONG_MAX);
+}
+
void __init memblock_mem_limit_remove_map(phys_addr_t limit)
{
- struct memblock_type *type = &memblock.memory;
phys_addr_t max_addr;
- int i, ret, start_rgn, end_rgn;

if (!limit)
return;
@@ -1529,19 +1555,7 @@ void __init memblock_mem_limit_remove_map(phys_addr_t limit)
if (max_addr == (phys_addr_t)ULLONG_MAX)
return;

- ret = memblock_isolate_range(type, max_addr, (phys_addr_t)ULLONG_MAX,
- &start_rgn, &end_rgn);
- if (ret)
- return;
-
- /* remove all the MAP regions above the limit */
- for (i = end_rgn - 1; i >= start_rgn; i--) {
- if (!memblock_is_nomap(&type->regions[i]))
- memblock_remove_region(type, i);
- }
- /* truncate the reserved regions */
- memblock_remove_range(&memblock.reserved, max_addr,
- (phys_addr_t)ULLONG_MAX);
+ memblock_cap_memory_range(0, max_addr);
}

static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
--
2.11.0
AKASHI Takahiro
2017-02-01 12:46:21 UTC
A crash dump kernel utilizes only a subset of the available memory as
System RAM. On arm64 kdump, this memory range is advertised to the crash
dump kernel via a device-tree property under /chosen:
linux,usable-memory-range = <BASE SIZE>

The crash dump kernel reads this property at boot time and calls
memblock_cap_memory_range() to limit the usable memory ranges, which are
otherwise described as entries in the UEFI memory map table or "memory"
nodes in a device tree blob.
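
For example, assuming two address cells and two size cells, a 512MB
window starting at 0x60000000 would appear under /chosen as (the values
here are illustrative only):

	linux,usable-memory-range = <0x0 0x60000000 0x0 0x20000000>;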

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Geoff Levand <***@infradead.org>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/mm/init.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 380ebe705093..6cddb566eb21 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -187,10 +187,45 @@ static int __init early_mem(char *p)
}
early_param("mem", early_mem);

+static int __init early_init_dt_scan_usablemem(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ struct memblock_region *usablemem = (struct memblock_region *)data;
+ const __be32 *reg;
+ int len;
+
+ usablemem->size = 0;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,usable-memory-range", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+static void __init fdt_enforce_memory_region(void)
+{
+ struct memblock_region reg;
+
+ of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
+
+ if (reg.size)
+ memblock_cap_memory_range(reg.base, reg.size);
+}
+
void __init arm64_memblock_init(void)
{
const s64 linear_region_size = -(s64)PAGE_OFFSET;

+ /* Handle linux,usable-memory-range property */
+ fdt_enforce_memory_region();
+
/*
* Ensure that the linear region takes up exactly half of the kernel
* virtual address space. This way, we can distinguish a linear address
--
2.11.0
Mark Rutland
2017-02-01 15:07:00 UTC
Hi,
Post by AKASHI Takahiro
+static int __init early_init_dt_scan_usablemem(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ struct memblock_region *usablemem = (struct memblock_region *)data;
Nit: unnecessary cast.
Post by AKASHI Takahiro
+ const __be32 *reg;
+ int len;
+
+ usablemem->size = 0;
Could we please lift this assignment/initialisation into the caller...
Post by AKASHI Takahiro
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,usable-memory-range", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+static void __init fdt_enforce_memory_region(void)
+{
+ struct memblock_region reg;
... e.g. have:

struct memblock_region reg = {
.size = 0,
};

That saves us from making unnecessary assignments to the size field, and
makes it clear that reg.size has definitely been initialised, regardless
of what of_scan_flat_dt() happens to do.

With that:

Acked-by: Mark Rutland <***@arm.com>

Thanks,
Mark.
Post by AKASHI Takahiro
+
+ of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
+
+ if (reg.size)
+ memblock_cap_memory_range(reg.base, reg.size);
+}
+
void __init arm64_memblock_init(void)
{
const s64 linear_region_size = -(s64)PAGE_OFFSET;
+ /* Handle linux,usable-memory-range property */
+ fdt_enforce_memory_region();
+
/*
* Ensure that the linear region takes up exactly half of the kernel
* virtual address space. This way, we can distinguish a linear address
--
2.11.0
AKASHI Takahiro
2017-02-02 04:21:36 UTC
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
+static int __init early_init_dt_scan_usablemem(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ struct memblock_region *usablemem = (struct memblock_region *)data;
Nit: unnecessary cast.
OK
Post by Mark Rutland
Post by AKASHI Takahiro
+ const __be32 *reg;
+ int len;
+
+ usablemem->size = 0;
Could we please lift this assignment/initialisation into the caller...
Post by AKASHI Takahiro
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,usable-memory-range", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+static void __init fdt_enforce_memory_region(void)
+{
+ struct memblock_region reg;
struct memblock_region reg = {
.size = 0,
};
Sure
Post by Mark Rutland
That saves us from making unnecessary assignments to the size field, and
makes it clear that reg.size has definitely been initialised, regardless
of what of_scan_flat_dt() happens to do.
Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Post by AKASHI Takahiro
+
+ of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
+
+ if (reg.size)
+ memblock_cap_memory_range(reg.base, reg.size);
+}
+
void __init arm64_memblock_init(void)
{
const s64 linear_region_size = -(s64)PAGE_OFFSET;
+ /* Handle linux,usable-memory-range property */
+ fdt_enforce_memory_region();
+
/*
* Ensure that the linear region takes up exactly half of the kernel
* virtual address space. This way, we can distinguish a linear address
--
2.11.0
AKASHI Takahiro
2017-02-01 12:46:23 UTC
A new function, remove_pgd_mapping(), is added.
It allows us to unmap a specific portion of the kernel mapping later, as
long as the mapping was made using create_pgd_mapping() and we do not try
to free a subset of a memory range within a section mapping.

This function will be used in a later kdump patch to implement protection
against corrupting the crash dump kernel memory that is to be set aside.
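
As a sketch of the intended use (mirroring a later patch in this series;
crashk_res is the resource describing the reserved crash kernel region):

	/* unmap the reserved crash kernel window from the linear mapping */
	remove_pgd_mapping(&init_mm, __phys_to_virt(crashk_res.start),
			   resource_size(&crashk_res));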

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
arch/arm64/include/asm/mmu.h | 2 +
arch/arm64/mm/mmu.c | 130 +++++++++++++++++++++++++++++++++++++++++--
2 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 47619411f0ff..04eb240736d8 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -36,6 +36,8 @@ extern void init_mem_pgprot(void);
extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot, bool page_mappings_only);
+extern void remove_pgd_mapping(struct mm_struct *mm, unsigned long virt,
+ phys_addr_t size);
extern void *fixmap_remap_fdt(phys_addr_t dt_phys);

#endif
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..9d3cea1db3b4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -334,12 +334,10 @@ static void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL, false);
}

-void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
- unsigned long virt, phys_addr_t size,
- pgprot_t prot, bool page_mappings_only)
+void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
+ unsigned long virt, phys_addr_t size,
+ pgprot_t prot, bool page_mappings_only)
{
- BUG_ON(mm == &init_mm);
-
__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
pgd_pgtable_alloc, page_mappings_only);
}
@@ -357,6 +355,128 @@ static void create_mapping_late(phys_addr_t phys, unsigned long virt,
NULL, debug_pagealloc_enabled());
}

+static bool pgtable_is_cleared(void *pgtable)
+{
+ unsigned long *desc, *end;
+
+ for (desc = pgtable, end = (unsigned long *)(pgtable + PAGE_SIZE);
+ desc < end; desc++)
+ if (*desc)
+ return false;
+
+ return true;
+}
+
+static void clear_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
+{
+ pte_t *pte;
+
+ if (WARN_ON(pmd_bad(*pmd)))
+ return;
+
+ pte = pte_set_fixmap_offset(pmd, addr);
+
+ do {
+ pte_clear(NULL, NULL, pte);
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+
+ pte_clear_fixmap();
+}
+
+static void clear_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
+{
+ pmd_t *pmd;
+ unsigned long next;
+
+ if (WARN_ON(pud_bad(*pud)))
+ return;
+
+ pmd = pmd_set_fixmap_offset(pud, addr);
+
+ do {
+ next = pmd_addr_end(addr, end);
+
+ if (pmd_table(*pmd)) {
+ clear_pte_range(pmd, addr, next);
+ if (((next - addr) == PMD_SIZE) ||
+ pgtable_is_cleared(__va(pmd_page_paddr(*pmd)))) {
+ __free_page(pmd_page(*pmd));
+ pmd_clear(pmd);
+ }
+ } else {
+ if (WARN_ON((next - addr) != PMD_SIZE))
+ return;
+ pmd_clear(pmd);
+ }
+ } while (pmd++, addr = next, addr != end);
+
+ pmd_clear_fixmap();
+}
+
+static void clear_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+ pud_t *pud;
+ unsigned long next;
+
+ if (WARN_ON(pgd_bad(*pgd)))
+ return;
+
+ pud = pud_set_fixmap_offset(pgd, addr);
+
+ do {
+ next = pud_addr_end(addr, end);
+
+ if (pud_table(*pud)) {
+ clear_pmd_range(pud, addr, next);
+#if CONFIG_PGTABLE_LEVELS > 2
+ if (((next - addr) == PUD_SIZE) ||
+ pgtable_is_cleared(__va(pud_page_paddr(*pud)))) {
+ __free_page(pud_page(*pud));
+ pud_clear(pud);
+ }
+#endif
+ } else {
+#if CONFIG_PGTABLE_LEVELS > 2
+ if (WARN_ON((next - addr) != PUD_SIZE))
+ return;
+ pud_clear(pud);
+#endif
+ }
+ } while (pud++, addr = next, addr != end);
+
+ pud_clear_fixmap();
+}
+
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long virt,
+ phys_addr_t size)
+{
+ unsigned long addr, length, end, next;
+ pgd_t *pgd = pgd_offset_raw(pgdir, virt);
+
+ addr = virt & PAGE_MASK;
+ length = PAGE_ALIGN(size + (virt & ~PAGE_MASK));
+
+ end = addr + length;
+ do {
+ next = pgd_addr_end(addr, end);
+ clear_pud_range(pgd, addr, next);
+
+#if CONFIG_PGTABLE_LEVELS > 3
+ if (((next - addr) == PGD_SIZE) ||
+ pgtable_is_cleared(__va(pgd_page_paddr(*pgd)))) {
+ __free_page(pgd_page(*pgd));
+ pgd_clear(pgd);
+ }
+#endif
+ } while (pgd++, addr = next, addr != end);
+}
+
+void remove_pgd_mapping(struct mm_struct *mm, unsigned long virt,
+ phys_addr_t size)
+{
+ __remove_pgd_mapping(mm->pgd, virt, size);
+}
+
static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end)
{
unsigned long kernel_start = __pa(_text);
--
2.11.0
Mark Rutland
2017-02-01 16:03:54 UTC
Hi,
Post by AKASHI Takahiro
A new function, remove_pgd_mapping(), is added.
It allows us to unmap a specific portion of kernel mapping later as far as
the mapping is made using create_pgd_mapping() and unless we try to free
a sub-set of memory range within a section mapping.
I'm not keen on adding more page table modification code. It was painful
enough to ensure that those worked in all configurations.

Why can't we reuse create_pgd_mapping()? If we pass page_mappings_only,
and use an invalid prot (i.e. 0), what is the problem?

I can see that we wouldn't free/reallocate the pud or pmd entries, but
the "wasted" memory should be small. If anything, I'd argue that it's
preferable to keep that around so that we don't have to allocate memory
when we need to map the crashkernel region.

Is there another problem I'm missing?
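
For concreteness, the suggestion amounts to something like the sketch
below (the prot value and flags are only illustrative, not a tested
patch):

	/* "unmap" by re-creating the range with an invalid prot;
	 * page mappings only, so pmd/pud/pgd entries stay in place */
	create_pgd_mapping(&init_mm, crashk_res.start,
			   __phys_to_virt(crashk_res.start),
			   resource_size(&crashk_res), __pgprot(0),
			   true);
	flush_tlb_all();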
Post by AKASHI Takahiro
+static bool pgtable_is_cleared(void *pgtable)
+{
+ unsigned long *desc, *end;
+
+ for (desc = pgtable, end = (unsigned long *)(pgtable + PAGE_SIZE);
+ desc < end; desc++)
+ if (*desc)
+ return false;
+
+ return true;
+}
If nothing else, we should be using the normal accessors for this, e.g.
{pte,pmd,pud}_none().

As above, I'd very much like to reuse our existing page table routines
if we can.

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 10:21:32 UTC
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
A new function, remove_pgd_mapping(), is added.
It allows us to unmap a specific portion of kernel mapping later as far as
the mapping is made using create_pgd_mapping() and unless we try to free
a sub-set of memory range within a section mapping.
I'm not keen on adding more page table modification code. It was painful
enough to ensure that those worked in all configurations.
Why can't we reuse create_pgd_mapping()? If we pass page_mappings_only,
and use an invalid prot (i.e. 0), what is the problem?
As I did in v30?
(though my implementation in v30 should be improved.)
Post by Mark Rutland
I can see that we wouldn't free/reallocate the pud or pmd entries, but
the "wasted" memory should be small. If anything, I'd argue that it's
preferable to keep that around so that we don't have to allocate memory
when we need to map the crashkernel region.
Is there another problem I'm missing?
If we don't need to free unused page tables, that would make things
much simpler. There are still some minor problems with the merge, but
we can sort them out.

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Post by AKASHI Takahiro
+static bool pgtable_is_cleared(void *pgtable)
+{
+ unsigned long *desc, *end;
+
+ for (desc = pgtable, end = (unsigned long *)(pgtable + PAGE_SIZE);
+ desc < end; desc++)
+ if (*desc)
+ return false;
+
+ return true;
+}
If nothing else, we should be using the normal accessors for this, e.g.
{pte,pmd,pud}_none().
As above, I'd very much like to reuse our existing page table routines
if we can.
Thanks,
Mark.
Mark Rutland
2017-02-02 11:44:38 UTC
Post by AKASHI Takahiro
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
A new function, remove_pgd_mapping(), is added.
It allows us to unmap a specific portion of kernel mapping later as far as
the mapping is made using create_pgd_mapping() and unless we try to free
a sub-set of memory range within a section mapping.
I'm not keen on adding more page table modification code. It was painful
enough to ensure that those worked in all configurations.
Why can't we reuse create_pgd_mapping()? If we pass page_mappings_only,
and use an invalid prot (i.e. 0), what is the problem?
As I did in v30?
(though my implementation in v30 should be improved.)
Something like that. I wasn't entirely sure why we needed to change
those functions so much, so I'm clearly missing something there. I'll go
have another look.
Post by AKASHI Takahiro
Post by Mark Rutland
I can see that we wouldn't free/reallocate the pud or pmd entries, but
the "wasted" memory should be small. If anything, I'd argue that it's
preferable to keep that around so that we don't have to allocate memory
when we need to map the crashkernel region.
Is there another problem I'm missing?
If we don't need to free unused page tables, that would make things
much simple. There are still some minor problems on the merge, but
we can sort it out.
I'm not sure I follow what you mean by 'on merge' here. Could you
elaborate?

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 14:01:03 UTC
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
A new function, remove_pgd_mapping(), is added.
It allows us to unmap a specific portion of kernel mapping later as far as
the mapping is made using create_pgd_mapping() and unless we try to free
a sub-set of memory range within a section mapping.
I'm not keen on adding more page table modification code. It was painful
enough to ensure that those worked in all configurations.
Why can't we reuse create_pgd_mapping()? If we pass page_mappings_only,
and use an invalid prot (i.e. 0), what is the problem?
As I did in v30?
(though my implementation in v30 should be improved.)
Something like that. I wasn't entirely sure why we needed to change
those functions so much, so I'm clearly missing something there. I'll go
have another look.
It would be much easier if you looked at my new code.
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
I can see that we wouldn't free/reallocate the pud or pmd entries, but
the "wasted" memory should be small. If anything, I'd argue that it's
preferable to keep that around so that we don't have to allocate memory
when we need to map the crashkernel region.
Is there another problem I'm missing?
If we don't need to free unused page tables, that would make things
much simple. There are still some minor problems on the merge, but
we can sort it out.
I'm not sure I follow what you mean by 'on merge' here. Could you
elaborate?
What I had in mind is some changes needed to handle "__prot(0)" properly
in alloc_init_pxx(). For example, p[mu]d_set_huge() doesn't make
a "zeroed" entry.

-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Mark Rutland
2017-02-02 14:35:35 UTC
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
A new function, remove_pgd_mapping(), is added.
It allows us to unmap a specific portion of kernel mapping later as far as
the mapping is made using create_pgd_mapping() and unless we try to free
a sub-set of memory range within a section mapping.
I'm not keen on adding more page table modification code. It was painful
enough to ensure that those worked in all configurations.
Why can't we reuse create_pgd_mapping()? If we pass page_mappings_only,
and use an invalid prot (i.e. 0), what is the problem?
As I did in v30?
(though my implementation in v30 should be improved.)
Something like that. I wasn't entirely sure why we needed to change
those functions so much, so I'm clearly missing something there. I'll go
have another look.
I would be much easier if you see my new code.
Sure. FWIW, I took a look, and I understand why those changes were
necessary.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
If we don't need to free unused page tables, that would make things
much simple. There are still some minor problems on the merge, but
we can sort it out.
I'm not sure I follow what you mean by 'on merge' here. Could you
elaborate?
What I had in mind is some changes needed to handle "__prot(0)" properly
in alloc_init_pxx(). For example, p[mu]d_set_huge() doesn't make
a "zeroed" entry.
I think that if we only allow ourselves to make PTEs invalid, we don't
have to handle that case. If we use page_mappings_only, we should only
check pgattr_change_is_safe() for the pte level, and the {pmd,pud,pgd}
entries shouldn't change.

Is the below sufficient to allow that, or have I missed something?

Thanks,
Mark.

---->8----
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e4..05bf7bf 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -105,6 +105,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new)
return old == 0 || new == 0 || ((old ^ new) & ~mask) == 0;
}

+static bool pte_change_is_valid(pte old, pte new)
+{
+ /*
+ * So long as we subsequently perform TLB invalidation, it is safe to
+ * change a PTE to an invalid, but non-zero value. We only allow this
+ * for PTEs since there's no complicated allocation/free issues to deal
+ * with.
+ *
+ * Otherwise, the usual attribute change rules apply.
+ */
+ if (!pte_valid(old) || !pte_valid(new))
+ return true;
+
+ return pgattr_change_is_safe(pte_val(old), pte_val(new));
+}
+
static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
unsigned long end, unsigned long pfn,
pgprot_t prot,
@@ -143,11 +159,7 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
set_pte(pte, pfn_pte(pfn, __prot));
pfn++;

- /*
- * After the PTE entry has been populated once, we
- * only allow updates to the permission attributes.
- */
- BUG_ON(!pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));
+ BUG_ON(!pte_change_is_valid(old_pte, pte));

} while (pte++, addr += PAGE_SIZE, addr != end);
AKASHI Takahiro
2017-02-02 14:55:54 UTC
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
A new function, remove_pgd_mapping(), is added.
It allows us to unmap a specific portion of kernel mapping later as far as
the mapping is made using create_pgd_mapping() and unless we try to free
a sub-set of memory range within a section mapping.
I'm not keen on adding more page table modification code. It was painful
enough to ensure that those worked in all configurations.
Why can't we reuse create_pgd_mapping()? If we pass page_mappings_only,
and use an invalid prot (i.e. 0), what is the problem?
As I did in v30?
(though my implementation in v30 should be improved.)
Something like that. I wasn't entirely sure why we needed to change
those functions so much, so I'm clearly missing something there. I'll go
have another look.
I would be much easier if you see my new code.
Sure. FWIW, I took a look, and I understand why those changes were
necessary.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
If we don't need to free unused page tables, that would make things
much simple. There are still some minor problems on the merge, but
we can sort it out.
I'm not sure I follow what you mean by 'on merge' here. Could you
elaborate?
What I had in mind is some changes needed to handle "__prot(0)" properly
in alloc_init_pxx(). For example, p[mu]d_set_huge() doesn't make
a "zeroed" entry.
I think that if we only allow ourselves to make PTEs invalid, we don't
have to handle that case. If we use page_mappings_only, we should only
check pgattr_change_is_safe() for the pte level, and the {pmd,pud,pgd}
entries shouldn't change.
Is the below sufficient to allow that, or have I missed something?
I think it will be OK, but will double-check tomorrow.
However, is it acceptable that create_pgd_mapping( __prot(0) ) can
only handle the page-mappings-only case?
That would be fine for kdump, but in general?

-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
---->8----
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e4..05bf7bf 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -105,6 +105,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new)
return old == 0 || new == 0 || ((old ^ new) & ~mask) == 0;
}
+static bool pte_change_is_valid(pte old, pte new)
+{
+ /*
+ * So long as we subsequently perform TLB invalidation, it is safe to
+ * change a PTE to an invalid, but non-zero value. We only allow this
+ * for PTEs since there's no complicated allocation/free issues to deal
+ * with.
+ *
+ * Otherwise, the usual attribute change rules apply.
+ */
+ if (!pte_valid(old) || !pte_valid(new))
+ return true;
+
+ return pgattr_change_is_safe(pte_val(old), pte_val(new));
+}
+
static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
unsigned long end, unsigned long pfn,
pgprot_t prot,
@@ -143,11 +159,7 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
set_pte(pte, pfn_pte(pfn, __prot));
pfn++;
- /*
- * After the PTE entry has been populated once, we
- * only allow updates to the permission attributes.
- */
- BUG_ON(!pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));
+ BUG_ON(!pte_change_is_valid(old_pte, pte));
} while (pte++, addr += PAGE_SIZE, addr != end);
AKASHI Takahiro
2017-02-03 06:13:18 UTC
Mark,
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
A new function, remove_pgd_mapping(), is added.
It allows us to unmap a specific portion of kernel mapping later as far as
the mapping is made using create_pgd_mapping() and unless we try to free
a sub-set of memory range within a section mapping.
I'm not keen on adding more page table modification code. It was painful
enough to ensure that those worked in all configurations.
Why can't we reuse create_pgd_mapping()? If we pass page_mappings_only,
and use an invalid prot (i.e. 0), what is the problem?
As I did in v30?
(though my implementation in v30 should be improved.)
Something like that. I wasn't entirely sure why we needed to change
those functions so much, so I'm clearly missing something there. I'll go
have another look.
I would be much easier if you see my new code.
Sure. FWIW, I took a look, and I understand why those changes were
necessary.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
If we don't need to free unused page tables, that would make things
much simple. There are still some minor problems on the merge, but
we can sort it out.
I'm not sure I follow what you mean by 'on merge' here. Could you
elaborate?
What I had in mind is some changes needed to handle "__prot(0)" properly
in alloc_init_pxx(). For example, p[mu]d_set_huge() doesn't make
a "zeroed" entry.
I think that if we only allow ourselves to make PTEs invalid, we don't
have to handle that case. If we use page_mappings_only, we should only
check pgattr_change_is_safe() for the pte level, and the {pmd,pud,pgd}
entries shouldn't change.
Is the below sufficient to allow that, or have I missed something?
I think it will be OK, but will double-check tomorrow.
However, is is acceptable that create_pgd_mapping( __prot(0) ) can
only handle the cases of page-mapping-only?
That would be fine to kdump, but in general?
My proposed code is attached below.
I think that the changes are quite trivial and it works even if
there is a section mapping, as long as we refrain from reclaiming
unused p[ug]d tables.

(Of course, we can't merely unmap a subset of a section mapping here.)

Thanks,
-Takahiro AKASHI

===8<===
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/mm/mmu.c | 36 +++++++++++++++++++++++++----------
2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
#define PAGE_KERNEL_EXEC_CONT __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID __pgprot(0)

#define PAGE_HYP __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
#define PAGE_HYP_EXEC __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..7f96eabc99d7 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -140,7 +140,11 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
__prot = prot;
}

- set_pte(pte, pfn_pte(pfn, __prot));
+ if (pgprot_val(prot) & PTE_VALID)
+ set_pte(pte, pfn_pte(pfn, __prot));
+ else
+ pte_clear(null, null, pte);
+
pfn++;

/*
@@ -185,7 +189,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,

/* try section mapping first */
if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
- !page_mappings_only) {
+ !page_mappings_only &&
+ (pmd_none(old_pmd) || pmd_sect(old_pmd))) {
/*
* Set the contiguous bit for the subsequent group of
* PMDs if its size and alignment are appropriate.
@@ -256,7 +261,8 @@ static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
/*
* For 4K granule only, attempt to put down a 1GB block
*/
- if (use_1G_block(addr, next, phys) && !page_mappings_only) {
+ if (use_1G_block(addr, next, phys) && !page_mappings_only &&
+ (pud_none(old_pud) || pud_sect(old_pud))) {
pud_set_huge(pud, phys, prot);

/*
@@ -334,12 +340,10 @@ static void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL, false);
}

-void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
- unsigned long virt, phys_addr_t size,
- pgprot_t prot, bool page_mappings_only)
+void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
+ unsigned long virt, phys_addr_t size,
+ pgprot_t prot, bool page_mappings_only)
{
- BUG_ON(mm == &init_mm);
-
__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
pgd_pgtable_alloc, page_mappings_only);
}
@@ -791,14 +795,26 @@ int __init arch_ioremap_pmd_supported(void)
int pud_set_huge(pud_t *pud, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PUD_MASK);
- set_pud(pud, __pud(phys | PUD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+
+ if (pgprot_val(prot) & PTE_VALID)
+ set_pud(pud, __pud(phys | PUD_TYPE_SECT |
+ pgprot_val(mk_sect_prot(prot))));
+ else
+ pud_clear(pud);
+
return 1;
}

int pmd_set_huge(pmd_t *pmd, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PMD_MASK);
- set_pmd(pmd, __pmd(phys | PMD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+
+ if (pgprot_val(prot) & PTE_VALID)
+ set_pmd(pmd, __pmd(phys | PMD_TYPE_SECT |
+ pgprot_val(mk_sect_prot(prot))));
+ else
+ pmd_clear(pmd);
+
return 1;
}
===>8===
Mark Rutland
2017-02-03 14:22:16 UTC
Hi,
Post by AKASHI Takahiro
Post by AKASHI Takahiro
Post by Mark Rutland
I think that if we only allow ourselves to make PTEs invalid, we don't
have to handle that case. If we use page_mappings_only, we should only
check pgattr_change_is_safe() for the pte level, and the {pmd,pud,pgd}
entries shouldn't change.
Is the below sufficient to allow that, or have I missed something?
I think it will be OK, but will double-check tomorrow.
However, is is acceptable that create_pgd_mapping( __prot(0) ) can
only handle the cases of page-mapping-only?
That would be fine to kdump, but in general?
Given we're only going to use this for page mappings, I think it's fine
(and preferable) to restrict it to page mappings for now. Until we need
to do this for the pmd/pud/pgd levels, those won't see any testing, and
it will be very easy for us to accidentally break this.
Post by AKASHI Takahiro
My proposed code is attached below.
I think that the changes are quite trivial and it works even if
there is a section mapping as far as we refrain from reclaiming
unsed p[ug]d tables.
(Of course, we can't merely unmap a subset of a section mapping here.)
Sure. We have a similar restriction when changing permissions, so I
think that's fine.
Post by AKASHI Takahiro
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..7f96eabc99d7 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -140,7 +140,11 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
__prot = prot;
}
- set_pte(pte, pfn_pte(pfn, __prot));
+ if (pgprot_val(prot) & PTE_VALID)
+ set_pte(pte, pfn_pte(pfn, __prot));
+ else
+ pte_clear(null, null, pte);
It took me a moment to figure out how this line could compile. ;)

I'm happy to take this approach in alloc_init_pte(), but as above I'd
prefer that we only handled it there, and left the pmd/pud/pgd code
as-is for now.

We can/should add a comment to make that clear.

Thanks,
Mark.

AKASHI Takahiro
2017-02-01 12:46:24 UTC
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for the crash dump kernel once it has been loaded.

The protection is implemented here by unmapping the region rather than
making it read-only.
To make things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move the copying of kexec's control_code_page to machine_kexec_prepare()

Note that a page-level mapping is also required to allow shrinking the
size of the region, through /sys/kernel/kexec_crash_size, by any multiple
of the page size.
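
Schematically, the expected flow around loading is (a simplified sketch,
not the exact generic kexec code):

	arch_kexec_unprotect_crashkres();  /* map the region (page-level) */
	/* ... kexec_load() copies segments into the reserved region ... */
	arch_kexec_protect_crashkres();    /* unmap it once loading is done */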

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
arch/arm64/kernel/machine_kexec.c | 65 ++++++++++++++++++++++++++-------------
arch/arm64/mm/mmu.c | 35 +++++++++++++++++++++
2 files changed, 79 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..016f2dd693aa 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -14,6 +14,7 @@

#include <asm/cacheflush.h>
#include <asm/cpu_ops.h>
+#include <asm/mmu.h>
#include <asm/mmu_context.h>

#include "cpu-reset.h"
@@ -22,8 +23,6 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;

-static unsigned long kimage_start;
-
/**
* kexec_image_info - For debugging output.
*/
@@ -64,7 +63,7 @@ void machine_kexec_cleanup(struct kimage *kimage)
*/
int machine_kexec_prepare(struct kimage *kimage)
{
- kimage_start = kimage->start;
+ void *reboot_code_buffer;

kexec_image_info(kimage);

@@ -73,6 +72,21 @@ int machine_kexec_prepare(struct kimage *kimage)
return -EBUSY;
}

+ reboot_code_buffer =
+ phys_to_virt(page_to_phys(kimage->control_code_page));
+
+ /*
+ * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
+ * after the kernel is shut down.
+ */
+ memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
+ arm64_relocate_new_kernel_size);
+
+ /* Flush the reboot_code_buffer in preparation for its execution. */
+ __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
+ flush_icache_range((uintptr_t)reboot_code_buffer,
+ arm64_relocate_new_kernel_size);
+
return 0;
}

@@ -143,7 +157,6 @@ static void kexec_segment_flush(const struct kimage *kimage)
void machine_kexec(struct kimage *kimage)
{
phys_addr_t reboot_code_buffer_phys;
- void *reboot_code_buffer;

/*
* New cpus may have become stuck_in_kernel after we loaded the image.
@@ -151,7 +164,6 @@ void machine_kexec(struct kimage *kimage)
BUG_ON(cpus_are_stuck_in_kernel() || (num_online_cpus() > 1));

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
- reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);

kexec_image_info(kimage);

@@ -159,31 +171,17 @@ void machine_kexec(struct kimage *kimage)
kimage->control_code_page);
pr_debug("%s:%d: reboot_code_buffer_phys: %pa\n", __func__, __LINE__,
&reboot_code_buffer_phys);
- pr_debug("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
- reboot_code_buffer);
pr_debug("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
arm64_relocate_new_kernel);
pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
__func__, __LINE__, arm64_relocate_new_kernel_size,
arm64_relocate_new_kernel_size);

- /*
- * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
- * after the kernel is shut down.
- */
- memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
- arm64_relocate_new_kernel_size);
-
- /* Flush the reboot_code_buffer in preparation for its execution. */
- __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
- flush_icache_range((uintptr_t)reboot_code_buffer,
- arm64_relocate_new_kernel_size);
-
/* Flush the kimage list and its buffers. */
kexec_list_flush(kimage);

/* Flush the new image if already in place. */
- if (kimage->head & IND_DONE)
+ if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
kexec_segment_flush(kimage);

pr_info("Bye!\n");
@@ -201,7 +199,7 @@ void machine_kexec(struct kimage *kimage)
*/

cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
- kimage_start, 0);
+ kimage->start, 0);

BUG(); /* Should never get here. */
}
@@ -210,3 +208,28 @@ void machine_crash_shutdown(struct pt_regs *regs)
{
/* Empty routine needed to avoid build errors. */
}
+
+void arch_kexec_protect_crashkres(void)
+{
+ kexec_segment_flush(kexec_crash_image);
+
+ remove_pgd_mapping(&init_mm, __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res));
+
+ flush_tlb_all();
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+ /*
+ * We don't have to make page-level mappings here because
+ * the crash dump kernel memory is not allowed to be shrunk
+ * once the kernel is loaded.
+ */
+ create_pgd_mapping(&init_mm, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL,
+ debug_pagealloc_enabled());
+
+ flush_tlb_all();
+}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9d3cea1db3b4..87861e62316a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,8 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/ioport.h>
+#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
@@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
if (memblock_is_nomap(reg))
continue;

+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * While crash dump kernel memory is contained in a single
+ * memblock for now, it should appear in an isolated mapping
+ * so that we can independently unmap the region later.
+ */
+ if (crashk_res.end &&
+ (start <= crashk_res.start) &&
+ ((crashk_res.end + 1) < end)) {
+ if (crashk_res.start != start)
+ __map_memblock(pgd, start, crashk_res.start);
+
+ if ((crashk_res.end + 1) < end)
+ __map_memblock(pgd, crashk_res.end + 1, end);
+
+ continue;
+ }
+#endif
__map_memblock(pgd, start, end);
}
}
@@ -623,6 +643,21 @@ static void __init map_kernel(pgd_t *pgd)
kasan_copy_shadow(pgd);
}

+#ifdef CONFIG_KEXEC_CORE
+static int __init map_crashkernel(void)
+{
+ /* page-level mapping only to allow for shrinking */
+ if (crashk_res.end)
+ create_pgd_mapping(&init_mm, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL,
+ true);
+
+ return 0;
+}
+subsys_initcall(map_crashkernel);
+#endif
+
/*
* paging_init() sets up the page tables, initialises the zone memory
* maps and sets up the zero page.
--
2.11.0
Mark Rutland
2017-02-01 18:00:08 UTC
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for crash dump kernel once after it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make the things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that page-level mapping is also required to allow for shrinking
the size of memory, through /sys/kernel/kexec_crash_size, by any number
of multiple pages.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.

Surely we need an arch-specific callback to handle that? e.g.

arch_crash_release_region(unsigned long base, unsigned long size)
{
/*
* Ensure the region is part of the linear map before we return
* it to the OS. We won't unmap this again, so we can use block
* mappings.
*/
create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
size, PAGE_KERNEL, false);
}

... which we'd call from crash_shrink_memory() before we freed the
reserved pages.

[...]
Post by AKASHI Takahiro
+void arch_kexec_unprotect_crashkres(void)
+{
+ /*
+ * We don't have to make page-level mappings here because
+ * the crash dump kernel memory is not allowed to be shrunk
+ * once the kernel is loaded.
+ */
+ create_pgd_mapping(&init_mm, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL,
+ debug_pagealloc_enabled());
+
+ flush_tlb_all();
+}
We can lose the flush_tlb_all() here; TLBs aren't allowed to cache an
invalid entry, so there's nothing to remove from the TLBs.

[...]
Post by AKASHI Takahiro
@@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
if (memblock_is_nomap(reg))
continue;
+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * While crash dump kernel memory is contained in a single
+ * memblock for now, it should appear in an isolated mapping
+ * so that we can independently unmap the region later.
+ */
+ if (crashk_res.end &&
+ (start <= crashk_res.start) &&
+ ((crashk_res.end + 1) < end)) {
+ if (crashk_res.start != start)
+ __map_memblock(pgd, start, crashk_res.start);
+
+ if ((crashk_res.end + 1) < end)
+ __map_memblock(pgd, crashk_res.end + 1, end);
+
+ continue;
+ }
+#endif
This wasn't quite what I had in mind. I had expected that here we would
isolate the ranges we wanted to avoid mapping (with a comment as to why
we couldn't move the memblock_isolate_range() calls earlier). In
map_memblock(), we'd skip those ranges entirely.

I believe the above isn't correct if we have a single memblock.memory
region covering both the crashkernel and kernel regions. In that case,
we'd erroneously map the portion which overlaps the kernel.

It seems there are a number of subtle problems here. :/

Thanks,
Mark.
Mark Rutland
2017-02-01 18:25:06 UTC
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for crash dump kernel once after it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make the things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that page-level mapping is also required to allow for shrinking
the size of memory, through /sys/kernel/kexec_crash_size, by any number
of multiple pages.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
Surely we need an arch-specific callback to handle that? e.g.
arch_crash_release_region(unsigned long base, unsigned long size)
{
/*
* Ensure the region is part of the linear map before we return
* it to the OS. We won't unmap this again, so we can use block
* mappings.
*/
create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
size, PAGE_KERNEL, false);
}
... which we'd call from crash_shrink_memory() before we freed the
reserved pages.
Another question is (how) does hyp map this region?

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 10:39:06 UTC
Mark,
Post by Mark Rutland
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for crash dump kernel once after it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make the things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that page-level mapping is also required to allow for shrinking
the size of memory, through /sys/kernel/kexec_crash_size, by any number
of multiple pages.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
Surely we need an arch-specific callback to handle that? e.g.
arch_crash_release_region(unsigned long base, unsigned long size)
{
/*
* Ensure the region is part of the linear map before we return
* it to the OS. We won't unmap this again, so we can use block
* mappings.
*/
create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
size, PAGE_KERNEL, false);
}
... which we'd call from crash_shrink_memory() before we freed the
reserved pages.
Another question is (how) does hyp map this region?
I don't get your point here.
Hyp mode only cares about physical memory at the intermediate physical
address level, doesn't it?

If this is not a matter now, I will post v32 tomorrow :)

-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Mark Rutland
2017-02-02 11:54:25 UTC
Post by AKASHI Takahiro
Mark,
Post by Mark Rutland
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for crash dump kernel once after it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make the things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that page-level mapping is also required to allow for shrinking
the size of memory, through /sys/kernel/kexec_crash_size, by any number
of multiple pages.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
Surely we need an arch-specific callback to handle that? e.g.
arch_crash_release_region(unsigned long base, unsigned long size)
{
/*
* Ensure the region is part of the linear map before we return
* it to the OS. We won't unmap this again, so we can use block
* mappings.
*/
create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
size, PAGE_KERNEL, false);
}
... which we'd call from crash_shrink_memory() before we freed the
reserved pages.
Another question is (how) does hyp map this region?
I don't get your point here.
Hyp mode does care only physical memory in intermediate address, doesn't it?
My concern was that hyp may map the region; and thus buggy code at hyp
can corrupt the region (and/or hyp may conflict w.r.t. attributes).

We might have to ensure hyp doesn't map the crashkernel region, and, to
save us pain, disallow freeing of any part of the region.

I'll dig into this.

Thanks,
Mark.
Post by AKASHI Takahiro
If this is not a matter now, I will post v32 tomorrow :)
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
AKASHI Takahiro
2017-02-03 01:45:39 UTC
Mark,
Post by Mark Rutland
Post by AKASHI Takahiro
Mark,
Post by Mark Rutland
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for crash dump kernel once after it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make the things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that page-level mapping is also required to allow for shrinking
the size of memory, through /sys/kernel/kexec_crash_size, by any number
of multiple pages.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
Surely we need an arch-specific callback to handle that? e.g.
arch_crash_release_region(unsigned long base, unsigned long size)
{
/*
* Ensure the region is part of the linear map before we return
* it to the OS. We won't unmap this again, so we can use block
* mappings.
*/
create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
size, PAGE_KERNEL, false);
}
... which we'd call from crash_shrink_memory() before we freed the
reserved pages.
Another question is (how) does hyp map this region?
I don't get your point here.
Hyp mode does care only physical memory in intermediate address, doesn't it?
My concern was that hyp may map the region; and thus buggy code at hyp
can corrupt the region (and/or hyp may conflict w.r.t. attributes).
Grep'ing create_hyp_mappings() under arch/arm(64)/kvm shows that
we have only a few small regions of memory mapped in hyp mode.
I also confirmed that there is no active mapping for the crash dump kernel
memory by checking the MMU tables with a DS-5 debugger.
Post by Mark Rutland
We mght have to ensure hyp doesn't map the crashkernel region, and to
case us pain, disallow freeing of any part of the region.
So I don't believe we need to worry about such a case.

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
I'll dig into this.
Thanks,
Mark.
Post by AKASHI Takahiro
If this is not a matter now, I will post v32 tomorrow :)
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Mark Rutland
2017-02-03 11:51:59 UTC
Permalink
Hi,
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Another question is (how) does hyp map this region?
I don't get your point here.
Hyp mode only cares about physical memory in terms of intermediate physical addresses, doesn't it?
My concern was that hyp may map the region; and thus buggy code at hyp
can corrupt the region (and/or hyp may conflict w.r.t. attributes).
Grep'ing create_hyp_mappings() under arch/arm(64)/kvm shows that
we have only a few small regions of memory mapped in hyp mode.
I also confirmed that there is no active mapping for crash dump kernel
memory by checking mmu tables with DS-5 debugger.
Ok, that sounds fine.

I was under the mistaken impression that hyp had a copy of the whole
linear mapping, but it seems it lazily maps that in (at page
granularity) as required. So there should be no alias for the
crashkernel region.
Post by AKASHI Takahiro
Post by Mark Rutland
We might have to ensure hyp doesn't map the crashkernel region and, to
cause us pain, disallow freeing of any part of the region.
So I don't believe we need to worry about such a case.
I agree.

Thanks,
Mark.
James Morse
2017-02-02 10:45:58 UTC
Permalink
Hi Akashi, Mark,
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for the crash dump kernel once it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that the page-level mapping is also required to allow for shrinking
the size of the memory, through /sys/kernel/kexec_crash_size, by any
multiple of the page size.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
if (kexec_crash_image) {
ret = -ENOENT;
goto unlock;
}
So it should only be possible to return memory to the allocators when there is
no crash image loaded, so the area is mapped.

What happens when we unload the crash image? It looks like an unload is a call down the same kexec_load() path:
if (flags & KEXEC_ON_CRASH) {
dest_image = &kexec_crash_image;
if (kexec_crash_image)
arch_kexec_unprotect_crashkres();
So we unprotect the region when we unload the kernel, causing it to be remapped.
Provided the load/protect and {load,unload}/unprotect are kept in sync, I think
this is safe.


Given the core code can unload a crash image, this hunk has me worried:
+void arch_kexec_unprotect_crashkres(void)
+{
+ /*
+ * We don't have to make page-level mappings here because
+ * the crash dump kernel memory is not allowed to be shrunk
+ * once the kernel is loaded.
+ */
+ create_pgd_mapping(&init_mm, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL,
+ debug_pagealloc_enabled());


I don't think this is true if the order is: load -> protect, unload ->
unprotect, shrink. The shrink will happen with potentially non-page-size mappings.


Thanks,

James
AKASHI Takahiro
2017-02-02 11:19:31 UTC
Permalink
James,
Post by James Morse
Hi Akashi, Mark,
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for the crash dump kernel once it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that the page-level mapping is also required to allow for shrinking
the size of the memory, through /sys/kernel/kexec_crash_size, by any
multiple of the page size.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
if (kexec_crash_image) {
ret = -ENOENT;
goto unlock;
}
So it should only be possible to return memory to the allocators when there is
no crash image loaded, so the area is mapped.
What happens when we unload the crash image? It looks like an unload is a call down the same kexec_load() path:
Thank you for this heads-up!
I'd almost forgotten about this feature.
Post by James Morse
if (flags & KEXEC_ON_CRASH) {
dest_image = &kexec_crash_image;
if (kexec_crash_image)
arch_kexec_unprotect_crashkres();
So we unprotect the region when we unload the kernel, causing it to be remapped.
Provided the load/protect and {load,unload}/unprotect are kept in sync, I think
this is safe.
+void arch_kexec_unprotect_crashkres(void)
+{
+ /*
+ * We don't have to make page-level mappings here because
+ * the crash dump kernel memory is not allowed to be shrunk
+ * once the kernel is loaded.
+ */
+ create_pgd_mapping(&init_mm, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL,
+ debug_pagealloc_enabled());
I don't think this is true if the order is: load -> protect, unload ->
unprotect, shrink. The shrink will happen with potentially non-page-size mappings.
So we always have to use page mappings here.

-Takahiro AKASHI
Post by James Morse
Thanks,
James
Mark Rutland
2017-02-02 11:48:46 UTC
Permalink
Hi,
Post by James Morse
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for the crash dump kernel once it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that the page-level mapping is also required to allow for shrinking
the size of the memory, through /sys/kernel/kexec_crash_size, by any
multiple of the page size.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
if (kexec_crash_image) {
ret = -ENOENT;
goto unlock;
}
So it should only be possible to return memory to the allocators when there is
no crash image loaded, so the area is mapped.
Ah. That being the case, there should be no problem.

Any part given back to the linear map will be at page granularity, but
that won't result in any functional problem.
Post by James Morse
+void arch_kexec_unprotect_crashkres(void)
+{
+ /*
+ * We don't have to make page-level mappings here because
+ * the crash dump kernel memory is not allowed to be shrunk
+ * once the kernel is loaded.
+ */
+ create_pgd_mapping(&init_mm, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL,
+ debug_pagealloc_enabled());
I don't think this is true if the order is: load -> protect, unload ->
unprotect, shrink. The shrink will happen with potentially non-page-size mappings.
Let's consistently use page mappings for the crashkernel region.
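A minimal sketch of that direction, based on the hunk quoted above with the
page_mappings_only argument forced on and the flush_tlb_all() dropped per the
earlier comment (not the final patch):

void arch_kexec_unprotect_crashkres(void)
{
	/*
	 * Remap the crash dump kernel region with page-level mappings only,
	 * so that a later shrink via /sys/kernel/kexec_crash_size can hand
	 * individual pages back to the allocator regardless of the
	 * load/protect, unload/unprotect history.
	 */
	create_pgd_mapping(&init_mm, crashk_res.start,
			   __phys_to_virt(crashk_res.start),
			   resource_size(&crashk_res), PAGE_KERNEL,
			   true);
}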

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 10:31:30 UTC
Permalink
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for the crash dump kernel once it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that the page-level mapping is also required to allow for shrinking
the size of the memory, through /sys/kernel/kexec_crash_size, by any
multiple of the page size.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
The region is protected only when the crash dump kernel is loaded,
and after that, we are no longer able to shrink the region.
Post by Mark Rutland
Surely we need an arch-specific callback to handle that? e.g.
arch_crash_release_region(unsigned long start, unsigned long size)
{
/*
* Ensure the region is part of the linear map before we return
* it to the OS. We won't unmap this again, so we can use block
* mappings.
*/
create_pgd_mapping(&init_mm, start, __phys_to_virt(start),
size, PAGE_KERNEL, false);
}
... which we'd call from crash_shrink_memory() before we freed the
reserved pages.
All the memory is mapped by my map_crashkernel() at boot time.
Post by Mark Rutland
[...]
Post by AKASHI Takahiro
+void arch_kexec_unprotect_crashkres(void)
+{
+ /*
+ * We don't have to make page-level mappings here because
+ * the crash dump kernel memory is not allowed to be shrunk
+ * once the kernel is loaded.
+ */
+ create_pgd_mapping(&init_mm, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL,
+ debug_pagealloc_enabled());
+
+ flush_tlb_all();
+}
We can lose the flush_tlb_all() here; TLBs aren't allowed to cache an
invalid entry, so there's nothing to remove from the TLBs.
Ah, yes!
Post by Mark Rutland
[...]
Post by AKASHI Takahiro
@@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
if (memblock_is_nomap(reg))
continue;
+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * While crash dump kernel memory is contained in a single
+ * memblock for now, it should appear in an isolated mapping
+ * so that we can independently unmap the region later.
+ */
+ if (crashk_res.end &&
+ (start <= crashk_res.start) &&
+ ((crashk_res.end + 1) < end)) {
+ if (crashk_res.start != start)
+ __map_memblock(pgd, start, crashk_res.start);
+
+ if ((crashk_res.end + 1) < end)
+ __map_memblock(pgd, crashk_res.end + 1, end);
+
+ continue;
+ }
+#endif
This wasn't quite what I had in mind. I had expected that here we would
isolate the ranges we wanted to avoid mapping (with a comment as to why
we couldn't move the memblock_isolate_range() calls earlier). In
map_memblock(), we'd skip those ranges entirely.
I believe the above isn't correct if we have a single memblock.memory
region covering both the crashkernel and kernel regions. In that case,
we'd erroneously map the portion which overlaps the kernel.
It seems there are a number of subtle problems here. :/
I didn't see any problems, but I will go back to using memblock_isolate_range()
here in map_mem().

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Mark Rutland
2017-02-02 11:16:37 UTC
Permalink
Hi,
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for the crash dump kernel once it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that the page-level mapping is also required to allow for shrinking
the size of the memory, through /sys/kernel/kexec_crash_size, by any
multiple of the page size.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
The region is protected only when the crash dump kernel is loaded,
and after that, we are no longer able to shrink the region.
Ah, sorry. My misunderstanding strikes again. That should be fine; sorry
for the noise, and thanks for explaining.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
@@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
if (memblock_is_nomap(reg))
continue;
+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * While crash dump kernel memory is contained in a single
+ * memblock for now, it should appear in an isolated mapping
+ * so that we can independently unmap the region later.
+ */
+ if (crashk_res.end &&
+ (start <= crashk_res.start) &&
+ ((crashk_res.end + 1) < end)) {
+ if (crashk_res.start != start)
+ __map_memblock(pgd, start, crashk_res.start);
+
+ if ((crashk_res.end + 1) < end)
+ __map_memblock(pgd, crashk_res.end + 1, end);
+
+ continue;
+ }
+#endif
This wasn't quite what I had in mind. I had expected that here we would
isolate the ranges we wanted to avoid mapping (with a comment as to why
we couldn't move the memblock_isolate_range() calls earlier). In
map_memblock(), we'd skip those ranges entirely.
I believe the above isn't correct if we have a single memblock.memory
region covering both the crashkernel and kernel regions. In that case,
we'd erroneously map the portion which overlaps the kernel.
It seems there are a number of subtle problems here. :/
I didn't see any problems, but I will go back to using memblock_isolate_range()
here in map_mem().
Imagine we have physical memory:

single RAM bank: |---------------------------------------------------|
kernel image: |---|
crashkernel: |------|

... we reserve the image and crashkernel region, but these would still
remain part of the memory memblock, and we'd have a memblock layout
like:

memblock.memory: |---------------------------------------------------|
memblock.reserved: |---| |------|

... in map_mem() we iterate over memblock.memory, so we only have a
single entry to handle in this case. With the code above, we'd find that
it overlaps the crashk_res, and we'd map the parts which don't overlap,
e.g.

memblock.memory: |---------------------------------------------------|
crashkernel: |------|
mapped regions: |-----------------------------| |------------|

... however, this means we've mapped the portion which overlaps with the
kernel's linear alias (i.e. the case that we try to handle in
__map_memblock()). What we actually wanted was:

memblock.memory: |---------------------------------------------------|
kernel image: |---|
crashkernel: |------|
mapped regions: |------| |----------------| |------------|



To handle all cases I think we have to isolate *both* the image and
crashkernel in map_mem(). That would leave us with:

memblock.memory: |------||---||----------------||------||------------|
memblock.reserved: |---| |------|

... so then we can check for overlap with either the kernel or
crashkernel in __map_memblock(), and return early, e.g.

__map_memblock(...)
if (overlaps_with_kernel(...))
return;
if (overlaps_with_crashkernel(...))
return;

__create_pgd_mapping(...);
}

We can pull the kernel alias mapping out of __map_memblock() and put it
at the end of map_mem().

Does that make sense?

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 14:36:04 UTC
Permalink
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called around kexec_load() in order to protect
the memory allocated for the crash dump kernel once it's loaded.
The protection is implemented here by unmapping the region rather than
making it read-only.
To make things work correctly, we also have to
- put the region in an isolated, page-level mapping initially, and
- move copying kexec's control_code_page to machine_kexec_prepare()
Note that the page-level mapping is also required to allow for shrinking
the size of the memory, through /sys/kernel/kexec_crash_size, by any
multiple of the page size.
Looking at kexec_crash_size_store(), I don't see where memory returned
to the OS is mapped. AFAICT, if the region is protected when the user
shrinks the region, the memory will not be mapped, yet handed over to
the kernel for general allocation.
The region is protected only when the crash dump kernel is loaded,
and after that, we are no longer able to shrink the region.
Ah, sorry. My misunderstanding strikes again. That should be fine; sorry
for the noise, and thanks for explaining.
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
@@ -538,6 +540,24 @@ static void __init map_mem(pgd_t *pgd)
if (memblock_is_nomap(reg))
continue;
+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * While crash dump kernel memory is contained in a single
+ * memblock for now, it should appear in an isolated mapping
+ * so that we can independently unmap the region later.
+ */
+ if (crashk_res.end &&
+ (start <= crashk_res.start) &&
+ ((crashk_res.end + 1) < end)) {
+ if (crashk_res.start != start)
+ __map_memblock(pgd, start, crashk_res.start);
+
+ if ((crashk_res.end + 1) < end)
+ __map_memblock(pgd, crashk_res.end + 1, end);
+
+ continue;
+ }
+#endif
This wasn't quite what I had in mind. I had expected that here we would
isolate the ranges we wanted to avoid mapping (with a comment as to why
we couldn't move the memblock_isolate_range() calls earlier). In
map_memblock(), we'd skip those ranges entirely.
I believe the above isn't correct if we have a single memblock.memory
region covering both the crashkernel and kernel regions. In that case,
we'd erroneously map the portion which overlaps the kernel.
It seems there are a number of subtle problems here. :/
I didn't see any problems, but I will go back to using memblock_isolate_range()
here in map_mem().
single RAM bank: |---------------------------------------------------|
kernel image: |---|
crashkernel: |------|
... we reserve the image and crashkernel region, but these would still
remain part of the memory memblock, and we'd have a memblock layout
memblock.memory: |---------------------------------------------------|
memblock.reserved: |---| |------|
... in map_mem() we iterate over memblock.memory, so we only have a
single entry to handle in this case. With the code above, we'd find that
it overlaps the crashk_res, and we'd map the parts which don't overlap,
e.g.
memblock.memory: |---------------------------------------------------|
crashkernel: |------|
mapped regions: |-----------------------------| |------------|
I'm afraid that you might be talking about my v30.
The code in v31 was modified a bit, and now
Post by Mark Rutland
... however, this means we've mapped the portion which overlaps with the
kernel's linear alias (i.e. the case that we try to handle in
memblock.memory: |---------------------------------------------------|
kernel image: |---|
crashkernel: |------|
|-----------(A)---------------| |----(B)-----|

__map_memblock() is called against each of (A) and (B),
so I think we will get
Post by Mark Rutland
mapped regions: |------| |----------------| |------------|
this mapping.
Post by Mark Rutland
To handle all cases I think we have to isolate *both* the image and
memblock.memory: |------||---||----------------||------||------------|
memblock.reserved: |---| |------|
... so then we can check for overlap with either the kernel or
crashkernel in __map_memblock(), and return early, e.g.
__map_memblock(...)
if (overlaps_with_kernel(...))
return;
if (overlaps_with_crashkernel(...))
return;
__create_pgd_mapping(...);
}
We can pull the kernel alias mapping out of __map_memblock() and put it
at the end of map_mem().
Does that make sense?
OK, I now understand what you were expecting.

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Mark Rutland
2017-02-02 15:36:13 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
single RAM bank: |---------------------------------------------------|
kernel image: |---|
crashkernel: |------|
... we reserve the image and crashkernel region, but these would still
remain part of the memory memblock, and we'd have a memblock layout
memblock.memory: |---------------------------------------------------|
memblock.reserved: |---| |------|
... in map_mem() we iterate over memblock.memory, so we only have a
single entry to handle in this case. With the code above, we'd find that
it overlaps the crashk_res, and we'd map the parts which don't overlap,
e.g.
memblock.memory: |---------------------------------------------------|
crashkernel: |------|
mapped regions: |-----------------------------| |------------|
I'm afraid that you might be talking about my v30.
The code in v31 was modified a bit, and now
Post by Mark Rutland
... however, this means we've mapped the portion which overlaps with the
kernel's linear alias (i.e. the case that we try to handle in
memblock.memory: |---------------------------------------------------|
kernel image: |---|
crashkernel: |------|
|-----------(A)---------------| |----(B)-----|
__map_memblock() is called against each of (A) and (B),
so I think we will get
Post by Mark Rutland
mapped regions: |------| |----------------| |------------|
this mapping.
Ah, I see. You are correct; this will work.

Clearly I needed more coffee before considering this. :/
Post by AKASHI Takahiro
Post by Mark Rutland
To handle all cases I think we have to isolate *both* the image and
memblock.memory: |------||---||----------------||------||------------|
memblock.reserved: |---| |------|
... so then we can check for overlap with either the kernel or
crashkernel in __map_memblock(), and return early, e.g.
__map_memblock(...)
if (overlaps_with_kernel(...))
return;
if (overlaps_with_crashkernel(...))
return;
__create_pgd_mapping(...);
}
We can pull the kernel alias mapping out of __map_memblock() and put it
at the end of map_mem().
Does that make sense?
OK, I now understand what you were expecting.
Thinking about it, we can do this consistently in one place if we use nomap
(temporarily), which would allow us to simplify the existing code. We'd have to
introduce memblock_clear_nomap(), but that's less of an internal detail than
memblock_isolate_range(), so that seems reasonable.

I think we can handle this like:

/*
* Create a linear mapping for all memblock memory regions which are mappable.
*/
static void __init __map_mem(pgd_t *pgd)
{
struct memblock_region *reg;

/* map all the memory banks */
for_each_memblock(memory, reg) {
phys_addr_t start = reg->base;
phys_addr_t end = start + reg->size;

if (start >= end)
break;
if (memblock_is_nomap(reg))
continue;

__create_pgd_mapping(pgd, start, __phys_to_virt(start),
end - start, PAGE_KERNEL,
early_pgtable_alloc,
debug_pagealloc_enabled());
}
}

/*
* Create a linear mapping for mappable memory, handling regions which have
* special mapping requirements.
*/
static void __init map_mem(pgd_t *pgd)
{
unsigned long kernel_start = __pa(_text);
unsigned long kernel_end = __pa(__init_begin);

memblock_mark_nomap(kernel_start, kernel_end);

// mark crashkernel nomap here

__map_mem(pgd);

/*
* Map the linear alias of the [_text, __init_begin) interval as
* read-only/non-executable. This makes the contents of the
* region accessible to subsystems such as hibernate, but
* protects it from inadvertent modification or execution.
*/
__create_pgd_mapping(pgd, kernel_start, __phys_to_virt(kernel_start),
kernel_end - kernel_start, PAGE_KERNEL_RO,
early_pgtable_alloc, debug_pagealloc_enabled());
memblock_clear_nomap(kernel_start, kernel_end);

// map crashkernel here
// clear nomap for crashkernel
}
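(memblock_clear_nomap() isn't in the tree yet; presumably it would just be the
mirror image of memblock_mark_nomap(), reusing the internal
memblock_setclr_flag() helper, e.g.:

int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
{
	/* Counterpart of memblock_mark_nomap(): clear MEMBLOCK_NOMAP. */
	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
}

plus a matching declaration in include/linux/memblock.h.)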

Thoughts?

Thanks,
Mark.
AKASHI Takahiro
2017-02-01 12:46:25 UTC
Permalink
Since arch_kexec_protect_crashkres() removes the mapping for the crash
dump kernel memory, the loaded contents won't be preserved across
hibernation.

In this patch, arch_kexec_(un)protect_crashkres() are additionally called
before/after hibernation so that the relevant region is mapped again
and restored just as the other memory regions are.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
arch/arm64/kernel/hibernate.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index fe301cbcb442..111a849333ee 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -16,6 +16,7 @@
*/
#define pr_fmt(x) "hibernate: " x
#include <linux/cpu.h>
+#include <linux/kexec.h>
#include <linux/kvm_host.h>
#include <linux/mm.h>
#include <linux/pm.h>
@@ -289,6 +290,12 @@ int swsusp_arch_suspend(void)
local_dbg_save(flags);

if (__cpu_suspend_enter(&state)) {
+#ifdef CONFIG_KEXEC_CORE
+ /* make the crash dump kernel region mapped */
+ if (kexec_crash_image)
+ arch_kexec_unprotect_crashkres();
+#endif
+
sleep_cpu = smp_processor_id();
ret = swsusp_save();
} else {
@@ -300,6 +307,12 @@ int swsusp_arch_suspend(void)
if (el2_reset_needed())
dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end);

+#ifdef CONFIG_KEXEC_CORE
+ /* make the crash dump kernel region unmapped */
+ if (kexec_crash_image)
+ arch_kexec_protect_crashkres();
+#endif
+
/*
* Tell the hibernation core that we've just restored
* the memory
--
2.11.0
AKASHI Takahiro
2017-02-01 12:46:26 UTC
Permalink
The primary kernel calls machine_crash_shutdown() to shut down non-boot
cpus and save register state in per-cpu ELF notes before starting the
crash dump kernel. See kernel_kexec().
Even if not all secondary cpus have shut down, we do kdump anyway.

As we don't have to take the non-boot (crashed) cpus offline (so as to
preserve their correct state for the crash dump) before shutting down,
this patch also adds a variant of smp_send_stop().

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 +++++++++++++++++++++++++-
arch/arm64/include/asm/smp.h | 2 ++
arch/arm64/kernel/machine_kexec.c | 55 +++++++++++++++++++++++++++++++---
arch/arm64/kernel/smp.c | 63 +++++++++++++++++++++++++++++++++++++++
5 files changed, 158 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/hardirq.h b/arch/arm64/include/asm/hardirq.h
index 8740297dac77..1473fc2f7ab7 100644
--- a/arch/arm64/include/asm/hardirq.h
+++ b/arch/arm64/include/asm/hardirq.h
@@ -20,7 +20,7 @@
#include <linux/threads.h>
#include <asm/irq.h>

-#define NR_IPI 6
+#define NR_IPI 7

typedef struct {
unsigned int __softirq_pending;
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 04744dc5fb61..f40ace1fa21a 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -40,7 +40,47 @@
static inline void crash_setup_regs(struct pt_regs *newregs,
struct pt_regs *oldregs)
{
- /* Empty routine needed to avoid build errors. */
+ if (oldregs) {
+ memcpy(newregs, oldregs, sizeof(*newregs));
+ } else {
+ u64 tmp1, tmp2;
+
+ __asm__ __volatile__ (
+ "stp x0, x1, [%2, #16 * 0]\n"
+ "stp x2, x3, [%2, #16 * 1]\n"
+ "stp x4, x5, [%2, #16 * 2]\n"
+ "stp x6, x7, [%2, #16 * 3]\n"
+ "stp x8, x9, [%2, #16 * 4]\n"
+ "stp x10, x11, [%2, #16 * 5]\n"
+ "stp x12, x13, [%2, #16 * 6]\n"
+ "stp x14, x15, [%2, #16 * 7]\n"
+ "stp x16, x17, [%2, #16 * 8]\n"
+ "stp x18, x19, [%2, #16 * 9]\n"
+ "stp x20, x21, [%2, #16 * 10]\n"
+ "stp x22, x23, [%2, #16 * 11]\n"
+ "stp x24, x25, [%2, #16 * 12]\n"
+ "stp x26, x27, [%2, #16 * 13]\n"
+ "stp x28, x29, [%2, #16 * 14]\n"
+ "mov %0, sp\n"
+ "stp x30, %0, [%2, #16 * 15]\n"
+
+ "/* faked current PSTATE */\n"
+ "mrs %0, CurrentEL\n"
+ "mrs %1, SPSEL\n"
+ "orr %0, %0, %1\n"
+ "mrs %1, DAIF\n"
+ "orr %0, %0, %1\n"
+ "mrs %1, NZCV\n"
+ "orr %0, %0, %1\n"
+ /* pc */
+ "adr %1, 1f\n"
+ "1:\n"
+ "stp %1, %0, [%2, #16 * 16]\n"
+ : "=&r" (tmp1), "=&r" (tmp2)
+ : "r" (newregs)
+ : "memory"
+ );
+ }
}

#endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index d050d720a1b4..cea009f2657d 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -148,6 +148,8 @@ static inline void cpu_panic_kernel(void)
*/
bool cpus_are_stuck_in_kernel(void);

+extern void smp_send_crash_stop(void);
+
#endif /* ifndef __ASSEMBLY__ */

#endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 016f2dd693aa..c61dd7b7dca0 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -9,6 +9,9 @@
* published by the Free Software Foundation.
*/

+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
#include <linux/kexec.h>
#include <linux/smp.h>

@@ -161,7 +164,8 @@ void machine_kexec(struct kimage *kimage)
/*
* New cpus may have become stuck_in_kernel after we loaded the image.
*/
- BUG_ON(cpus_are_stuck_in_kernel() || (num_online_cpus() > 1));
+ BUG_ON((cpus_are_stuck_in_kernel() || (num_online_cpus() > 1)) &&
+ !WARN_ON(kimage == kexec_crash_image));

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);

@@ -198,15 +202,58 @@ void machine_kexec(struct kimage *kimage)
* relocation is complete.
*/

- cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
- kimage->start, 0);
+ cpu_soft_restart(kimage != kexec_crash_image,
+ reboot_code_buffer_phys, kimage->head, kimage->start, 0);

BUG(); /* Should never get here. */
}

+static void machine_kexec_mask_interrupts(void)
+{
+ unsigned int i;
+ struct irq_desc *desc;
+
+ for_each_irq_desc(i, desc) {
+ struct irq_chip *chip;
+ int ret;
+
+ chip = irq_desc_get_chip(desc);
+ if (!chip)
+ continue;
+
+ /*
+ * First try to remove the active state. If this
+ * fails, try to EOI the interrupt.
+ */
+ ret = irq_set_irqchip_state(i, IRQCHIP_STATE_ACTIVE, false);
+
+ if (ret && irqd_irq_inprogress(&desc->irq_data) &&
+ chip->irq_eoi)
+ chip->irq_eoi(&desc->irq_data);
+
+ if (chip->irq_mask)
+ chip->irq_mask(&desc->irq_data);
+
+ if (chip->irq_disable && !irqd_irq_disabled(&desc->irq_data))
+ chip->irq_disable(&desc->irq_data);
+ }
+}
+
+/**
+ * machine_crash_shutdown - shutdown non-crashing cpus and save registers
+ */
void machine_crash_shutdown(struct pt_regs *regs)
{
- /* Empty routine needed to avoid build errors. */
+ local_irq_disable();
+
+ /* shutdown non-crashing cpus */
+ smp_send_crash_stop();
+
+ /* for crashing cpu */
+ crash_save_cpu(regs, smp_processor_id());
+ machine_kexec_mask_interrupts();
+
+ pr_info("Starting crashdump kernel...\n");
}

void arch_kexec_protect_crashkres(void)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index cb87234cfcf2..446c6d48f8ec 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
#include <linux/completion.h>
#include <linux/of.h>
#include <linux/irq_work.h>
+#include <linux/kexec.h>

#include <asm/alternative.h>
#include <asm/atomic.h>
@@ -74,6 +75,7 @@ enum ipi_msg_type {
IPI_RESCHEDULE,
IPI_CALL_FUNC,
IPI_CPU_STOP,
+ IPI_CPU_CRASH_STOP,
IPI_TIMER,
IPI_IRQ_WORK,
IPI_WAKEUP
@@ -753,6 +755,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
S(IPI_RESCHEDULE, "Rescheduling interrupts"),
S(IPI_CALL_FUNC, "Function call interrupts"),
S(IPI_CPU_STOP, "CPU stop interrupts"),
+ S(IPI_CPU_CRASH_STOP, "CPU stop (for crash dump) interrupts"),
S(IPI_TIMER, "Timer broadcast interrupts"),
S(IPI_IRQ_WORK, "IRQ work interrupts"),
S(IPI_WAKEUP, "CPU wake-up interrupts"),
@@ -827,6 +830,29 @@ static void ipi_cpu_stop(unsigned int cpu)
cpu_relax();
}

+#ifdef CONFIG_KEXEC_CORE
+static atomic_t waiting_for_crash_ipi;
+#endif
+
+static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
+{
+#ifdef CONFIG_KEXEC_CORE
+ crash_save_cpu(regs, cpu);
+
+ atomic_dec(&waiting_for_crash_ipi);
+
+ local_irq_disable();
+
+#ifdef CONFIG_HOTPLUG_CPU
+ if (cpu_ops[cpu]->cpu_die)
+ cpu_ops[cpu]->cpu_die(cpu);
+#endif
+
+ /* just in case */
+ cpu_park_loop();
+#endif
+}
+
/*
* Main handler for inter-processor interrupts
*/
@@ -857,6 +883,15 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
irq_exit();
break;

+ case IPI_CPU_CRASH_STOP:
+ if (IS_ENABLED(CONFIG_KEXEC_CORE)) {
+ irq_enter();
+ ipi_cpu_crash_stop(cpu, regs);
+
+ unreachable();
+ }
+ break;
+
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
case IPI_TIMER:
irq_enter();
@@ -929,6 +964,34 @@ void smp_send_stop(void)
cpumask_pr_args(cpu_online_mask));
}

+#ifdef CONFIG_KEXEC_CORE
+void smp_send_crash_stop(void)
+{
+ cpumask_t mask;
+ unsigned long timeout;
+
+ if (num_online_cpus() == 1)
+ return;
+
+ cpumask_copy(&mask, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &mask);
+
+ atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
+
+ pr_crit("SMP: stopping secondary CPUs\n");
+ smp_cross_call(&mask, IPI_CPU_CRASH_STOP);
+
+ /* Wait up to one second for other CPUs to stop */
+ timeout = USEC_PER_SEC;
+ while ((atomic_read(&waiting_for_crash_ipi) > 0) && timeout--)
+ udelay(1);
+
+ if (atomic_read(&waiting_for_crash_ipi) > 0)
+ pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
+ cpumask_pr_args(cpu_online_mask));
+}
+#endif
+
/*
* not supported here
*/
--
2.11.0
AKASHI Takahiro
2017-02-01 12:46:27 UTC
Permalink
In addition to the common VMCOREINFO entries defined in
crash_save_vmcoreinfo_init(), the crash utility needs to know
- kimage_voffset
- PHYS_OFFSET
to examine the contents of a dump file (/proc/vmcore) correctly
due to the introduction of KASLR (CONFIG_RANDOMIZE_BASE) in v4.6.

- VA_BITS
is also required for the makedumpfile command.

arch_crash_save_vmcoreinfo() appends them to the dump file.
More VMCOREINFO entries may be added later.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/kernel/machine_kexec.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index c61dd7b7dca0..1fcd5483c1bb 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -17,6 +17,7 @@

#include <asm/cacheflush.h>
#include <asm/cpu_ops.h>
+#include <asm/memory.h>
#include <asm/mmu.h>
#include <asm/mmu_context.h>

@@ -280,3 +281,13 @@ void arch_kexec_unprotect_crashkres(void)

flush_tlb_all();
}
+
+void arch_crash_save_vmcoreinfo(void)
+{
+ VMCOREINFO_NUMBER(VA_BITS);
+ /* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
+ vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
+ kimage_voffset);
+ vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
+ PHYS_OFFSET);
+}
--
2.11.0
AKASHI Takahiro
2017-02-01 12:46:22 UTC
Permalink
"crashkernel=" kernel parameter specifies the size (and optionally
the start address) of the system ram used by crash dump kernel.
reserve_crashkernel() will allocate and reserve the memory at the startup
of primary kernel.

This memory range will be exported to userspace via an entry named
"Crash kernel" in /proc/iomem.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Signed-off-by: Mark Salter <***@redhat.com>
Signed-off-by: Pratyush Anand <***@redhat.com>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/kernel/setup.c | 7 +++++-
arch/arm64/mm/init.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index b051367e2149..515e9c6696df 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -31,7 +31,6 @@
#include <linux/screen_info.h>
#include <linux/init.h>
#include <linux/kexec.h>
-#include <linux/crash_dump.h>
#include <linux/root_dev.h>
#include <linux/cpu.h>
#include <linux/interrupt.h>
@@ -224,6 +223,12 @@ static void __init request_standard_resources(void)
if (kernel_data.start >= res->start &&
kernel_data.end <= res->end)
request_resource(res, &kernel_data);
+#ifdef CONFIG_KEXEC_CORE
+ /* Userspace will find "Crash kernel" region in /proc/iomem. */
+ if (crashk_res.end && crashk_res.start >= res->start &&
+ crashk_res.end <= res->end)
+ request_resource(res, &crashk_res);
+#endif
}
}

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 6cddb566eb21..2aba75dc7720 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -30,12 +30,14 @@
#include <linux/gfp.h>
#include <linux/memblock.h>
#include <linux/sort.h>
+#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/dma-mapping.h>
#include <linux/dma-contiguous.h>
#include <linux/efi.h>
#include <linux/swiotlb.h>
#include <linux/vmalloc.h>
+#include <linux/kexec.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -76,6 +78,63 @@ static int __init early_initrd(char *p)
early_param("initrd", early_initrd);
#endif

+#ifdef CONFIG_KEXEC_CORE
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+{
+ unsigned long long crash_base, crash_size;
+ int ret;
+
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base);
+ /* no crashkernel= or invalid value specified */
+ if (ret || !crash_size)
+ return;
+
+ crash_size = PAGE_ALIGN(crash_size);
+
+ if (crash_base == 0) {
+ /* Current arm64 boot protocol requires 2MB alignment */
+ crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+ crash_size, SZ_2M);
+ if (crash_base == 0) {
+ pr_warn("Unable to allocate crashkernel (size:%llx)\n",
+ crash_size);
+ return;
+ }
+ } else {
+ /* User specifies base address explicitly. */
+ if (!memblock_is_region_memory(crash_base, crash_size) ||
+ memblock_is_region_reserved(crash_base, crash_size)) {
+ pr_warn("crashkernel has wrong address or size\n");
+ return;
+ }
+
+ if (!IS_ALIGNED(crash_base, SZ_2M)) {
+ pr_warn("crashkernel base address is not 2MB aligned\n");
+ return;
+ }
+ }
+ memblock_reserve(crash_base, crash_size);
+
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
+
+ crashk_res.start = crash_base;
+ crashk_res.end = crash_base + crash_size - 1;
+}
+#else
+static void __init reserve_crashkernel(void)
+{
+}
+#endif /* CONFIG_KEXEC_CORE */
+
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -331,6 +390,9 @@ void __init arm64_memblock_init(void)
arm64_dma_phys_limit = max_zone_dma_phys();
else
arm64_dma_phys_limit = PHYS_MASK + 1;
+
+ reserve_crashkernel();
+
dma_contiguous_reserve(arm64_dma_phys_limit);

memblock_allow_resize();
--
2.11.0
Mark Rutland
2017-02-01 15:26:09 UTC
Permalink
Hi,
Post by AKASHI Takahiro
+#ifdef CONFIG_KEXEC_CORE
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+{
+ unsigned long long crash_base, crash_size;
+ int ret;
+
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base);
+ /* no crashkernel= or invalid value specified */
+ if (ret || !crash_size)
+ return;
+
+ crash_size = PAGE_ALIGN(crash_size);
+
+ if (crash_base == 0) {
+ /* Current arm64 boot protocol requires 2MB alignment */
+ crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+ crash_size, SZ_2M);
+ if (crash_base == 0) {
+ pr_warn("Unable to allocate crashkernel (size:%llx)\n",
+ crash_size);
Nit: can we please make that "size: 0x%llx", so that it's always clearly
a hex number? e.g.

pr_warn("cannot allocate 0x%llx bytes for crashkernel\n",
crash_size);
Post by AKASHI Takahiro
+ return;
+ }
+ } else {
+ /* User specifies base address explicitly. */
+ if (!memblock_is_region_memory(crash_base, crash_size) ||
+ memblock_is_region_reserved(crash_base, crash_size)) {
+ pr_warn("crashkernel has wrong address or size\n");
+ return;
+ }
To better report the error, can we please split these up:

if (!memblock_is_region_memory(crash_base, crash_size)) {
pr_warn("cannot reserve crashkernel: region is not memory\n");
return;
}

if (memblock_is_region_reserved(crash_base, crash_size)) {
pr_warn("cannot reserve crashkernel: region overlaps reserved memory\n");
return;
}
Post by AKASHI Takahiro
+ if (!IS_ALIGNED(crash_base, SZ_2M)) {
+ pr_warn("crashkernel base address is not 2MB aligned\n");
+ return;
+ }
+ }
+ memblock_reserve(crash_base, crash_size);
+
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
We only page-align the size, so the MB will be a little off, but that's
probably OK. However, it would also be nicer to log the base as an
address.

Could we dump this as we do for the kernel memory layout? e.g.

pr_info("crashkernel reserved: 0x%016lx - 0x%016lx (%lld MB)\n",
crash_base, crash_base + crash_size, crash_size >> 20);

With those:

Acked-by: Mark Rutland <***@arm.com>

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 04:52:36 UTC
Permalink
Post by Mark Rutland
Hi,
Post by AKASHI Takahiro
+#ifdef CONFIG_KEXEC_CORE
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+{
+ unsigned long long crash_base, crash_size;
+ int ret;
+
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base);
+ /* no crashkernel= or invalid value specified */
+ if (ret || !crash_size)
+ return;
+
+ crash_size = PAGE_ALIGN(crash_size);
+
+ if (crash_base == 0) {
+ /* Current arm64 boot protocol requires 2MB alignment */
+ crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+ crash_size, SZ_2M);
+ if (crash_base == 0) {
+ pr_warn("Unable to allocate crashkernel (size:%llx)\n",
+ crash_size);
Nit: can we please make that "size: 0x%llx", so that it's always clearly
a hex number? e.g.
pr_warn("cannot allocate 0x%llx bytes for crashkernel\n",
crash_size);
OK
Post by Mark Rutland
Post by AKASHI Takahiro
+ return;
+ }
+ } else {
+ /* User specifies base address explicitly. */
+ if (!memblock_is_region_memory(crash_base, crash_size) ||
+ memblock_is_region_reserved(crash_base, crash_size)) {
+ pr_warn("crashkernel has wrong address or size\n");
+ return;
+ }
if (!memblock_is_region_memory(crash_base, crash_size)) {
pr_warn("cannot reserve crashkernel: region is not memory\n");
return;
}
if (memblock_is_region_reserved(crash_base, crash_size)) {
pr_warn("cannot reserve crashkernel: region overlaps reserved memory\n");
return;
}
OK, and
Post by Mark Rutland
Post by AKASHI Takahiro
+ if (!IS_ALIGNED(crash_base, SZ_2M)) {
+ pr_warn("crashkernel base address is not 2MB aligned\n");
pr_warn("cannot reserve crashkernel: base address is not 2MB aligned\n");
Post by Mark Rutland
Post by AKASHI Takahiro
+ return;
+ }
+ }
+ memblock_reserve(crash_base, crash_size);
+
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
We only page-align the size, so the MB will be a little off, but that's
probably OK. However, it would also be nicer to log the base as an
address.
You might notice that the exact same message is used by all the other
architectures, but
Post by Mark Rutland
Could we dump this as we do for the kernel memory layout? e.g.
pr_info("crashkernel reserved: 0x%016lx - 0x%016lx (%lld MB)\n",
crash_base, crash_base + crash_size, crash_size >> 20);
We can go either way.
Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
Mark Rutland
2017-02-02 11:26:20 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
We only page-align the size, so the MB will be a little off, but that's
probably OK. However, it would also be nicer to log the base as an
address.
You might notice that the exact same message is used by all the other
architectures, but
Almost all; I see arch/sh prints the address with %08x. ;)
Post by AKASHI Takahiro
Post by Mark Rutland
Could we dump this as we do for the kernel memory layout? e.g.
pr_info("crashkernel reserved: 0x%016lx - 0x%016lx (%lld MB)\n",
crash_base, crash_base + crash_size, crash_size >> 20);
We can go either way.
Even if it's different from other architectures, I'd prefer to log as
above, with the range in hex.

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 13:44:24 UTC
Permalink
Post by Mark Rutland
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
We only page-align the size, so the MB will be a little off, but that's
probably OK. However, it would also be nicer to log the base as an
address.
You might notice that the exact same message is used by all the other
architectures, but
Almost all; I see arch/sh prints the address with %08x. ;)
Post by AKASHI Takahiro
Post by Mark Rutland
Could we dump this as we do for the kernel memory layout? e.g.
pr_info("crashkernel reserved: 0x%016lx - 0x%016lx (%lld MB)\n",
crash_base, crash_base + crash_size, crash_size >> 20);
We can go either way.
Even if it's different from other architectures, I'd prefer to log as
above, with the range in hex.
OK.

-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
AKASHI Takahiro
2017-02-01 12:46:29 UTC
Permalink
Kdump is enabled in the defconfig by default, just as kexec is.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/configs/defconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 33b744d54739..94c2ea523a8a 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -79,6 +79,7 @@ CONFIG_CMA=y
CONFIG_SECCOMP=y
CONFIG_XEN=y
CONFIG_KEXEC=y
+CONFIG_CRASH_DUMP=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_COMPAT=y
CONFIG_CPU_IDLE=y
--
2.11.0
AKASHI Takahiro
2017-02-01 12:46:28 UTC
Permalink
Add arch-specific functions to provide a dump file, /proc/vmcore.

This file is in ELF format and its ELF header needs to be prepared by
userspace tools, like kexec-tools, in advance. The primary kernel is
responsible for allocating the region with reserve_elfcorehdr() at boot
time and advertising its location to the crash dump kernel via a new
device-tree property, "linux,elfcorehdr".

Then the crash dump kernel will access the primary kernel's memory with
copy_oldmem_page(), which feeds the data page-by-page by ioremap'ing it
since it does not reside in the linear mapping of the crash dump kernel.

We also need our own elfcorehdr_read() here since the header is placed
within the crash dump kernel's usable memory.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/Kconfig | 11 +++++++
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 ++++++++++++++++++++++++++++++++++++++++++
arch/arm64/mm/init.c | 53 +++++++++++++++++++++++++++++++
4 files changed, 136 insertions(+)
create mode 100644 arch/arm64/kernel/crash_dump.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 111742126897..2bd6a1a062b9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -693,6 +693,17 @@ config KEXEC
but it is independent of the system firmware. And like a reboot
you can start any kernel with it, not just Linux.

+config CRASH_DUMP
+ bool "Build kdump crash kernel"
+ help
+ Generate crash dump after being started by kexec. This should
+ be normally only set in special crash dump kernels which are
+ loaded in the main kernel with kexec-tools into a specially
+ reserved region and then later executed after a crash by
+ kdump/kexec.
+
+ For more details see Documentation/kdump/kdump.txt
+
config XEN_DOM0
def_bool y
depends on XEN
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 7d66bbaafc0c..6a7384eee08d 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -50,6 +50,7 @@ arm64-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
arm64-obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
arm64-obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o \
cpu-reset.o
+arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o

obj-y += $(arm64-obj-y) vdso/ probes/
obj-m += $(arm64-obj-m)
diff --git a/arch/arm64/kernel/crash_dump.c b/arch/arm64/kernel/crash_dump.c
new file mode 100644
index 000000000000..c3d5a21c081e
--- /dev/null
+++ b/arch/arm64/kernel/crash_dump.c
@@ -0,0 +1,71 @@
+/*
+ * Routines for doing kexec-based kdump
+ *
+ * Copyright (C) 2014 Linaro Limited
+ * Author: AKASHI Takahiro <***@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/crash_dump.h>
+#include <linux/errno.h>
+#include <linux/io.h>
+#include <linux/memblock.h>
+#include <linux/uaccess.h>
+#include <asm/memory.h>
+
+/**
+ * copy_oldmem_page() - copy one page from old kernel memory
+ * @pfn: page frame number to be copied
+ * @buf: buffer where the copied page is placed
+ * @csize: number of bytes to copy
+ * @offset: offset in bytes into the page
+ * @userbuf: if set, @buf is in a user address space
+ *
+ * This function copies one page from old kernel memory into buffer pointed by
+ * @buf. If @buf is in userspace, set @userbuf to %1. Returns number of bytes
+ * copied or negative error in case of failure.
+ */
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
+ size_t csize, unsigned long offset,
+ int userbuf)
+{
+ void *vaddr;
+
+ if (!csize)
+ return 0;
+
+ vaddr = memremap(__pfn_to_phys(pfn), PAGE_SIZE, MEMREMAP_WB);
+ if (!vaddr)
+ return -ENOMEM;
+
+ if (userbuf) {
+ if (copy_to_user((char __user *)buf, vaddr + offset, csize)) {
+ memunmap(vaddr);
+ return -EFAULT;
+ }
+ } else {
+ memcpy(buf, vaddr + offset, csize);
+ }
+
+ memunmap(vaddr);
+
+ return csize;
+}
+
+/**
+ * elfcorehdr_read - read from ELF core header
+ * @buf: buffer where the data is placed
+ * @csize: number of bytes to read
+ * @ppos: address in the memory
+ *
+ * This function reads @count bytes from elf core header which exists
+ * on crash dump kernel's memory.
+ */
+ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+ memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
+ return count;
+}
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 2aba75dc7720..323b87197e18 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -38,6 +38,7 @@
#include <linux/swiotlb.h>
#include <linux/vmalloc.h>
#include <linux/kexec.h>
+#include <linux/crash_dump.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -135,6 +136,56 @@ static void __init reserve_crashkernel(void)
}
#endif /* CONFIG_KEXEC_CORE */

+#ifdef CONFIG_CRASH_DUMP
+static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ const __be32 *reg;
+ int len;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,elfcorehdr", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ elfcorehdr_addr = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ elfcorehdr_size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves elf core header given in "elfcorehdr=" kernel
+ * command line parameter. This region contains all the information about
+ * primary kernel's core image and is used by a dump capture kernel to
+ * access the system memory on primary kernel.
+ */
+static void __init reserve_elfcorehdr(void)
+{
+ of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);
+
+ if (!elfcorehdr_size)
+ return;
+
+ if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
+ pr_warn("elfcorehdr is overlapped\n");
+ return;
+ }
+
+ memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
+
+ pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
+ elfcorehdr_size >> 10, elfcorehdr_addr);
+}
+#else
+static void __init reserve_elfcorehdr(void)
+{
+}
+#endif /* CONFIG_CRASH_DUMP */
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -393,6 +444,8 @@ void __init arm64_memblock_init(void)

reserve_crashkernel();

+ reserve_elfcorehdr();
+
dma_contiguous_reserve(arm64_dma_phys_limit);

memblock_allow_resize();
--
2.11.0
Mark Rutland
2017-02-01 19:21:22 UTC
Permalink
Post by AKASHI Takahiro
Add arch-specific functions to provide a dump file, /proc/vmcore.
This file is in ELF format and its ELF header needs to be prepared by
userspace tools, like kexec-tools, in advance. The primary kernel is
responsible for allocating the region with reserve_elfcorehdr() at boot
time and advertising its location to the crash dump kernel via a new
device-tree property, "linux,elfcorehdr".
+static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ const __be32 *reg;
+ int len;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,elfcorehdr", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ elfcorehdr_addr = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ elfcorehdr_size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves elf core header given in "elfcorehdr=" kernel
+ * command line parameter. This region contains all the information about
+ * primary kernel's core image and is used by a dump capture kernel to
+ * access the system memory on primary kernel.
+ */
+static void __init reserve_elfcorehdr(void)
+{
+ of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);
+
+ if (!elfcorehdr_size)
+ return;
+
+ if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
+ pr_warn("elfcorehdr is overlapped\n");
+ return;
+ }
+
+ memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
+
+ pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
+ elfcorehdr_size >> 10, elfcorehdr_addr);
+}
This doesn't seem right to me. The logic here doesn't match the commit
message, the comment above reserve_elfcorehdr() doesn't match the
implementation, and this doesn't match my understanding of how this was
intended to be used from the DT binding.

I had assumed that we'd treat this in much the same way as the
linux,reserved-memory-region property, with the primary kernel either
dynamically allocating the region or using a command line option, and
the base being exposed to userspace via /sys/ or /proc/ somehow.

Why is that not the case?

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 06:24:09 UTC
Permalink
Post by Mark Rutland
Post by AKASHI Takahiro
Add arch-specific functions to provide a dump file, /proc/vmcore.
This file is in ELF format and its ELF header needs to be prepared by
userspace tools, like kexec-tools, in advance. The primary kernel is
responsible for allocating the region with reserve_elfcorehdr() at boot
time and advertising its location to the crash dump kernel via a new
device-tree property, "linux,elfcorehdr".
+static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ const __be32 *reg;
+ int len;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,elfcorehdr", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ elfcorehdr_addr = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ elfcorehdr_size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves elf core header given in "elfcorehdr=" kernel
+ * command line parameter. This region contains all the information about
+ * primary kernel's core image and is used by a dump capture kernel to
+ * access the system memory on primary kernel.
+ */
+static void __init reserve_elfcorehdr(void)
+{
+ of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);
+
+ if (!elfcorehdr_size)
+ return;
+
+ if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
+ pr_warn("elfcorehdr is overlapped\n");
+ return;
+ }
+
+ memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
+
+ pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
+ elfcorehdr_size >> 10, elfcorehdr_addr);
+}
This doesn't seem right to me. The logic here doesn't match the commit
message, the comment above reserve_elfcorehdr() doesn't match the
implementation, and this doesn't match my understanding of how this was
intended to be used from the DT binding.
Surely the commit message was wrong/misleading; it should say:

===8<===
Arch-specific functions are added to allow for implementing a crash dump
file interface, /proc/vmcore, which can be viewed as an ELF file.

A user space tool, like kexec-tools, is responsible for allocating
a separate region for the core's ELF header within the crash dump kernel's
memory and for filling it in when executing kexec_load().

Then, its location will be advertised to the crash dump kernel via a new
device-tree property, "linux,elfcorehdr", and the crash dump kernel
preserves the region for later use with reserve_elfcorehdr() at boot time.

On the crash dump kernel, /proc/vmcore will access the primary kernel's
memory with copy_oldmem_page(), which feeds the data page by page by
ioremap'ing it, since it does not reside in the linear mapping of the crash
dump kernel.

Meanwhile, elfcorehdr_read() is simple, as the region is always mapped.
===>8===

Does this make things clear?
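
For illustration, here is a minimal sketch of what the two arch hooks
described above might look like, assuming memremap() is used for the
temporary mapping; this is a sketch for readers following the thread, not
necessarily the exact code posted in this series:

#include <linux/crash_dump.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/uaccess.h>

/*
 * The old kernel's pages are not in the crash dump kernel's linear map,
 * so map each page temporarily before copying it out.
 */
ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
			 size_t csize, unsigned long offset,
			 int userbuf)
{
	void *vaddr;

	if (!csize)
		return 0;

	vaddr = memremap(__pfn_to_phys(pfn), PAGE_SIZE, MEMREMAP_WB);
	if (!vaddr)
		return -ENOMEM;

	if (userbuf) {
		if (copy_to_user((char __user *)buf, vaddr + offset, csize)) {
			memunmap(vaddr);
			return -EFAULT;
		}
	} else {
		memcpy(buf, vaddr + offset, csize);
	}

	memunmap(vaddr);

	return csize;
}

/*
 * The elf core header region was preserved by reserve_elfcorehdr() and is
 * covered by the crash dump kernel's linear mapping, so read it directly.
 */
ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
{
	memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
	return count;
}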
Post by Mark Rutland
I had assumed that we'd treat this in much the same way as the
linux,reserved-memory-region property, with the primary kernel either
dynamically allocating the region or using a command line option, and
the base being exposed to userspace via /sys/ or /proc/ somehow.
I didn't get the point here, but please note that the data in the ELF core
header is produced by kexec-tools (which also knows its location) and is
consumed solely by the crash dump kernel.

Thanks,
-Takahiro AKASHI
Post by Mark Rutland
Why is that not the case?
Thanks,
Mark.
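
To make that flow concrete: the loader (e.g. kexec-tools) picks a spot for
the elf core header inside the crashkernel region, writes the header there,
and then advertises that range to the crash dump kernel through the
"linux,elfcorehdr" property in /chosen. A hypothetical userspace-side sketch
using libfdt follows; the helper name is invented purely for illustration
and it assumes 2 address cells and 2 size cells, as in the arm64 binding
example:

#include <libfdt.h>
#include <stdint.h>

/* Hypothetical helper: record the elf core header region in the DTB that
 * will be handed to the crash dump kernel.  Assumes #address-cells = 2
 * and #size-cells = 2. */
static int set_elfcorehdr_prop(void *dtb, uint64_t base, uint64_t size)
{
	fdt64_t cells[2] = { cpu_to_fdt64(base), cpu_to_fdt64(size) };
	int off = fdt_path_offset(dtb, "/chosen");

	if (off < 0)
		return off;

	return fdt_setprop(dtb, off, "linux,elfcorehdr", cells, sizeof(cells));
}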
Mark Rutland
2017-02-02 12:03:20 UTC
Permalink
Post by AKASHI Takahiro
Post by Mark Rutland
Post by AKASHI Takahiro
Add arch-specific functions to provide a dump file, /proc/vmcore.
This file is in ELF format and its ELF header needs to be prepared by
userspace tools, like kexec-tools, in advance. The primary kernel is
responsible for allocating the region with reserve_elfcorehdr() at boot time
and for advertising its location to the crash dump kernel via a new
device-tree property, "linux,elfcorehdr".
+static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ const __be32 *reg;
+ int len;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,elfcorehdr", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ elfcorehdr_addr = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ elfcorehdr_size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves elf core header given in "elfcorehdr=" kernel
+ * command line parameter. This region contains all the information about
+ * primary kernel's core image and is used by a dump capture kernel to
+ * access the system memory on primary kernel.
+ */
+static void __init reserve_elfcorehdr(void)
+{
+ of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);
+
+ if (!elfcorehdr_size)
+ return;
+
+ if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
+ pr_warn("elfcorehdr is overlapped\n");
+ return;
+ }
+
+ memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
+
+ pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
+ elfcorehdr_size >> 10, elfcorehdr_addr);
+}
This doesn't seem right to me. The logic here doesn't match the commit
message, the comment above reserve_elfcorehdr() doesn't match the
implementation, and this doesn't match my understanding of how this was
intended to be used from the DT binding.
Surely the commit message was wrong/misleading; it should say:
===8<===
Arch-specific functions are added to allow for implementing a crash dump
file interface, /proc/vmcore, which can be viewed as an ELF file.
A user space tool, like kexec-tools, is responsible for allocating
a separate region for the core's ELF header within the crash dump kernel's
memory and for filling it in when executing kexec_load().
Then, its location will be advertised to the crash dump kernel via a new
device-tree property, "linux,elfcorehdr", and the crash dump kernel
preserves the region for later use with reserve_elfcorehdr() at boot time.
On the crash dump kernel, /proc/vmcore will access the primary kernel's
memory with copy_oldmem_page(), which feeds the data page by page by
ioremap'ing it, since it does not reside in the linear mapping of the crash
dump kernel.
Meanwhile, elfcorehdr_read() is simple, as the region is always mapped.
===>8===
Does this make things clear?
Yes. That sounds roughly like what I was expecting. Thanks for
clarifying that!

Can you also fix the comment above reserve_elfcorehdr()? It refers to a
non-existent kernel command line option.

... at the same time, can we change the print there to something like:

pr_info("elfcorehdr found and reserved at 0x%016llx - 0x%016llx (%lld KB)\n",
elfcorehdr_addr, elfcorehdr_addr + elfcorehdr_size,
elfcorehdr_size >> 10);

... that makes it clear that we've found an existing elfcorehdr, rather
than reserving space to fill with an elfcorehdr later, and printing the
region base and size in that way aligns with what we do elsewhere.

With all that:

Acked-by: Mark Rutland <***@arm.com>

Thanks,
Mark.
Mark Rutland
2017-02-02 12:08:50 UTC
Permalink
Post by Mark Rutland
Post by AKASHI Takahiro
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves elf core header given in "elfcorehdr=" kernel
+ * command line parameter. This region contains all the information about
+ * primary kernel's core image and is used by a dump capture kernel to
+ * access the system memory on primary kernel.
+ */
Can you also fix the comment above reserve_elfcorehdr()? It refers to a
non-existent kernel command line option.
... or rather, one that's handled by core code, not
reserve_elfcorehdr().

So more specifically, it would be nicer to have something like:

/*
* reserve_elfcorehdr() - reserves memory for elf core header
*
* This function reserves the memory occupied by an elf core header
* described in the device tree. This region contains all the
* information about primary kernel's core image and is used by a dump
* capture kernel to access the system memory on primary kernel.
*/

Thanks,
Mark.
AKASHI Takahiro
2017-02-02 14:39:58 UTC
Permalink
Post by Mark Rutland
Post by Mark Rutland
Post by AKASHI Takahiro
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves elf core header given in "elfcorehdr=" kernel
+ * command line parameter. This region contains all the information about
+ * primary kernel's core image and is used by a dump capture kernel to
+ * access the system memory on primary kernel.
+ */
Can you also fix the comment above reserve_elfcorehdr()? It refers to a
non-existent kernel command line option.
... or rather, one that's handled by core code, not
reserve_elfcorehdr().
/*
* reserve_elfcorehdr() - reserves memory for elf core header
*
* This function reserves the memory occupied by an elf core header
* described in the device tree. This region contains all the
* information about primary kernel's core image and is used by a dump
* capture kernel to access the system memory on primary kernel.
*/
Perfect!

Thanks a lot,
-Takahiro AKASHI
Post by Mark Rutland
Thanks,
Mark.
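
Putting the two review comments together, reserve_elfcorehdr() would
presumably end up looking roughly like this; a sketch combining the
suggested comment and print, which may differ from the final committed
version:

/*
 * reserve_elfcorehdr() - reserves memory for elf core header
 *
 * This function reserves the memory occupied by an elf core header
 * described in the device tree. This region contains all the
 * information about primary kernel's core image and is used by a dump
 * capture kernel to access the system memory on primary kernel.
 */
static void __init reserve_elfcorehdr(void)
{
	of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);

	if (!elfcorehdr_size)
		return;

	if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
		pr_warn("elfcorehdr is overlapped\n");
		return;
	}

	memblock_reserve(elfcorehdr_addr, elfcorehdr_size);

	pr_info("elfcorehdr found and reserved at 0x%016llx - 0x%016llx (%lld KB)\n",
		elfcorehdr_addr, elfcorehdr_addr + elfcorehdr_size,
		elfcorehdr_size >> 10);
}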
AKASHI Takahiro
2017-02-01 12:48:20 UTC
Permalink
From: James Morse <***@arm.com>

Add documentation for DT properties:
linux,usable-memory-range
linux,elfcorehdr
used by arm64 kdump. These describe the usable memory range for the crash
dump kernel and the elfcorehdr's location within it, respectively.

Signed-off-by: James Morse <***@arm.com>
[***@linaro.org: added "linux,crashkernel-base" and "-size" ]
Signed-off-by: AKASHI Takahiro <***@linaro.org>
Acked-by: Mark Rutland <***@arm.com>
Cc: ***@vger.kernel.org
Cc: Rob Herring <robh+***@kernel.org>
---
Documentation/devicetree/bindings/chosen.txt | 37 ++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt
index 6ae9d82d4c37..8dc82431acc1 100644
--- a/Documentation/devicetree/bindings/chosen.txt
+++ b/Documentation/devicetree/bindings/chosen.txt
@@ -52,3 +52,40 @@ This property is set (currently only on PowerPC, and only needed on
book3e) by some versions of kexec-tools to tell the new kernel that it
is being booted by kexec, as the booting environment may differ (e.g.
a different secondary CPU release mechanism)
+
+linux,usable-memory-range
+-------------------------
+
+This property (arm64 only) holds a base address and size, describing a
+limited region in which memory may be considered available for use by
+the kernel. Memory outside of this range is not available for use.
+
+This property describes a limitation: memory within this range is only
+valid when also described through another mechanism that the kernel
+would otherwise use to determine available memory (e.g. memory nodes
+or the EFI memory map). Valid memory may be sparse within the range.
+e.g.
+
+/ {
+ chosen {
+ linux,usable-memory-range = <0x9 0xf0000000 0x0 0x10000000>;
+ };
+};
+
+The main usage is for crash dump kernel to identify its own usable
+memory and exclude, at its boot time, any other memory areas that are
+part of the panicked kernel's memory.
+
+linux,elfcorehdr
+----------------
+
+This property (currently used only on arm64) holds the memory range,
+the address and the size, of the elf core header which mainly describes
+the panicked kernel's memory layout as PT_LOAD segments of elf format.
+e.g.
+
+/ {
+ chosen {
+ linux,elfcorehdr = <0x9 0xfffff000 0x0 0x800>;
+ };
+};
--
2.11.0
AKASHI Takahiro
2017-02-01 12:46:30 UTC
Permalink
Add arch-specific descriptions about kdump usage on arm64 to kdump.txt.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Baoquan He <***@redhat.com>
Acked-by: Dave Young <***@redhat.com>
Acked-by: Catalin Marinas <***@arm.com>
---
Documentation/kdump/kdump.txt | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index b0eb27b956d9..615434d81108 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
a remote system.

Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
-s390x and arm architectures.
+s390x, arm and arm64 architectures.

When the system kernel boots, it reserves a small section of memory for
the dump-capture kernel. This ensures that ongoing Direct Memory Access
@@ -249,6 +249,13 @@ Dump-capture kernel config options (Arch Dependent, arm)

AUTO_ZRELADDR=y

+Dump-capture kernel config options (Arch Dependent, arm64)
+----------------------------------------------------------
+
+- Please note that kvm of the dump-capture kernel will not be enabled
+ on non-VHE systems even if it is configured. This is because the CPU
+ will not be reset to EL2 on panic.
+
Extended crashkernel syntax
===========================

@@ -305,6 +312,8 @@ Boot into System Kernel
kernel will automatically locate the crash kernel image within the
first 512MB of RAM if X is not given.

+ On arm64, use "crashkernel=Y[@X]". Note that the start address of
+ the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).

Load the Dump-capture Kernel
============================
@@ -327,6 +336,8 @@ For s390x:
- Use image or bzImage
For arm:
- Use zImage
+For arm64:
+ - Use vmlinux or Image

If you are using a uncompressed vmlinux image then use following command
to load dump-capture kernel.
@@ -370,6 +381,9 @@ For s390x:
For arm:
"1 maxcpus=1 reset_devices"

+For arm64:
+ "1 maxcpus=1 reset_devices"
+
Notes on loading the dump-capture kernel:

* By default, the ELF headers are stored in ELF64 format to support
--
2.11.0