Discussion:
[PATCH v30 00/11] arm64: add kdump support
AKASHI Takahiro
2017-01-24 08:46:38 UTC
Permalink
This patch series adds kdump support on arm64.

To load a crash dump kernel into the system, a series of patches to
kexec-tools [1] is also needed. Please use the latest version, v5 [2].
For your convenience, you can pick them up from:
https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump

To examine vmcore (/proc/vmcore) on the crash dump kernel, you can use
- crash utility (v7.1.8 or later, i.e. master for now) [3]
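
For example, once the crash dump kernel has booted, the dump can be opened
roughly as follows (the path to vmlinux is illustrative; it must be the
vmlinux, with debug info, of the crashed kernel):

    # crash /path/to/vmlinux /proc/vmcore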

I tested this patchset on fast model and hikey.
The previous version, v29, was also:
Tested-by: Pratyush Anand <***@redhat.com> (mustang and seattle)

Changes for v30 (Jan 24, 2017)
o rebased to Linux-v4.10-rc5
o remove "linux,crashkernel-base/size" from exported device tree
o protect memory region for crash-dump kernel (adding patch#4,5)
o remove "in_crash_kexec" variable
o and other trivial changes

Changes for v29 (Dec 28, 2016)
o rebased to Linux-v4.10-rc1
o change asm constraints in crash_setup_regs() per Catalin

Changes for v28 (Nov 22, 2016)
o rebased to Linux-v4.9-rc6
o revamp patch #1 and merge memblock_cap_memory_range() with
memblock_mem_limit_remove_map()

Changes for v27 (Nov 1, 2016)
o rebased to Linux-v4.9-rc3
o revert v26 change, i.e. revive "linux,usable-memory-range" property
(patch #2/#3, updating patch #9)
o minor fixes per review comments (patch #3/#4/#6/#8)
o re-order patches and improve commit messages for readability

Changes for v26 (Sep 7, 2016):
o Use /reserved-memory instead of "linux,usable-memory-range" property
(dropping v25's patch#2 and #3, updating ex-patch#9.)

Changes for v25 (Aug 29, 2016):
o Rebase to Linux-4.8-rc4
o Use memremap() instead of ioremap_cache() [patch#5]

Changes for v24 (Aug 9, 2016):
o Rebase to Linux-4.8-rc1
o Update descriptions about newly added DT properties

Changes for v23 (July 26, 2016):

o Move memblock_reserve() to a single place in reserve_crashkernel()
o Use cpu_park_loop() in ipi_cpu_crash_stop()
o Always enforce ARCH_LOW_ADDRESS_LIMIT to the memory range of crash kernel
o Re-implement fdt_enforce_memory_region() to remove non-reserve regions
(for ACPI) from usable memory at crash kernel

Changes for v22 (July 12, 2016):

o Export "crashkernel-base" and "crashkernel-size" via device-tree,
and add some descriptions about them in chosen.txt
o Rename "usable-memory" to "usable-memory-range" to avoid inconsistency
with powerpc's "usable-memory"
o Make cosmetic changes regarding "ifdef" usage
o Correct some wordings in kdump.txt

Changes for v21 (July 6, 2016):

o Remove kexec patches.
o Rebase to arm64's for-next/core (Linux-4.7-rc4 based).
o Clarify the description about kvm in kdump.txt.

See the link [4] for older changes.


[1] https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
[2] http://lists.infradead.org/pipermail/kexec/2017-January/018002.html
[3] https://github.com/crash-utility/crash.git
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-June/438780.html

AKASHI Takahiro (10):
memblock: add memblock_cap_memory_range()
arm64: limit memory regions based on DT property, usable-memory-range
arm64: kdump: reserve memory for crash dump kernel
arm64: mm: allow for unmapping memory region from kernel mapping
arm64: kdump: protect crash dump kernel memory
arm64: kdump: implement machine_crash_shutdown()
arm64: kdump: add VMCOREINFO's for user-space tools
arm64: kdump: provide /proc/vmcore file
arm64: kdump: enable kdump in defconfig
Documentation: kdump: describe arm64 port

James Morse (1):
Documentation: dt: chosen properties for arm64 kdump

Documentation/devicetree/bindings/chosen.txt | 37 +++++++
Documentation/kdump/kdump.txt | 16 ++-
arch/arm64/Kconfig | 11 ++
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 +++++++-
arch/arm64/include/asm/mmu.h | 2 +
arch/arm64/include/asm/pgtable-hwdef.h | 2 +
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/include/asm/pgtable.h | 4 +
arch/arm64/include/asm/smp.h | 2 +
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 +++++++++++++
arch/arm64/kernel/machine_kexec.c | 132 ++++++++++++++++++-----
arch/arm64/kernel/setup.c | 7 +-
arch/arm64/kernel/smp.c | 63 +++++++++++
arch/arm64/mm/init.c | 150 +++++++++++++++++++++++++++
arch/arm64/mm/mmu.c | 63 +++++++++--
include/linux/memblock.h | 1 +
mm/memblock.c | 44 +++++---
20 files changed, 596 insertions(+), 56 deletions(-)
create mode 100644 arch/arm64/kernel/crash_dump.c
--
2.11.0
AKASHI Takahiro
2017-01-24 08:49:07 UTC
Permalink
Add memblock_cap_memory_range() which will remove all the memblock regions
except the memory range specified in the arguments. In addition, rework is
done on memblock_mem_limit_remove_map() to re-implement it using
memblock_cap_memory_range().

This function, like memblock_mem_limit_remove_map(), will not remove
memblock regions with the MEMBLOCK_NOMAP attribute, as they may be mapped
and accessed later as "device memory."
See commit a571d4eb55d8 ("mm/memblock.c: add new infrastructure to
address the mem limit issue").

This function will be used, in a succeeding patch in this series of arm64
kdump support, to limit the range of usable memory, or System RAM, on the
crash dump kernel.
(Please note that the "mem=" parameter is of little use for this purpose.)
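
As a rough illustration of the intended use (the real caller is added in a
later patch of this series; the base and size below are arbitrary example
values):

    /* keep only [base, base + size) as usable System RAM */
    memblock_cap_memory_range(0x80000000ULL /* base */,
                              0x20000000ULL /* size */);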

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Will Deacon <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
Acked-by: Dennis Chen <***@arm.com>
Cc: linux-***@kvack.org
Cc: Andrew Morton <***@linux-foundation.org>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 44 +++++++++++++++++++++++++++++---------------
2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5b759c9acf97..fbfcacc50c29 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -333,6 +333,7 @@ phys_addr_t memblock_mem_size(unsigned long limit_pfn);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
void memblock_enforce_memory_limit(phys_addr_t memory_limit);
+void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
void memblock_mem_limit_remove_map(phys_addr_t limit);
bool memblock_is_memory(phys_addr_t addr);
int memblock_is_map_memory(phys_addr_t addr);
diff --git a/mm/memblock.c b/mm/memblock.c
index 7608bc305936..fea1688fef60 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1514,11 +1514,37 @@ void __init memblock_enforce_memory_limit(phys_addr_t limit)
(phys_addr_t)ULLONG_MAX);
}

+void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size)
+{
+ int start_rgn, end_rgn;
+ int i, ret;
+
+ if (!size)
+ return;
+
+ ret = memblock_isolate_range(&memblock.memory, base, size,
+ &start_rgn, &end_rgn);
+ if (ret)
+ return;
+
+ /* remove all the MAP regions */
+ for (i = memblock.memory.cnt - 1; i >= end_rgn; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ for (i = start_rgn - 1; i >= 0; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ /* truncate the reserved regions */
+ memblock_remove_range(&memblock.reserved, 0, base);
+ memblock_remove_range(&memblock.reserved,
+ base + size, (phys_addr_t)ULLONG_MAX);
+}
+
void __init memblock_mem_limit_remove_map(phys_addr_t limit)
{
- struct memblock_type *type = &memblock.memory;
phys_addr_t max_addr;
- int i, ret, start_rgn, end_rgn;

if (!limit)
return;
@@ -1529,19 +1555,7 @@ void __init memblock_mem_limit_remove_map(phys_addr_t limit)
if (max_addr == (phys_addr_t)ULLONG_MAX)
return;

- ret = memblock_isolate_range(type, max_addr, (phys_addr_t)ULLONG_MAX,
- &start_rgn, &end_rgn);
- if (ret)
- return;
-
- /* remove all the MAP regions above the limit */
- for (i = end_rgn - 1; i >= start_rgn; i--) {
- if (!memblock_is_nomap(&type->regions[i]))
- memblock_remove_region(type, i);
- }
- /* truncate the reserved regions */
- memblock_remove_range(&memblock.reserved, max_addr,
- (phys_addr_t)ULLONG_MAX);
+ memblock_cap_memory_range(0, max_addr);
}

static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
--
2.11.0
AKASHI Takahiro
2017-01-24 08:49:56 UTC
Permalink
A crash dump kernel uses only a subset of the available memory as System RAM.
On arm64, this memory range is advertised to the crash dump kernel via a
device-tree property under /chosen:
    linux,usable-memory-range = <BASE SIZE>

The crash dump kernel reads this property at boot time and calls
memblock_cap_memory_range() to limit the usable memory range, which would
otherwise be taken from the entries in the UEFI memory map or the "memory"
nodes of the device tree blob.
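
For illustration only, a /chosen node of the crash dump kernel's device
tree might then look like this (2 address cells / 2 size cells, made-up
addresses; the property is normally generated by kexec-tools, not written
by hand):

    chosen {
        linux,usable-memory-range = <0x0 0xc0000000 0x0 0x20000000>;
    };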

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Geoff Levand <***@infradead.org>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/mm/init.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 380ebe705093..6cddb566eb21 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -187,10 +187,45 @@ static int __init early_mem(char *p)
}
early_param("mem", early_mem);

+static int __init early_init_dt_scan_usablemem(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ struct memblock_region *usablemem = (struct memblock_region *)data;
+ const __be32 *reg;
+ int len;
+
+ usablemem->size = 0;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,usable-memory-range", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+static void __init fdt_enforce_memory_region(void)
+{
+ struct memblock_region reg;
+
+ of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
+
+ if (reg.size)
+ memblock_cap_memory_range(reg.base, reg.size);
+}
+
void __init arm64_memblock_init(void)
{
const s64 linear_region_size = -(s64)PAGE_OFFSET;

+ /* Handle linux,usable-memory-range property */
+ fdt_enforce_memory_region();
+
/*
* Ensure that the linear region takes up exactly half of the kernel
* virtual address space. This way, we can distinguish a linear address
--
2.11.0
AKASHI Takahiro
2017-01-24 08:49:57 UTC
Permalink
"crashkernel=" kernel parameter specifies the size (and optionally
the start address) of the system ram used by crash dump kernel.
reserve_crashkernel() will allocate and reserve the memory at the startup
of primary kernel.

This memory range will be exported to userspace via an entry named
"Crash kernel" in /proc/iomem.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Signed-off-by: Mark Salter <***@redhat.com>
Signed-off-by: Pratyush Anand <***@redhat.com>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/kernel/setup.c | 7 +++++-
arch/arm64/mm/init.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index b051367e2149..515e9c6696df 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -31,7 +31,6 @@
#include <linux/screen_info.h>
#include <linux/init.h>
#include <linux/kexec.h>
-#include <linux/crash_dump.h>
#include <linux/root_dev.h>
#include <linux/cpu.h>
#include <linux/interrupt.h>
@@ -224,6 +223,12 @@ static void __init request_standard_resources(void)
if (kernel_data.start >= res->start &&
kernel_data.end <= res->end)
request_resource(res, &kernel_data);
+#ifdef CONFIG_KEXEC_CORE
+ /* Userspace will find "Crash kernel" region in /proc/iomem. */
+ if (crashk_res.end && crashk_res.start >= res->start &&
+ crashk_res.end <= res->end)
+ request_resource(res, &crashk_res);
+#endif
}
}

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 6cddb566eb21..2aba75dc7720 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -30,12 +30,14 @@
#include <linux/gfp.h>
#include <linux/memblock.h>
#include <linux/sort.h>
+#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/dma-mapping.h>
#include <linux/dma-contiguous.h>
#include <linux/efi.h>
#include <linux/swiotlb.h>
#include <linux/vmalloc.h>
+#include <linux/kexec.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -76,6 +78,63 @@ static int __init early_initrd(char *p)
early_param("initrd", early_initrd);
#endif

+#ifdef CONFIG_KEXEC_CORE
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves memory area given in "crashkernel=" kernel command
+ * line parameter. The memory reserved is used by dump capture kernel when
+ * primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+{
+ unsigned long long crash_base, crash_size;
+ int ret;
+
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base);
+ /* no crashkernel= or invalid value specified */
+ if (ret || !crash_size)
+ return;
+
+ crash_size = PAGE_ALIGN(crash_size);
+
+ if (crash_base == 0) {
+ /* Current arm64 boot protocol requires 2MB alignment */
+ crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+ crash_size, SZ_2M);
+ if (crash_base == 0) {
+ pr_warn("Unable to allocate crashkernel (size:%llx)\n",
+ crash_size);
+ return;
+ }
+ } else {
+ /* User specifies base address explicitly. */
+ if (!memblock_is_region_memory(crash_base, crash_size) ||
+ memblock_is_region_reserved(crash_base, crash_size)) {
+ pr_warn("crashkernel has wrong address or size\n");
+ return;
+ }
+
+ if (!IS_ALIGNED(crash_base, SZ_2M)) {
+ pr_warn("crashkernel base address is not 2MB aligned\n");
+ return;
+ }
+ }
+ memblock_reserve(crash_base, crash_size);
+
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_size >> 20, crash_base >> 20);
+
+ crashk_res.start = crash_base;
+ crashk_res.end = crash_base + crash_size - 1;
+}
+#else
+static void __init reserve_crashkernel(void)
+{
+}
+#endif /* CONFIG_KEXEC_CORE */
+
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -331,6 +390,9 @@ void __init arm64_memblock_init(void)
arm64_dma_phys_limit = max_zone_dma_phys();
else
arm64_dma_phys_limit = PHYS_MASK + 1;
+
+ reserve_crashkernel();
+
dma_contiguous_reserve(arm64_dma_phys_limit);

memblock_allow_resize();
--
2.11.0
AKASHI Takahiro
2017-01-24 08:49:58 UTC
Permalink
The current implementation of create_mapping_late() only allows modifying
permission attributes (read-only or read-write) of the existing kernel
mapping.

In this patch, the PAGE_KERNEL_INVALID protection attribute is introduced.
We will now be able to invalidate (or unmap) some part of the existing
kernel mapping by specifying PAGE_KERNEL_INVALID to create_mapping_late().

This feature will be used in a succeeding kdump patch to protect the memory
reserved for the crash dump kernel once a kdump image has been loaded.
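
As a rough sketch of how the succeeding kdump patch is expected to use this
(the exact code may differ; crashk_res and __phys_to_virt() are existing
kernel symbols):

    /* unmap the crash kernel region once an image has been loaded */
    create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
                        resource_size(&crashk_res), PAGE_KERNEL_INVALID);
    flush_tlb_all();

    /* ... and map it back before jumping into the crash dump kernel */
    create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
                        resource_size(&crashk_res), PAGE_KERNEL);
    flush_tlb_all();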

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
arch/arm64/include/asm/mmu.h | 2 ++
arch/arm64/include/asm/pgtable-hwdef.h | 2 ++
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/include/asm/pgtable.h | 4 ++++
arch/arm64/mm/mmu.c | 29 ++++++++++++++++++++---------
5 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 47619411f0ff..a6c1367527bc 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -36,6 +36,8 @@ extern void init_mem_pgprot(void);
extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot, bool page_mappings_only);
+extern void create_mapping_late(phys_addr_t phys, unsigned long virt,
+ phys_addr_t size, pgprot_t prot);
extern void *fixmap_remap_fdt(phys_addr_t dt_phys);

#endif
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index eb0c2bd90de9..e66efec31ca9 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -119,6 +119,7 @@
#define PUD_TABLE_BIT (_AT(pgdval_t, 1) << 1)
#define PUD_TYPE_MASK (_AT(pgdval_t, 3) << 0)
#define PUD_TYPE_SECT (_AT(pgdval_t, 1) << 0)
+#define PUD_VALID PUD_TYPE_SECT

/*
* Level 2 descriptor (PMD).
@@ -128,6 +129,7 @@
#define PMD_TYPE_TABLE (_AT(pmdval_t, 3) << 0)
#define PMD_TYPE_SECT (_AT(pmdval_t, 1) << 0)
#define PMD_TABLE_BIT (_AT(pmdval_t, 1) << 1)
+#define PMD_VALID PMD_TYPE_SECT

/*
* Section
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
#define PAGE_KERNEL_EXEC_CONT __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID __pgprot(0)

#define PAGE_HYP __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
#define PAGE_HYP_EXEC __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ffbb9a520563..1904a7c07018 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -364,6 +364,8 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,

#define pmd_bad(pmd) (!(pmd_val(pmd) & PMD_TABLE_BIT))

+#define pmd_valid(pmd) (!!(pmd_val(pmd) & PMD_VALID))
+
#define pmd_table(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_TABLE)
#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
@@ -428,6 +430,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)

#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
+#define pud_valid(pud) (!!(pud_val(pud) & PUD_VALID))
#define pud_present(pud) (pud_val(pud))

static inline void set_pud(pud_t *pudp, pud_t pud)
@@ -481,6 +484,7 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)

#define pgd_none(pgd) (!pgd_val(pgd))
#define pgd_bad(pgd) (!(pgd_val(pgd) & 2))
+#define pgd_valid(pgd) (!!(pgd_val(pgd) & 1))
#define pgd_present(pgd) (pgd_val(pgd))

static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..9c7adcce8e4e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -133,7 +133,8 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
* Set the contiguous bit for the subsequent group of PTEs if
* its size and alignment are appropriate.
*/
- if (((addr | PFN_PHYS(pfn)) & ~CONT_PTE_MASK) == 0) {
+ if ((pgprot_val(prot) & PTE_VALID) &&
+ (((addr | PFN_PHYS(pfn)) & ~CONT_PTE_MASK) == 0)) {
if (end - addr >= CONT_PTE_SIZE && !page_mappings_only)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
else
@@ -147,7 +148,8 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
* After the PTE entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));
+ BUG_ON(pte_valid(old_pte) && pte_valid(*pte) &&
+ !pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));

} while (pte++, addr += PAGE_SIZE, addr != end);

@@ -190,7 +192,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* Set the contiguous bit for the subsequent group of
* PMDs if its size and alignment are appropriate.
*/
- if (((addr | phys) & ~CONT_PMD_MASK) == 0) {
+ if ((pgprot_val(prot) | PMD_VALID) &&
+ ((addr | phys) & ~CONT_PMD_MASK) == 0) {
if (end - addr >= CONT_PMD_SIZE)
__prot = __pgprot(pgprot_val(prot) |
PTE_CONT);
@@ -203,7 +206,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* After the PMD entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
+ BUG_ON(pmd_valid(old_pmd) && pmd_valid(*pmd) &&
+ !pgattr_change_is_safe(pmd_val(old_pmd),
pmd_val(*pmd)));
} else {
alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
@@ -263,7 +267,8 @@ static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
* After the PUD entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
+ BUG_ON(pud_valid(old_pud) && pud_valid(*pud) &&
+ !pgattr_change_is_safe(pud_val(old_pud),
pud_val(*pud)));
} else {
alloc_init_pmd(pud, addr, next, phys, prot,
@@ -344,8 +349,8 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
pgd_pgtable_alloc, page_mappings_only);
}

-static void create_mapping_late(phys_addr_t phys, unsigned long virt,
- phys_addr_t size, pgprot_t prot)
+void create_mapping_late(phys_addr_t phys, unsigned long virt,
+ phys_addr_t size, pgprot_t prot)
{
if (virt < VMALLOC_START) {
pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
@@ -791,14 +796,20 @@ int __init arch_ioremap_pmd_supported(void)
int pud_set_huge(pud_t *pud, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PUD_MASK);
- set_pud(pud, __pud(phys | PUD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+ set_pud(pud, __pud(phys |
+ ((pgprot_val(prot) & PUD_VALID) ?
+ PUD_TYPE_SECT : 0) |
+ pgprot_val(mk_sect_prot(prot))));
return 1;
}

int pmd_set_huge(pmd_t *pmd, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PMD_MASK);
- set_pmd(pmd, __pmd(phys | PMD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+ set_pmd(pmd, __pmd(phys |
+ ((pgprot_val(prot) & PMD_VALID) ?
+ PMD_TYPE_SECT : 0) |
+ pgprot_val(mk_sect_prot(prot))));
return 1;
}
--
2.11.0
Pratyush Anand
2017-01-24 11:32:20 UTC
Permalink
Post by AKASHI Takahiro
The current implementation of create_mapping_late() is only allowed
to modify permission attributes (read-only or read-write) against
the existing kernel mapping.
In this patch, PAGE_KERNEL_INVALID protection attribute is introduced.
We will now be able to invalidate (or unmap) some part of the existing
kernel mapping by specifying PAGE_KERNEL_INVALID to create_mapping_late().
This feature will be used in a succeeding kdump patch to protect
the memory reserved for the crash dump kernel once it has been loaded.
---
arch/arm64/include/asm/mmu.h | 2 ++
arch/arm64/include/asm/pgtable-hwdef.h | 2 ++
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/include/asm/pgtable.h | 4 ++++
arch/arm64/mm/mmu.c | 29 ++++++++++++++++++++---------
5 files changed, 29 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 47619411f0ff..a6c1367527bc 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -36,6 +36,8 @@ extern void init_mem_pgprot(void);
extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot, bool page_mappings_only);
+extern void create_mapping_late(phys_addr_t phys, unsigned long virt,
+ phys_addr_t size, pgprot_t prot);
extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
#endif
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index eb0c2bd90de9..e66efec31ca9 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -119,6 +119,7 @@
#define PUD_TABLE_BIT (_AT(pgdval_t, 1) << 1)
#define PUD_TYPE_MASK (_AT(pgdval_t, 3) << 0)
#define PUD_TYPE_SECT (_AT(pgdval_t, 1) << 0)
+#define PUD_VALID PUD_TYPE_SECT
/*
* Level 2 descriptor (PMD).
@@ -128,6 +129,7 @@
#define PMD_TYPE_TABLE (_AT(pmdval_t, 3) << 0)
#define PMD_TYPE_SECT (_AT(pmdval_t, 1) << 0)
#define PMD_TABLE_BIT (_AT(pmdval_t, 1) << 1)
+#define PMD_VALID PMD_TYPE_SECT
/*
* Section
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
#define PAGE_KERNEL_EXEC_CONT __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID __pgprot(0)
#define PAGE_HYP __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
#define PAGE_HYP_EXEC __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ffbb9a520563..1904a7c07018 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -364,6 +364,8 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
#define pmd_bad(pmd) (!(pmd_val(pmd) & PMD_TABLE_BIT))
+#define pmd_valid(pmd) (!!(pmd_val(pmd) & PMD_VALID))
+
#define pmd_table(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_TABLE)
#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
@@ -428,6 +430,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
+#define pud_valid(pud) (!!(pud_val(pud) & PUD_VALID))
This will break compilation for CONFIG_PGTABLE_LEVELS <= 2
Post by AKASHI Takahiro
#define pud_present(pud) (pud_val(pud))
static inline void set_pud(pud_t *pudp, pud_t pud)
@@ -481,6 +484,7 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
#define pgd_none(pgd) (!pgd_val(pgd))
#define pgd_bad(pgd) (!(pgd_val(pgd) & 2))
+#define pgd_valid(pgd) (!!(pgd_val(pgd) & 1))
This has not been used anywhere.
Post by AKASHI Takahiro
#define pgd_present(pgd) (pgd_val(pgd))
static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..9c7adcce8e4e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -133,7 +133,8 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
* Set the contiguous bit for the subsequent group of PTEs if
* its size and alignment are appropriate.
*/
- if (((addr | PFN_PHYS(pfn)) & ~CONT_PTE_MASK) == 0) {
+ if ((pgprot_val(prot) & PTE_VALID) &&
+ (((addr | PFN_PHYS(pfn)) & ~CONT_PTE_MASK) == 0)) {
if (end - addr >= CONT_PTE_SIZE && !page_mappings_only)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
else
@@ -147,7 +148,8 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
* After the PTE entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));
+ BUG_ON(pte_valid(old_pte) && pte_valid(*pte) &&
+ !pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));
} while (pte++, addr += PAGE_SIZE, addr != end);
@@ -190,7 +192,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* Set the contiguous bit for the subsequent group of
* PMDs if its size and alignment are appropriate.
*/
- if (((addr | phys) & ~CONT_PMD_MASK) == 0) {
+ if ((pgprot_val(prot) | PMD_VALID) &&
+ ((addr | phys) & ~CONT_PMD_MASK) == 0) {
if (end - addr >= CONT_PMD_SIZE)
__prot = __pgprot(pgprot_val(prot) |
PTE_CONT);
@@ -203,7 +206,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* After the PMD entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
+ BUG_ON(pmd_valid(old_pmd) && pmd_valid(*pmd) &&
+ !pgattr_change_is_safe(pmd_val(old_pmd),
pmd_val(*pmd)));
} else {
alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
@@ -263,7 +267,8 @@ static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
* After the PUD entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
+ BUG_ON(pud_valid(old_pud) && pud_valid(*pud) &&
+ !pgattr_change_is_safe(pud_val(old_pud),
pud_val(*pud)));
} else {
alloc_init_pmd(pud, addr, next, phys, prot,
@@ -344,8 +349,8 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
pgd_pgtable_alloc, page_mappings_only);
}
-static void create_mapping_late(phys_addr_t phys, unsigned long virt,
- phys_addr_t size, pgprot_t prot)
+void create_mapping_late(phys_addr_t phys, unsigned long virt,
+ phys_addr_t size, pgprot_t prot)
{
if (virt < VMALLOC_START) {
pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
@@ -791,14 +796,20 @@ int __init arch_ioremap_pmd_supported(void)
int pud_set_huge(pud_t *pud, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PUD_MASK);
- set_pud(pud, __pud(phys | PUD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+ set_pud(pud, __pud(phys |
+ ((pgprot_val(prot) & PUD_VALID) ?
+ PUD_TYPE_SECT : 0) |
+ pgprot_val(mk_sect_prot(prot))));
return 1;
}
int pmd_set_huge(pmd_t *pmd, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PMD_MASK);
- set_pmd(pmd, __pmd(phys | PMD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+ set_pmd(pmd, __pmd(phys |
+ ((pgprot_val(prot) & PMD_VALID) ?
+ PMD_TYPE_SECT : 0) |
+ pgprot_val(mk_sect_prot(prot))));
return 1;
}
~Pratyush
AKASHI Takahiro
2017-01-25 06:37:00 UTC
Permalink
Post by Pratyush Anand
Post by AKASHI Takahiro
The current implementation of create_mapping_late() is only allowed
to modify permission attributes (read-only or read-write) against
the existing kernel mapping.
In this patch, PAGE_KERNEL_INVALID protection attribute is introduced.
We will now be able to invalidate (or unmap) some part of the existing
kernel mapping by specifying PAGE_KERNEL_INVALID to create_mapping_late().
This feature will be used in a succeeding kdump patch to protect
the memory reserved for the crash dump kernel once it has been loaded.
---
arch/arm64/include/asm/mmu.h | 2 ++
arch/arm64/include/asm/pgtable-hwdef.h | 2 ++
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/include/asm/pgtable.h | 4 ++++
arch/arm64/mm/mmu.c | 29 ++++++++++++++++++++---------
5 files changed, 29 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 47619411f0ff..a6c1367527bc 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -36,6 +36,8 @@ extern void init_mem_pgprot(void);
extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot, bool page_mappings_only);
+extern void create_mapping_late(phys_addr_t phys, unsigned long virt,
+ phys_addr_t size, pgprot_t prot);
extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
#endif
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index eb0c2bd90de9..e66efec31ca9 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -119,6 +119,7 @@
#define PUD_TABLE_BIT (_AT(pgdval_t, 1) << 1)
#define PUD_TYPE_MASK (_AT(pgdval_t, 3) << 0)
#define PUD_TYPE_SECT (_AT(pgdval_t, 1) << 0)
+#define PUD_VALID PUD_TYPE_SECT
/*
* Level 2 descriptor (PMD).
@@ -128,6 +129,7 @@
#define PMD_TYPE_TABLE (_AT(pmdval_t, 3) << 0)
#define PMD_TYPE_SECT (_AT(pmdval_t, 1) << 0)
#define PMD_TABLE_BIT (_AT(pmdval_t, 1) << 1)
+#define PMD_VALID PMD_TYPE_SECT
/*
* Section
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
#define PAGE_KERNEL_EXEC_CONT __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID __pgprot(0)
#define PAGE_HYP __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
#define PAGE_HYP_EXEC __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ffbb9a520563..1904a7c07018 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -364,6 +364,8 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
#define pmd_bad(pmd) (!(pmd_val(pmd) & PMD_TABLE_BIT))
+#define pmd_valid(pmd) (!!(pmd_val(pmd) & PMD_VALID))
+
#define pmd_table(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
PMD_TYPE_TABLE)
#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) == \
@@ -428,6 +430,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
#define pud_none(pud) (!pud_val(pud))
#define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
+#define pud_valid(pud) (!!(pud_val(pud) & PUD_VALID))
This will break compilation for CONFIG_PGTABLE_LEVELS <= 2
Ah, yes. A quick fix is as follows:

===8<===
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1904a7c07018..dc11d4bf332c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -467,6 +467,8 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)

#else

+#define pud_valid(pud) (1)
+
#define pud_page_paddr(pud) ({ BUILD_BUG(); 0; })

/* Match pmd_offset folding in <asm/generic/pgtable-nopmd.h> */
@@ -520,6 +522,8 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)

#else

+#define pgd_valid(pgd) (1)
+
#define pgd_page_paddr(pgd) ({ BUILD_BUG(); 0;})

/* Match pud_offset folding in <asm/generic/pgtable-nopud.h> */
===>8===

Now I've confirmed that it compiles under the following configurations:
* 4KB pages, 39- and 48-bit address spaces
* 64KB pages, 42- and 48-bit address spaces
and I also verified a crash dump image for the 64KB, 42/48-bit cases.
Post by Pratyush Anand
Post by AKASHI Takahiro
#define pud_present(pud) (pud_val(pud))
static inline void set_pud(pud_t *pudp, pud_t pud)
@@ -481,6 +484,7 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
#define pgd_none(pgd) (!pgd_val(pgd))
#define pgd_bad(pgd) (!(pgd_val(pgd) & 2))
+#define pgd_valid(pgd) (!!(pgd_val(pgd) & 1))
This has not been used anywhere.
Well, this patch actually also breaks ptdump (debugfs/kernel_page_tables),
as a descriptor can now be non-zero yet invalid.
Once this patch is accepted, I will post another patch that fixes the issue;
pgd_valid() is used in that patch.

Thanks,
-Takahiro AKASHI
Post by Pratyush Anand
Post by AKASHI Takahiro
#define pgd_present(pgd) (pgd_val(pgd))
static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..9c7adcce8e4e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -133,7 +133,8 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
* Set the contiguous bit for the subsequent group of PTEs if
* its size and alignment are appropriate.
*/
- if (((addr | PFN_PHYS(pfn)) & ~CONT_PTE_MASK) == 0) {
+ if ((pgprot_val(prot) & PTE_VALID) &&
+ (((addr | PFN_PHYS(pfn)) & ~CONT_PTE_MASK) == 0)) {
if (end - addr >= CONT_PTE_SIZE && !page_mappings_only)
__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
else
@@ -147,7 +148,8 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
* After the PTE entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));
+ BUG_ON(pte_valid(old_pte) && pte_valid(*pte) &&
+ !pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));
} while (pte++, addr += PAGE_SIZE, addr != end);
@@ -190,7 +192,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* Set the contiguous bit for the subsequent group of
* PMDs if its size and alignment are appropriate.
*/
- if (((addr | phys) & ~CONT_PMD_MASK) == 0) {
+ if ((pgprot_val(prot) | PMD_VALID) &&
+ ((addr | phys) & ~CONT_PMD_MASK) == 0) {
if (end - addr >= CONT_PMD_SIZE)
__prot = __pgprot(pgprot_val(prot) |
PTE_CONT);
@@ -203,7 +206,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* After the PMD entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
+ BUG_ON(pmd_valid(old_pmd) && pmd_valid(*pmd) &&
+ !pgattr_change_is_safe(pmd_val(old_pmd),
pmd_val(*pmd)));
} else {
alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
@@ -263,7 +267,8 @@ static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
* After the PUD entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
+ BUG_ON(pud_valid(old_pud) && pud_valid(*pud) &&
+ !pgattr_change_is_safe(pud_val(old_pud),
pud_val(*pud)));
} else {
alloc_init_pmd(pud, addr, next, phys, prot,
@@ -344,8 +349,8 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
pgd_pgtable_alloc, page_mappings_only);
}
-static void create_mapping_late(phys_addr_t phys, unsigned long virt,
- phys_addr_t size, pgprot_t prot)
+void create_mapping_late(phys_addr_t phys, unsigned long virt,
+ phys_addr_t size, pgprot_t prot)
{
if (virt < VMALLOC_START) {
pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
@@ -791,14 +796,20 @@ int __init arch_ioremap_pmd_supported(void)
int pud_set_huge(pud_t *pud, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PUD_MASK);
- set_pud(pud, __pud(phys | PUD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+ set_pud(pud, __pud(phys |
+ ((pgprot_val(prot) & PUD_VALID) ?
+ PUD_TYPE_SECT : 0) |
+ pgprot_val(mk_sect_prot(prot))));
return 1;
}
int pmd_set_huge(pmd_t *pmd, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PMD_MASK);
- set_pmd(pmd, __pmd(phys | PMD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+ set_pmd(pmd, __pmd(phys |
+ ((pgprot_val(prot) & PMD_VALID) ?
+ PMD_TYPE_SECT : 0) |
+ pgprot_val(mk_sect_prot(prot))));
return 1;
}
~Pratyush
James Morse
2017-01-25 15:49:56 UTC
Permalink
Hi Akashi,
Post by AKASHI Takahiro
The current implementation of create_mapping_late() is only allowed
to modify permission attributes (read-only or read-write) against
the existing kernel mapping.
In this patch, PAGE_KERNEL_INVALID protection attribute is introduced.
We will now be able to invalidate (or unmap) some part of the existing
Can we stick to calling this 'unmap', otherwise 'invalidate this page range'
becomes ambiguous, cache maintenance or page-table manipulation?!
Post by AKASHI Takahiro
kernel mapping by specifying PAGE_KERNEL_INVALID to create_mapping_late().
This feature will be used in a succeeding kdump patch to protect
the memory reserved for the crash dump kernel once it has been loaded.
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..9c7adcce8e4e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -190,7 +192,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* Set the contiguous bit for the subsequent group of
* PMDs if its size and alignment are appropriate.
*/
- if (((addr | phys) & ~CONT_PMD_MASK) == 0) {
+ if ((pgprot_val(prot) | PMD_VALID) &&
& PMD_VALID?
Post by AKASHI Takahiro
+ ((addr | phys) & ~CONT_PMD_MASK) == 0) {
if (end - addr >= CONT_PMD_SIZE)
__prot = __pgprot(pgprot_val(prot) |
PTE_CONT);
@@ -203,7 +206,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* After the PMD entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
+ BUG_ON(pmd_valid(old_pmd) && pmd_valid(*pmd) &&
+ !pgattr_change_is_safe(pmd_val(old_pmd),
pmd_val(*pmd)));
Could you add PTE_VALID checks to pgattr_change_is_safe() instead of at every
call? I think (old == 0 || new == 0) is probably doing something similar.
Post by AKASHI Takahiro
} else {
alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
@@ -791,14 +796,20 @@ int __init arch_ioremap_pmd_supported(void)
int pud_set_huge(pud_t *pud, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PUD_MASK);
- set_pud(pud, __pud(phys | PUD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+ set_pud(pud, __pud(phys |
+ ((pgprot_val(prot) & PUD_VALID) ?
+ PUD_TYPE_SECT : 0) |
+ pgprot_val(mk_sect_prot(prot))));
prot = mk_sect_prot(prot);
if (pgprot_val(prot) & PUD_VALID)
prot |= PUD_TYPE_SECT;
set_pud(pud, __pud(phys | pgprot_val(prot)));
Given you defined PUD_VALID as being PUD_TYPE_SECT, I think you can then reduce
Post by AKASHI Takahiro
set_pud(pud, __pud(phys | pgprot_val(mk_sect_prot(prot))));
It looks like mk_sect_prot() doesn't touch the valid bit so all this is doing is
clearing the table bit making it a section and keeping the valid bit from the
caller.


Thanks,

James
AKASHI Takahiro
2017-01-26 08:08:08 UTC
Permalink
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
The current implementation of create_mapping_late() is only allowed
to modify permission attributes (read-only or read-write) against
the existing kernel mapping.
In this patch, PAGE_KERNEL_INVALID protection attribute is introduced.
We will now be able to invalidate (or unmap) some part of the existing
Can we stick to calling this 'unmap', otherwise 'invalidate this page range'
becomes ambiguous, cache maintenance or page-table manipulation?!
Sure.
I hesitated to use "unmap" because we don't free any of descriptor table
pages here, but who notice the differences.
Post by James Morse
Post by AKASHI Takahiro
kernel mapping by specifying PAGE_KERNEL_INVALID to create_mapping_late().
This feature will be used in a succeeding kdump patch to protect
the memory reserved for the crash dump kernel once it has been loaded.
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..9c7adcce8e4e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -190,7 +192,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* Set the contiguous bit for the subsequent group of
* PMDs if its size and alignment are appropriate.
*/
- if (((addr | phys) & ~CONT_PMD_MASK) == 0) {
+ if ((pgprot_val(prot) | PMD_VALID) &&
& PMD_VALID?
Shame on me.
Post by James Morse
Post by AKASHI Takahiro
+ ((addr | phys) & ~CONT_PMD_MASK) == 0) {
if (end - addr >= CONT_PMD_SIZE)
__prot = __pgprot(pgprot_val(prot) |
PTE_CONT);
@@ -203,7 +206,8 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
* After the PMD entry has been populated once, we
* only allow updates to the permission attributes.
*/
- BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
+ BUG_ON(pmd_valid(old_pmd) && pmd_valid(*pmd) &&
+ !pgattr_change_is_safe(pmd_val(old_pmd),
pmd_val(*pmd)));
Could you add PTE_VALID checks to pgattr_change_is_safe() instead of at every
call? I think (old == 0 || new == 0) is probably doing something similar.
Good catch :)
Post by James Morse
Post by AKASHI Takahiro
} else {
alloc_init_pte(pmd, addr, next, __phys_to_pfn(phys),
@@ -791,14 +796,20 @@ int __init arch_ioremap_pmd_supported(void)
int pud_set_huge(pud_t *pud, phys_addr_t phys, pgprot_t prot)
{
BUG_ON(phys & ~PUD_MASK);
- set_pud(pud, __pud(phys | PUD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+ set_pud(pud, __pud(phys |
+ ((pgprot_val(prot) & PUD_VALID) ?
+ PUD_TYPE_SECT : 0) |
+ pgprot_val(mk_sect_prot(prot))));
prot = mk_sect_prot(prot);
if (pgprot_val(prot) & PUD_VALID)
prot |= PUD_TYPE_SECT;
set_pud(pud, __pud(phys | pgprot_val(prot)));
Yeah, I just wanted to keep it simple,
Post by James Morse
Given you defined PUD_VALID as being PUD_TYPE_SECT, I think you can then reduce
Post by AKASHI Takahiro
set_pud(pud, __pud(phys | pgprot_val(mk_sect_prot(prot))));
It looks like mk_sect_prot() doesn't touch the valid bit so all this is doing is
clearing the table bit making it a section and keeping the valid bit from the
caller.
but this seems overly optimized because this change would be applicable
even without my patch. The intention of the original code would be
to make sure the entry is a pud-level descriptor whatever "prot" is given.

-Takahiro AKASHI
Post by James Morse
Thanks,
James
AKASHI Takahiro
2017-01-24 08:50:00 UTC
Permalink
The primary kernel calls machine_crash_shutdown() to shut down non-boot CPUs
and save registers' status in per-cpu ELF notes before starting the crash
dump kernel; see kernel_kexec(). Even if not all secondary CPUs have shut
down, we do kdump anyway.

As we don't have to take non-boot (crashed) CPUs offline (in order to
preserve the correct status of CPUs at the time of crash) before shutting
them down, this patch also adds a variant of smp_send_stop().
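
For reference, the call sequence on the panicking CPU is roughly the
following (simplified from __crash_kexec() in kernel/kexec_core.c):

    crash_setup_regs(&fixed_regs, regs);
    crash_save_vmcoreinfo();
    machine_crash_shutdown(&fixed_regs);    /* implemented by this patch */
    machine_kexec(kexec_crash_image);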

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 +++++++++++++++++++++++++-
arch/arm64/include/asm/smp.h | 2 ++
arch/arm64/kernel/machine_kexec.c | 55 +++++++++++++++++++++++++++++++---
arch/arm64/kernel/smp.c | 63 +++++++++++++++++++++++++++++++++++++++
5 files changed, 158 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/hardirq.h b/arch/arm64/include/asm/hardirq.h
index 8740297dac77..1473fc2f7ab7 100644
--- a/arch/arm64/include/asm/hardirq.h
+++ b/arch/arm64/include/asm/hardirq.h
@@ -20,7 +20,7 @@
#include <linux/threads.h>
#include <asm/irq.h>

-#define NR_IPI 6
+#define NR_IPI 7

typedef struct {
unsigned int __softirq_pending;
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 04744dc5fb61..f40ace1fa21a 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -40,7 +40,47 @@
static inline void crash_setup_regs(struct pt_regs *newregs,
struct pt_regs *oldregs)
{
- /* Empty routine needed to avoid build errors. */
+ if (oldregs) {
+ memcpy(newregs, oldregs, sizeof(*newregs));
+ } else {
+ u64 tmp1, tmp2;
+
+ __asm__ __volatile__ (
+ "stp x0, x1, [%2, #16 * 0]\n"
+ "stp x2, x3, [%2, #16 * 1]\n"
+ "stp x4, x5, [%2, #16 * 2]\n"
+ "stp x6, x7, [%2, #16 * 3]\n"
+ "stp x8, x9, [%2, #16 * 4]\n"
+ "stp x10, x11, [%2, #16 * 5]\n"
+ "stp x12, x13, [%2, #16 * 6]\n"
+ "stp x14, x15, [%2, #16 * 7]\n"
+ "stp x16, x17, [%2, #16 * 8]\n"
+ "stp x18, x19, [%2, #16 * 9]\n"
+ "stp x20, x21, [%2, #16 * 10]\n"
+ "stp x22, x23, [%2, #16 * 11]\n"
+ "stp x24, x25, [%2, #16 * 12]\n"
+ "stp x26, x27, [%2, #16 * 13]\n"
+ "stp x28, x29, [%2, #16 * 14]\n"
+ "mov %0, sp\n"
+ "stp x30, %0, [%2, #16 * 15]\n"
+
+ "/* faked current PSTATE */\n"
+ "mrs %0, CurrentEL\n"
+ "mrs %1, SPSEL\n"
+ "orr %0, %0, %1\n"
+ "mrs %1, DAIF\n"
+ "orr %0, %0, %1\n"
+ "mrs %1, NZCV\n"
+ "orr %0, %0, %1\n"
+ /* pc */
+ "adr %1, 1f\n"
+ "1:\n"
+ "stp %1, %0, [%2, #16 * 16]\n"
+ : "=&r" (tmp1), "=&r" (tmp2)
+ : "r" (newregs)
+ : "memory"
+ );
+ }
}

#endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index d050d720a1b4..cea009f2657d 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -148,6 +148,8 @@ static inline void cpu_panic_kernel(void)
*/
bool cpus_are_stuck_in_kernel(void);

+extern void smp_send_crash_stop(void);
+
#endif /* ifndef __ASSEMBLY__ */

#endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index f7938fecf3ff..d56ea8c805a8 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -9,6 +9,9 @@
* published by the Free Software Foundation.
*/

+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
#include <linux/kexec.h>
#include <linux/smp.h>

@@ -161,7 +164,8 @@ void machine_kexec(struct kimage *kimage)
/*
* New cpus may have become stuck_in_kernel after we loaded the image.
*/
- BUG_ON(cpus_are_stuck_in_kernel() || (num_online_cpus() > 1));
+ BUG_ON((cpus_are_stuck_in_kernel() || (num_online_cpus() > 1)) &&
+ !WARN_ON(kimage == kexec_crash_image));

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);

@@ -200,15 +204,58 @@ void machine_kexec(struct kimage *kimage)
* relocation is complete.
*/

- cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
- kimage->start, 0);
+ cpu_soft_restart(kimage != kexec_crash_image,
+ reboot_code_buffer_phys, kimage->head, kimage->start, 0);

BUG(); /* Should never get here. */
}

+static void machine_kexec_mask_interrupts(void)
+{
+ unsigned int i;
+ struct irq_desc *desc;
+
+ for_each_irq_desc(i, desc) {
+ struct irq_chip *chip;
+ int ret;
+
+ chip = irq_desc_get_chip(desc);
+ if (!chip)
+ continue;
+
+ /*
+ * First try to remove the active state. If this
+ * fails, try to EOI the interrupt.
+ */
+ ret = irq_set_irqchip_state(i, IRQCHIP_STATE_ACTIVE, false);
+
+ if (ret && irqd_irq_inprogress(&desc->irq_data) &&
+ chip->irq_eoi)
+ chip->irq_eoi(&desc->irq_data);
+
+ if (chip->irq_mask)
+ chip->irq_mask(&desc->irq_data);
+
+ if (chip->irq_disable && !irqd_irq_disabled(&desc->irq_data))
+ chip->irq_disable(&desc->irq_data);
+ }
+}
+
+/**
+ * machine_crash_shutdown - shutdown non-crashing cpus and save registers
+ */
void machine_crash_shutdown(struct pt_regs *regs)
{
- /* Empty routine needed to avoid build errors. */
+ local_irq_disable();
+
+ /* shutdown non-crashing cpus */
+ smp_send_crash_stop();
+
+ /* for crashing cpu */
+ crash_save_cpu(regs, smp_processor_id());
+ machine_kexec_mask_interrupts();
+
+ pr_info("Starting crashdump kernel...\n");
}

void arch_kexec_protect_crashkres(void)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index cb87234cfcf2..446c6d48f8ec 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -37,6 +37,7 @@
#include <linux/completion.h>
#include <linux/of.h>
#include <linux/irq_work.h>
+#include <linux/kexec.h>

#include <asm/alternative.h>
#include <asm/atomic.h>
@@ -74,6 +75,7 @@ enum ipi_msg_type {
IPI_RESCHEDULE,
IPI_CALL_FUNC,
IPI_CPU_STOP,
+ IPI_CPU_CRASH_STOP,
IPI_TIMER,
IPI_IRQ_WORK,
IPI_WAKEUP
@@ -753,6 +755,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
S(IPI_RESCHEDULE, "Rescheduling interrupts"),
S(IPI_CALL_FUNC, "Function call interrupts"),
S(IPI_CPU_STOP, "CPU stop interrupts"),
+ S(IPI_CPU_CRASH_STOP, "CPU stop (for crash dump) interrupts"),
S(IPI_TIMER, "Timer broadcast interrupts"),
S(IPI_IRQ_WORK, "IRQ work interrupts"),
S(IPI_WAKEUP, "CPU wake-up interrupts"),
@@ -827,6 +830,29 @@ static void ipi_cpu_stop(unsigned int cpu)
cpu_relax();
}

+#ifdef CONFIG_KEXEC_CORE
+static atomic_t waiting_for_crash_ipi;
+#endif
+
+static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
+{
+#ifdef CONFIG_KEXEC_CORE
+ crash_save_cpu(regs, cpu);
+
+ atomic_dec(&waiting_for_crash_ipi);
+
+ local_irq_disable();
+
+#ifdef CONFIG_HOTPLUG_CPU
+ if (cpu_ops[cpu]->cpu_die)
+ cpu_ops[cpu]->cpu_die(cpu);
+#endif
+
+ /* just in case */
+ cpu_park_loop();
+#endif
+}
+
/*
* Main handler for inter-processor interrupts
*/
@@ -857,6 +883,15 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
irq_exit();
break;

+ case IPI_CPU_CRASH_STOP:
+ if (IS_ENABLED(CONFIG_KEXEC_CORE)) {
+ irq_enter();
+ ipi_cpu_crash_stop(cpu, regs);
+
+ unreachable();
+ }
+ break;
+
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
case IPI_TIMER:
irq_enter();
@@ -929,6 +964,34 @@ void smp_send_stop(void)
cpumask_pr_args(cpu_online_mask));
}

+#ifdef CONFIG_KEXEC_CORE
+void smp_send_crash_stop(void)
+{
+ cpumask_t mask;
+ unsigned long timeout;
+
+ if (num_online_cpus() == 1)
+ return;
+
+ cpumask_copy(&mask, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &mask);
+
+ atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
+
+ pr_crit("SMP: stopping secondary CPUs\n");
+ smp_cross_call(&mask, IPI_CPU_CRASH_STOP);
+
+ /* Wait up to one second for other CPUs to stop */
+ timeout = USEC_PER_SEC;
+ while ((atomic_read(&waiting_for_crash_ipi) > 0) && timeout--)
+ udelay(1);
+
+ if (atomic_read(&waiting_for_crash_ipi) > 0)
+ pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
+ cpumask_pr_args(cpu_online_mask));
+}
+#endif
+
/*
* not supported here
*/
--
2.11.0
AKASHI Takahiro
2017-01-24 08:50:01 UTC
Permalink
In addition to the common VMCOREINFO entries defined in
crash_save_vmcoreinfo_init(), the crash utility needs to know
- kimage_voffset
- PHYS_OFFSET
to examine the contents of a dump file (/proc/vmcore) correctly,
due to the introduction of KASLR (CONFIG_RANDOMIZE_BASE) in v4.6.

- VA_BITS
is also required by the makedumpfile command.

arch_crash_save_vmcoreinfo() appends these values to the dump file.
More VMCOREINFO entries may be added later.
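
With this patch, the appended part of the vmcoreinfo note will contain
lines like the following (the values are illustrative only):

    NUMBER(VA_BITS)=48
    NUMBER(kimage_voffset)=0xffffff8008000000
    NUMBER(PHYS_OFFSET)=0x40000000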

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/kernel/machine_kexec.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index d56ea8c805a8..84c5761af336 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -17,6 +17,7 @@

#include <asm/cacheflush.h>
#include <asm/cpu_ops.h>
+#include <asm/memory.h>
#include <asm/mmu.h>
#include <asm/mmu_context.h>

@@ -275,3 +276,13 @@ void arch_kexec_unprotect_crashkres(void)

flush_tlb_all();
}
+
+void arch_crash_save_vmcoreinfo(void)
+{
+ VMCOREINFO_NUMBER(VA_BITS);
+ /* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
+ vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
+ kimage_voffset);
+ vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
+ PHYS_OFFSET);
+}
--
2.11.0
AKASHI Takahiro
2017-01-24 08:50:02 UTC
Permalink
Add arch-specific functions to provide a dump file, /proc/vmcore.

This file is in ELF format and its ELF header needs to be prepared by
userspace tools, like kexec-tools, in advance. The primary kernel is
responsible for allocating the region with reserve_elfcorehdr() at boot time
and advertising its location to the crash dump kernel via a new device-tree
property, "linux,elfcorehdr".

Then the crash dump kernel will access the primary kernel's memory with
copy_oldmem_page(), which feeds the data page-by-page by memremap'ing it,
since that memory does not reside in the linear mapping of the crash dump
kernel.

We also need our own elfcorehdr_read() here since the header is placed
within the crash dump kernel's usable memory.
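
For illustration only, the property added by kexec-tools would look roughly
like this (made-up address and size, 2 address cells / 2 size cells):

    chosen {
        linux,elfcorehdr = <0x0 0xdfe00000 0x0 0x10000>;
    };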

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/Kconfig | 11 +++++++
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 ++++++++++++++++++++++++++++++++++++++++++
arch/arm64/mm/init.c | 53 +++++++++++++++++++++++++++++++
4 files changed, 136 insertions(+)
create mode 100644 arch/arm64/kernel/crash_dump.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 111742126897..2bd6a1a062b9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -693,6 +693,17 @@ config KEXEC
but it is independent of the system firmware. And like a reboot
you can start any kernel with it, not just Linux.

+config CRASH_DUMP
+ bool "Build kdump crash kernel"
+ help
+ Generate crash dump after being started by kexec. This should
+ be normally only set in special crash dump kernels which are
+ loaded in the main kernel with kexec-tools into a specially
+ reserved region and then later executed after a crash by
+ kdump/kexec.
+
+ For more details see Documentation/kdump/kdump.txt
+
config XEN_DOM0
def_bool y
depends on XEN
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 7d66bbaafc0c..6a7384eee08d 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -50,6 +50,7 @@ arm64-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
arm64-obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
arm64-obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o \
cpu-reset.o
+arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o

obj-y += $(arm64-obj-y) vdso/ probes/
obj-m += $(arm64-obj-m)
diff --git a/arch/arm64/kernel/crash_dump.c b/arch/arm64/kernel/crash_dump.c
new file mode 100644
index 000000000000..c3d5a21c081e
--- /dev/null
+++ b/arch/arm64/kernel/crash_dump.c
@@ -0,0 +1,71 @@
+/*
+ * Routines for doing kexec-based kdump
+ *
+ * Copyright (C) 2014 Linaro Limited
+ * Author: AKASHI Takahiro <***@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/crash_dump.h>
+#include <linux/errno.h>
+#include <linux/io.h>
+#include <linux/memblock.h>
+#include <linux/uaccess.h>
+#include <asm/memory.h>
+
+/**
+ * copy_oldmem_page() - copy one page from old kernel memory
+ * @pfn: page frame number to be copied
+ * @buf: buffer where the copied page is placed
+ * @csize: number of bytes to copy
+ * @offset: offset in bytes into the page
+ * @userbuf: if set, @buf is in a user address space
+ *
+ * This function copies one page from old kernel memory into buffer pointed by
+ * @buf. If @buf is in userspace, set @userbuf to %1. Returns number of bytes
+ * copied or negative error in case of failure.
+ */
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
+ size_t csize, unsigned long offset,
+ int userbuf)
+{
+ void *vaddr;
+
+ if (!csize)
+ return 0;
+
+ vaddr = memremap(__pfn_to_phys(pfn), PAGE_SIZE, MEMREMAP_WB);
+ if (!vaddr)
+ return -ENOMEM;
+
+ if (userbuf) {
+ if (copy_to_user((char __user *)buf, vaddr + offset, csize)) {
+ memunmap(vaddr);
+ return -EFAULT;
+ }
+ } else {
+ memcpy(buf, vaddr + offset, csize);
+ }
+
+ memunmap(vaddr);
+
+ return csize;
+}
+
+/**
+ * elfcorehdr_read - read from ELF core header
+ * @buf: buffer where the data is placed
+ * @count: number of bytes to read
+ * @ppos: address in the memory
+ *
+ * This function reads @count bytes from the elf core header, which resides
+ * in the crash dump kernel's memory.
+ */
+ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+ memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
+ return count;
+}
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 2aba75dc7720..323b87197e18 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -38,6 +38,7 @@
#include <linux/swiotlb.h>
#include <linux/vmalloc.h>
#include <linux/kexec.h>
+#include <linux/crash_dump.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -135,6 +136,56 @@ static void __init reserve_crashkernel(void)
}
#endif /* CONFIG_KEXEC_CORE */

+#ifdef CONFIG_CRASH_DUMP
+static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ const __be32 *reg;
+ int len;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,elfcorehdr", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ elfcorehdr_addr = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ elfcorehdr_size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves the elf core header described by the
+ * "linux,elfcorehdr" device-tree property. This region contains all the
+ * information about the primary kernel's core image and is used by a dump
+ * capture kernel to access the system memory of the primary kernel.
+ */
+static void __init reserve_elfcorehdr(void)
+{
+ of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);
+
+ if (!elfcorehdr_size)
+ return;
+
+ if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
+ pr_warn("elfcorehdr is overlapped\n");
+ return;
+ }
+
+ memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
+
+ pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
+ elfcorehdr_size >> 10, elfcorehdr_addr);
+}
+#else
+static void __init reserve_elfcorehdr(void)
+{
+}
+#endif /* CONFIG_CRASH_DUMP */
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -393,6 +444,8 @@ void __init arm64_memblock_init(void)

reserve_crashkernel();

+ reserve_elfcorehdr();
+
dma_contiguous_reserve(arm64_dma_phys_limit);

memblock_allow_resize();
--
2.11.0
AKASHI Takahiro
2017-01-24 08:50:03 UTC
Permalink
Kdump is enabled by default as kexec is.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/configs/defconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 33b744d54739..94c2ea523a8a 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -79,6 +79,7 @@ CONFIG_CMA=y
CONFIG_SECCOMP=y
CONFIG_XEN=y
CONFIG_KEXEC=y
+CONFIG_CRASH_DUMP=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_COMPAT=y
CONFIG_CPU_IDLE=y
--
2.11.0
AKASHI Takahiro
2017-01-24 08:50:04 UTC
Permalink
Add arch-specific descriptions about kdump usage on arm64 to kdump.txt.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Baoquan He <***@redhat.com>
Acked-by: Dave Young <***@redhat.com>
Acked-by: Catalin Marinas <***@arm.com>
---
Documentation/kdump/kdump.txt | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index b0eb27b956d9..615434d81108 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
a remote system.

Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
-s390x and arm architectures.
+s390x, arm and arm64 architectures.

When the system kernel boots, it reserves a small section of memory for
the dump-capture kernel. This ensures that ongoing Direct Memory Access
@@ -249,6 +249,13 @@ Dump-capture kernel config options (Arch Dependent, arm)

AUTO_ZRELADDR=y

+Dump-capture kernel config options (Arch Dependent, arm64)
+----------------------------------------------------------
+
+- Please note that kvm of the dump-capture kernel will not be enabled
+ on non-VHE systems even if it is configured. This is because the CPU
+ will not be reset to EL2 on panic.
+
Extended crashkernel syntax
===========================

@@ -305,6 +312,8 @@ Boot into System Kernel
kernel will automatically locate the crash kernel image within the
first 512MB of RAM if X is not given.

+ On arm64, use "crashkernel=Y[@X]". Note that the start address of
+ the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).

Load the Dump-capture Kernel
============================
@@ -327,6 +336,8 @@ For s390x:
- Use image or bzImage
For arm:
- Use zImage
+For arm64:
+ - Use vmlinux or Image

If you are using a uncompressed vmlinux image then use following command
to load dump-capture kernel.
@@ -370,6 +381,9 @@ For s390x:
For arm:
"1 maxcpus=1 reset_devices"

+For arm64:
+ "1 maxcpus=1 reset_devices"
+
Notes on loading the dump-capture kernel:

* By default, the ELF headers are stored in ELF64 format to support
--
2.11.0
AKASHI Takahiro
2017-01-24 08:53:20 UTC
Permalink
From: James Morse <***@arm.com>

Add documentation for DT properties:
linux,usable-memory-range
linux,elfcorehdr
used by arm64 kdump. These describe the usable memory range for the crash
dump kernel and the elfcorehdr's location within it, respectively.

Signed-off-by: James Morse <***@arm.com>
[***@linaro.org: added "linux,crashkernel-base" and "-size" ]
Signed-off-by: AKASHI Takahiro <***@linaro.org>
Acked-by: Mark Rutland <***@arm.com>
Cc: ***@vger.kernel.org
Cc: Rob Herring <robh+***@kernel.org>
---
Documentation/devicetree/bindings/chosen.txt | 37 ++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt
index 6ae9d82d4c37..8dc82431acc1 100644
--- a/Documentation/devicetree/bindings/chosen.txt
+++ b/Documentation/devicetree/bindings/chosen.txt
@@ -52,3 +52,40 @@ This property is set (currently only on PowerPC, and only needed on
book3e) by some versions of kexec-tools to tell the new kernel that it
is being booted by kexec, as the booting environment may differ (e.g.
a different secondary CPU release mechanism)
+
+linux,usable-memory-range
+-------------------------
+
+This property (arm64 only) holds a base address and size, describing a
+limited region in which memory may be considered available for use by
+the kernel. Memory outside of this range is not available for use.
+
+This property describes a limitation: memory within this range is only
+valid when also described through another mechanism that the kernel
+would otherwise use to determine available memory (e.g. memory nodes
+or the EFI memory map). Valid memory may be sparse within the range.
+e.g.
+
+/ {
+ chosen {
+ linux,usable-memory-range = <0x9 0xf0000000 0x0 0x10000000>;
+ };
+};
+
+The main usage is for crash dump kernel to identify its own usable
+memory and exclude, at its boot time, any other memory areas that are
+part of the panicked kernel's memory.
+
+linux,elfcorehdr
+----------------
+
+This property (currently used only on arm64) holds the memory range,
+the address and the size, of the elf core header which mainly describes
+the panicked kernel's memory layout as PT_LOAD segments of elf format.
+e.g.
+
+/ {
+ chosen {
+ linux,elfcorehdr = <0x9 0xfffff000 0x0 0x800>;
+ };
+};
--
2.11.0
AKASHI Takahiro
2017-01-24 08:49:59 UTC
Permalink
To protect the memory reserved for the crash dump kernel once it has been
loaded, arch_kexec_protect_crashkres()/unprotect_crashkres() are meant to
deal with the permissions of the corresponding kernel mappings.

We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.

Note that the region must reside in linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
arch/arm64/kernel/machine_kexec.c | 68 +++++++++++++++++++++++++--------------
arch/arm64/mm/mmu.c | 34 ++++++++++++++++++++
2 files changed, 77 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..f7938fecf3ff 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -14,6 +14,7 @@

#include <asm/cacheflush.h>
#include <asm/cpu_ops.h>
+#include <asm/mmu.h>
#include <asm/mmu_context.h>

#include "cpu-reset.h"
@@ -22,8 +23,6 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;

-static unsigned long kimage_start;
-
/**
* kexec_image_info - For debugging output.
*/
@@ -64,7 +63,7 @@ void machine_kexec_cleanup(struct kimage *kimage)
*/
int machine_kexec_prepare(struct kimage *kimage)
{
- kimage_start = kimage->start;
+ void *reboot_code_buffer;

kexec_image_info(kimage);

@@ -73,6 +72,21 @@ int machine_kexec_prepare(struct kimage *kimage)
return -EBUSY;
}

+ reboot_code_buffer =
+ phys_to_virt(page_to_phys(kimage->control_code_page));
+
+ /*
+ * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
+ * after the kernel is shut down.
+ */
+ memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
+ arm64_relocate_new_kernel_size);
+
+ /* Flush the reboot_code_buffer in preparation for its execution. */
+ __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
+ flush_icache_range((uintptr_t)reboot_code_buffer,
+ arm64_relocate_new_kernel_size);
+
return 0;
}

@@ -143,7 +157,6 @@ static void kexec_segment_flush(const struct kimage *kimage)
void machine_kexec(struct kimage *kimage)
{
phys_addr_t reboot_code_buffer_phys;
- void *reboot_code_buffer;

/*
* New cpus may have become stuck_in_kernel after we loaded the image.
@@ -151,7 +164,6 @@ void machine_kexec(struct kimage *kimage)
BUG_ON(cpus_are_stuck_in_kernel() || (num_online_cpus() > 1));

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
- reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);

kexec_image_info(kimage);

@@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
kimage->control_code_page);
pr_debug("%s:%d: reboot_code_buffer_phys: %pa\n", __func__, __LINE__,
&reboot_code_buffer_phys);
- pr_debug("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
- reboot_code_buffer);
pr_debug("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
arm64_relocate_new_kernel);
pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
__func__, __LINE__, arm64_relocate_new_kernel_size,
arm64_relocate_new_kernel_size);

- /*
- * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
- * after the kernel is shut down.
- */
- memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
- arm64_relocate_new_kernel_size);
-
- /* Flush the reboot_code_buffer in preparation for its execution. */
- __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
- flush_icache_range((uintptr_t)reboot_code_buffer,
- arm64_relocate_new_kernel_size);
-
- /* Flush the kimage list and its buffers. */
- kexec_list_flush(kimage);
+ if (kimage != kexec_crash_image) {
+ /* Flush the kimage list and its buffers. */
+ kexec_list_flush(kimage);

- /* Flush the new image if already in place. */
- if (kimage->head & IND_DONE)
- kexec_segment_flush(kimage);
+ /* Flush the new image if already in place. */
+ if (kimage->head & IND_DONE)
+ kexec_segment_flush(kimage);
+ }

pr_info("Bye!\n");

@@ -201,7 +201,7 @@ void machine_kexec(struct kimage *kimage)
*/

cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
- kimage_start, 0);
+ kimage->start, 0);

BUG(); /* Should never get here. */
}
@@ -210,3 +210,21 @@ void machine_crash_shutdown(struct pt_regs *regs)
{
/* Empty routine needed to avoid build errors. */
}
+
+void arch_kexec_protect_crashkres(void)
+{
+ kexec_segment_flush(kexec_crash_image);
+
+ create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL_INVALID);
+
+ flush_tlb_all();
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+ create_mapping_late(crashk_res.start, __phys_to_virt(crashk_res.start),
+ resource_size(&crashk_res), PAGE_KERNEL);
+
+ flush_tlb_all();
+}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9c7adcce8e4e..2d4a0b68a852 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
@@ -367,6 +368,39 @@ static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end
unsigned long kernel_start = __pa(_text);
unsigned long kernel_end = __pa(__init_begin);

+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * While crash dump kernel memory is contained in a single memblock
+ * for now, it should appear in an isolated mapping so that we can
+ * independently unmap the region later.
+ */
+ if (crashk_res.end && crashk_res.start >= start &&
+ crashk_res.end <= end) {
+ if (crashk_res.start != start)
+ __create_pgd_mapping(pgd, start, __phys_to_virt(start),
+ crashk_res.start - start,
+ PAGE_KERNEL,
+ early_pgtable_alloc,
+ debug_pagealloc_enabled());
+
+ /* before kexec_load(), the region can be read-writable. */
+ __create_pgd_mapping(pgd, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ crashk_res.end - crashk_res.start + 1,
+ PAGE_KERNEL, early_pgtable_alloc,
+ debug_pagealloc_enabled());
+
+ if (crashk_res.end != end)
+ __create_pgd_mapping(pgd, crashk_res.end + 1,
+ __phys_to_virt(crashk_res.end + 1),
+ end - crashk_res.end - 1,
+ PAGE_KERNEL,
+ early_pgtable_alloc,
+ debug_pagealloc_enabled());
+ return;
+ }
+#endif
+
/*
* Take care not to create a writable alias for the
* read-only text and rodata sections of the kernel image.
--
2.11.0
James Morse
2017-01-25 17:37:38 UTC
Permalink
Hi Akashi,
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.
Note that the region must reside in linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.
Nasty! Presumably you have to build the crash region out of individual page
mappings so that they can be returned to the slab-allocator one page at a time,
and still be able to set/clear the valid bits on the remaining chunk.
(I don't see how that happens in this patch)

debug_pagealloc has to do this too so it can flip the valid bits one page at a
time. You could change the debug_pagealloc_enabled() value passed in at the top
__create_pgd_mapping() level to be a needs_per_page_mapping(addr, size) test
that happens as we build the linear map. (This would save the 3 extra calls to
__create_pgd_mapping() in __map_memblock())
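A rough, untested sketch of what such a helper might look like (the name
comes from the suggestion above; checking against crashk_res is my guess at
what "take kdump into account" would mean):

static bool needs_per_page_mapping(phys_addr_t addr, phys_addr_t size)
{
	/* debug_pagealloc needs to flip the valid bit one page at a time. */
	if (debug_pagealloc_enabled())
		return true;

#ifdef CONFIG_KEXEC_CORE
	/* the crash kernel region may have its permissions changed later */
	if (crashk_res.end && addr <= crashk_res.end &&
	    addr + size > crashk_res.start)
		return true;
#endif

	return false;
}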

I'm glad to see you can't resize the region if a crash kernel is loaded!

This secret unmapping is the sort of thing that breaks hibernate: it blindly
assumes pfn_valid() means it can access the page if it wants to. Setting
PG_Reserved is a quick way to trick it out of doing this, but that would leave
the crash kernel region un-initialised after resume, while kexec_crash_image
still has a value.
I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
arguing these are mutually-exclusive features, and the protect crash-dump
feature exists to prevent things like hibernate corrupting the crash region.
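For illustration only, an untested sketch of that policy; where exactly the
check would live is an open question, kernel/power/hibernate.c's hibernate()
being one candidate:

	/* refuse to hibernate while a crash kernel is loaded */
	if (kexec_crash_loaded()) {
		pr_err("PM: hibernation is not safe while a crash kernel is loaded\n");
		return -EBUSY;
	}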
Post by AKASHI Takahiro
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..f7938fecf3ff 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
kimage->control_code_page);
pr_debug("%s:%d: reboot_code_buffer_phys: %pa\n", __func__, __LINE__,
&reboot_code_buffer_phys);
- pr_debug("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
- reboot_code_buffer);
pr_debug("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
arm64_relocate_new_kernel);
pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
__func__, __LINE__, arm64_relocate_new_kernel_size,
arm64_relocate_new_kernel_size);
- /*
- * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
- * after the kernel is shut down.
- */
- memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
- arm64_relocate_new_kernel_size);
-
- /* Flush the reboot_code_buffer in preparation for its execution. */
- __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
- flush_icache_range((uintptr_t)reboot_code_buffer,
- arm64_relocate_new_kernel_size);
- /* Flush the kimage list and its buffers. */
- kexec_list_flush(kimage);
+ if (kimage != kexec_crash_image) {
+ /* Flush the kimage list and its buffers. */
+ kexec_list_flush(kimage);
- /* Flush the new image if already in place. */
- if (kimage->head & IND_DONE)
- kexec_segment_flush(kimage);
+ /* Flush the new image if already in place. */
+ if (kimage->head & IND_DONE)
+ kexec_segment_flush(kimage);
+ }
So for kdump we cleaned the kimage->segment[i].mem regions in
arch_kexec_protect_crashkres(), so don't need to do it here.

What about the kimage->head[i] array of list entries that were cleaned by
kexec_list_flush()? Now we don't clean that for kdump either, but we do pass it
Post by AKASHI Takahiro
cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head, kimage_start, 0);
Can we test the IND_DONE_BIT of kimage->head, so that we know that
arm64_relocate_new_kernel() won't try to walk the unclean list?
Alternatively we could call kexec_list_flush() in arch_kexec_protect_crashkres()
too.



Thanks,

James
AKASHI Takahiro
2017-01-26 11:28:12 UTC
Permalink
James,

I will try to revisit your comments later, but quick replies now
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.
Note that the region must reside in linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.
Nasty! Presumably you have to build the crash region out of individual page
mappings,
This might be an alternative, but
Post by James Morse
so that they can be returned to the slab-allocator one page at a time,
and still be able to set/clear the valid bits on the remaining chunk.
(I don't see how that happens in this patch)
As far as shrinking feature is concerned, I believe, crash_shrink_memory(),
which eventually calls free_reserved_page(), will take care of all the things
to do. I can see increased number of "MemFree" in /proc/meminfo.
(Please note that the region is memblock_reserve()'d at boot time.)
Post by James Morse
debug_pagealloc has to do this too so it can flip the valid bits one page at a
time. You could change the debug_pagealloc_enabled() value passed in at the top
__create_pgd_mapping() level to be a needs_per_page_mapping(addr, size) test
that happens as we build the linear map. (This would save the 3 extra calls to
__create_pgd_mapping() in __map_memblock())
I'm glad to see you can't resize the region if a crash kernel is loaded!
This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
assumes pfn_valid() means it can access the page if it wants to. Setting
PG_Reserved is a quick way to trick it out of doing this, but that would leave
the crash kernel region un-initialised after resume, while kexec_crash_image
still has a value.
Ouch, I didn't notice this issue.
Post by James Morse
I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
arguing these are mutually-exclusive features, and the protect crash-dump
feature exists to prevent things like hibernate corrupting the crash region.
This restriction is really painful.
Is there any hibernation hook that will be invoked before suspending and
after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres()
will be able to be called.

Or if "read-only (without unmapping)" approach would be acceptable,
those two features might be no longer mutually-exclusive.
Post by James Morse
Post by AKASHI Takahiro
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..f7938fecf3ff 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
kimage->control_code_page);
pr_debug("%s:%d: reboot_code_buffer_phys: %pa\n", __func__, __LINE__,
&reboot_code_buffer_phys);
- pr_debug("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
- reboot_code_buffer);
pr_debug("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
arm64_relocate_new_kernel);
pr_debug("%s:%d: relocate_new_kernel_size: 0x%lx(%lu) bytes\n",
__func__, __LINE__, arm64_relocate_new_kernel_size,
arm64_relocate_new_kernel_size);
- /*
- * Copy arm64_relocate_new_kernel to the reboot_code_buffer for use
- * after the kernel is shut down.
- */
- memcpy(reboot_code_buffer, arm64_relocate_new_kernel,
- arm64_relocate_new_kernel_size);
-
- /* Flush the reboot_code_buffer in preparation for its execution. */
- __flush_dcache_area(reboot_code_buffer, arm64_relocate_new_kernel_size);
- flush_icache_range((uintptr_t)reboot_code_buffer,
- arm64_relocate_new_kernel_size);
- /* Flush the kimage list and its buffers. */
- kexec_list_flush(kimage);
+ if (kimage != kexec_crash_image) {
+ /* Flush the kimage list and its buffers. */
+ kexec_list_flush(kimage);
- /* Flush the new image if already in place. */
- if (kimage->head & IND_DONE)
- kexec_segment_flush(kimage);
+ /* Flush the new image if already in place. */
+ if (kimage->head & IND_DONE)
+ kexec_segment_flush(kimage);
+ }
So for kdump we cleaned the kimage->segment[i].mem regions in
arch_kexec_protect_crashkres(), so don't need to do it here.
Correct.
Post by James Morse
What about the kimage->head[i] array of list entries that were cleaned by
kexec_list_flush()? Now we don't clean that for kdump either, but we do pass it
Post by AKASHI Takahiro
cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head, kimage_start, 0);
Kimage->head holds a list of memory regions that overlap between the
primary kernel and the secondary kernel, but in the kdump case, the whole
memory is isolated and the list should be empty.

That is why kexec_list_flush() is skipped here, but yes,
"kimage->head" might better be cleaned anyway.
(I believe, from the past discussions, that cache coherency is still
maintained in kdump case though.)
Post by James Morse
Can we test the IND_DONE_BIT of kimage->head, so that we know that
arm64_relocate_new_kernel() won't try to walk the unclean list?
Alternatively we could call kexec_list_flush() in arch_kexec_protect_crashkres()
too.
So call kexec_list_flush() in machine_kexec() either in kexec or kdump.
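That is, something like this (untested) in machine_kexec():

	/*
	 * Flush the kimage list and its buffers unconditionally; in the
	 * kdump case the list only carries IND_DONE, but it still has to
	 * be clean to the PoC before the MMU and caches are turned off.
	 */
	kexec_list_flush(kimage);

	/* Flush the new image if already in place. */
	if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
		kexec_segment_flush(kimage);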

Thanks,
-Takahiro AKASHI
Post by James Morse
Thanks,
James
James Morse
2017-01-27 11:19:32 UTC
Permalink
Hi Akashi,
Post by AKASHI Takahiro
Post by James Morse
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.
Note that the region must reside in linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.
Nasty! Presumably you have to build the crash region out of individual page
mappings,
This might be an alternative, but
Post by James Morse
so that they can be returned to the slab-allocator one page at a time,
and still be able to set/clear the valid bits on the remaining chunk.
(I don't see how that happens in this patch)
As far as shrinking feature is concerned, I believe, crash_shrink_memory(),
which eventually calls free_reserved_page(), will take care of all the things
to do. I can see increased number of "MemFree" in /proc/meminfo.
Except for arch-specific stuff like reformatting the page tables. Maybe I've
overlooked the way out of this. What happens with this scenario:

We boot with crashkernel=1G on the commandline.
Memblock_reserve allocates a naturally aligned 1GB block of memory for the crash
region.
Your code in __map_memblock() calls __create_pgd_mapping() ->
alloc_init_pud() which decides use_1G_block() looks like a good idea.

Some time later, the user decides to free half of this region,
free_reserved_page() does its thing and half of those struct page's now belong
to the memory allocator.

Now we load a kdump kernel, which causes arch_kexec_protect_crashkres() to be
called for the 512MB region that was left.

create_mapping_late() needs to split the 1GB mapping it originally made into a
smaller table, with the first half using PAGE_KERNEL_INVALID, and the second
half using PAGE_KERNEL. It can't do break-before-make because these pages may be
in-use by another CPU because we gave them back to the memory allocator. (in the
worst-possible world, that second half contains our stack!)


Making this behave more like debug_pagealloc where the region is only built of
page-size mappings should avoid this. The smallest change to what you have is to
always pass page_mappings_only for the kdump region.
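The crash-region call in __map_memblock() would then look something like
this (untested), forcing true for the page_mappings_only argument instead of
passing debug_pagealloc_enabled():

	/* before kexec_load(), the region can be read-writable. */
	__create_pgd_mapping(pgd, crashk_res.start,
			     __phys_to_virt(crashk_res.start),
			     crashk_res.end - crashk_res.start + 1,
			     PAGE_KERNEL, early_pgtable_alloc,
			     true);	/* force page-granular mappings */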

Ideally we just disable this resize feature for ARM64 and support it with some
later kernel version, but I can't see a way of doing this without adding Kconfig
symbols to other architectures.
Post by AKASHI Takahiro
(Please note that the region is memblock_reserve()'d at boot time.)
And free_reserved_page() does nothing to update memblock, so
memblock_is_reserved() says these pages are reserved, but in reality they
are in use by the memory allocator. This doesn't feel right.

(Fortunately we can override crash_free_reserved_phys_range() so this can
probably be fixed)
Post by AKASHI Takahiro
Post by James Morse
This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
assumes pfn_valid() means it can access the page if it wants to. Setting
PG_Reserved is a quick way to trick it out of doing this, but that would leave
the crash kernel region un-initialised after resume, while kexec_crash_image
still has a value.
Ouch, I didn't notice this issue.
Post by James Morse
I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
arguing these are mutually-exclusive features, and the protect crash-dump
feature exists to prevent things like hibernate corrupting the crash region.
This restriction is really painful.
Is there any hibernation hook that will be invoked before suspending and
after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres()
will be able to be called.
Those calls could go in swsusp_arch_suspend() in /arch/arm64/kernel/hibernate.c,
but isn't this protect feature supposed to stop things like hibernate from
meddling with the region?

(I haven't tested what hibernate does with the crash region as its only just
occurred to me)

I think to avoid holding kdump up we should disable any possible interaction,
(forbid hibernate if a kdump kernel is loaded), and sort it out later!
Post by AKASHI Takahiro
Post by James Morse
Post by AKASHI Takahiro
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..f7938fecf3ff 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
- /* Flush the kimage list and its buffers. */
- kexec_list_flush(kimage);
+ if (kimage != kexec_crash_image) {
+ /* Flush the kimage list and its buffers. */
+ kexec_list_flush(kimage);
- /* Flush the new image if already in place. */
- if (kimage->head & IND_DONE)
- kexec_segment_flush(kimage);
+ /* Flush the new image if already in place. */
+ if (kimage->head & IND_DONE)
+ kexec_segment_flush(kimage);
+ }
So for kdump we cleaned the kimage->segment[i].mem regions in
arch_kexec_protect_crashkres(), so don't need to do it here.
Correct.
Post by James Morse
What about the kimage->head[i] array of list entries that were cleaned by
kexec_list_flush()? Now we don't clean that for kdump either, but we do pass it
Post by AKASHI Takahiro
cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head, kimage_start, 0);
Kimage->head holds a list of memory regions that overlap between the
primary kernel and the secondary kernel, but in the kdump case, the whole
memory is isolated and the list should be empty.
The asm code will still try to walk the list with MMU and caches turned off, so
even its "I'm empty" values need cleaning to the PoC.

(it looks like the first value is passed by value, so we could try and be clever
by testing for that DONE flag in the first value, but I don't think it's worth
the effort)


Thanks,

James
AKASHI Takahiro
2017-01-27 17:15:16 UTC
Permalink
James,
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
Post by James Morse
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.
Note that the region must reside in linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.
Nasty! Presumably you have to build the crash region out of individual page
mappings,
This might be an alternative, but
Post by James Morse
so that they can be returned to the slab-allocator one page at a time,
and still be able to set/clear the valid bits on the remaining chunk.
(I don't see how that happens in this patch)
As far as shrinking feature is concerned, I believe, crash_shrink_memory(),
which eventually calls free_reserved_page(), will take care of all the things
to do. I can see increased number of "MemFree" in /proc/meminfo.
Except for arch specific stuff like reformatting the page tables. Maybe I've
We boot with crashkernel=1G on the commandline.
Memblock_reserve allocates a naturally aligned 1GB block of memory for the crash
region.
Your code in __map_memblock() calls __create_pgd_mapping() ->
alloc_init_pud() which decides use_1G_block() looks like a good idea.
Some time later, the user decides to free half of this region,
free_reserved_page() does its thing and half of those struct page's now belong
to the memory allocator.
Now we load a kdump kernel, which causes arch_kexec_protect_crashkres() to be
called for the 512MB region that was left.
create_mapping_late() needs to split the 1GB mapping it originally made into a
smaller table, with the first half using PAGE_KERNEL_INVALID, and the second
half using PAGE_KERNEL. It can't do break-before-make because these pages may be
in-use by another CPU because we gave them back to the memory allocator. (in the
worst-possible world, that second half contains our stack!)
Yeah, this is a horrible case.
Now I understand why we should stick with the page_mappings_only option.
Post by James Morse
Making this behave more like debug_pagealloc where the region is only built of
page-size mappings should avoid this. The smallest change to what you have is to
always pass page_mappings_only for the kdump region.
Ideally we just disable this resize feature for ARM64 and support it with some
later kernel version, but I can't see a way of doing this without adding Kconfig
symbols to other architectures.
Post by AKASHI Takahiro
(Please note that the region is memblock_reserve()'d at boot time.)
And free_reserved_page() does nothing to update memblock, so
memblock_is_reserved() says these pages are reserved, but in reality they
are in use by the memory allocator. This doesn't feel right.
Just FYI, no other architectures take care of this issue.

(and I don't know whether the memblock being reserved or not has any
impact after booting.)
Post by James Morse
(Fortunately we can override crash_free_reserved_phys_range() so this can
probably be fixed)
Post by AKASHI Takahiro
Post by James Morse
This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
assumes pfn_valid() means it can access the page if it wants to. Setting
PG_Reserved is a quick way to trick it out of doing this, but that would leave
the crash kernel region un-initialised after resume, while kexec_crash_image
still has a value.
Ouch, I didn't notice this issue.
Post by James Morse
I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
arguing these are mutually-exclusive features, and the protect crash-dump
feature exists to prevent things like hibernate corrupting the crash region.
This restriction is really painful.
Is there any hibernation hook that will be invoked before suspending and
after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres()
will be able to be called.
Those calls could go in swsusp_arch_suspend() in /arch/arm64/kernel/hibernate.c,
I will give it a try next week.
Post by James Morse
but isn't this protect feature supposed to stop things like hibernate from
meddling with the region?
It seems that the kexec code never expects the crash kernel memory to be
actually unmapped (as my current patch does).
Moreover, whether kdump or not, it is quite fragile to unmap some part of
the linear mapping dynamically. I think we probably need to implement a kind
of "memory hotplug" in order to perform such an unmapping without affecting
other kernel components.
Post by James Morse
(I haven't tested what hibernate does with the crash region as its only just
occurred to me)
I think to avoid holding kdump up we should disable any possible interaction,
(forbid hibernate if a kdump kernel is loaded), and sort it out later!
There are several options:
- hibernate and kdump need to be configured exclusively, or
- once kdump is loaded, hibernate will fail, or
- after resuming from hibernate, kdump won't work

The simplest way is to force users to re-load kdump after resuming,
but it sounds somewhat weird.
Post by James Morse
Post by AKASHI Takahiro
Post by James Morse
Post by AKASHI Takahiro
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..f7938fecf3ff 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -159,32 +171,20 @@ void machine_kexec(struct kimage *kimage)
- /* Flush the kimage list and its buffers. */
- kexec_list_flush(kimage);
+ if (kimage != kexec_crash_image) {
+ /* Flush the kimage list and its buffers. */
+ kexec_list_flush(kimage);
- /* Flush the new image if already in place. */
- if (kimage->head & IND_DONE)
- kexec_segment_flush(kimage);
+ /* Flush the new image if already in place. */
+ if (kimage->head & IND_DONE)
+ kexec_segment_flush(kimage);
+ }
So for kdump we cleaned the kimage->segment[i].mem regions in
arch_kexec_protect_crashkres(), so don't need to do it here.
Correct.
Post by James Morse
What about the kimage->head[i] array of list entries that were cleaned by
kexec_list_flush()? Now we don't clean that for kdump either, but we do pass it
Post by AKASHI Takahiro
cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head, kimage_start, 0);
Kimage->head holds a list of memory regions that overlap between the
primary kernel and the secondary kernel, but in the kdump case, the whole
memory is isolated and the list should be empty.
The asm code will still try to walk the list with MMU and caches turned off, so
even its "I'm empty" values need cleaning to the PoC.
(it looks like the first value is passed by value, so we could try and be clever
by testing for that DONE flag in the first value, but I don't think its worth
the effort)
Surely not.

-Takahiro AKASHI
Post by James Morse
Thanks,
James
Mark Rutland
2017-01-27 18:56:13 UTC
Permalink
Post by AKASHI Takahiro
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.
Note that the region must reside in linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.
Ah; I did not realise that this was a possibility.
Post by AKASHI Takahiro
Now I understand why we should stick with page_mapping_only option.
Likewise, I now agree.

Apologies for guiding you down the wrong path here.

Thanks,
Mark.
AKASHI Takahiro
2017-01-30 08:42:33 UTC
Permalink
Mark,
Post by Mark Rutland
Post by AKASHI Takahiro
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.
Note that the region must reside in linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.
Ah; I did not realise that this was a possibility.
Post by AKASHI Takahiro
Now I understand why we should stick with page_mapping_only option.
Likewise, I now agree.
Apologies for guiding you down the wrong path here.
Your comments are always welcome.

Anyhow, I think we'd better have a dedicated function for unmapping.
Can you please take a look at the following hack?

(We need to use this function carefully outside of the kdump usage, since
it doesn't care whether the region to be unmapped is used somewhere else.)

Thanks,
-Takahiro AKASHI

===8<===
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
#define PAGE_KERNEL_EXEC_CONT __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID __pgprot(0)

#define PAGE_HYP __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
#define PAGE_HYP_EXEC __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 17243e43184e..81173b594195 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -307,6 +307,101 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
} while (pgd++, addr = next, addr != end);
}

+static void free_pte(pmd_t *pmd, unsigned long addr, unsigned long end,
+ bool dealloc_table)
+{
+ pte_t *pte;
+ bool do_free = (dealloc_table && ((end - addr) == PMD_SIZE));
+
+ BUG_ON(pmd_none(*pmd) || pmd_bad(*pmd));
+
+ pte = pte_set_fixmap_offset(pmd, addr);
+ do {
+ pte_clear(NULL, NULL, pte);
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+
+ pte_clear_fixmap();
+
+ if (do_free) {
+ __free_page(pmd_page(*pmd));
+ pmd_clear(pmd);
+ }
+}
+
+static void free_pmd(pud_t *pud, unsigned long addr, unsigned long end,
+ bool dealloc_table)
+{
+ pmd_t *pmd;
+ unsigned long next;
+ bool do_free = (dealloc_table && ((end - addr) == PUD_SIZE));
+
+ BUG_ON(pud_none(*pud) || pud_bad(*pud));
+
+ pmd = pmd_set_fixmap_offset(pud, addr);
+
+ do {
+ next = pmd_addr_end(addr, end);
+
+ if (pmd_table(*pmd)) {
+ free_pte(pmd, addr, next, dealloc_table);
+ } else {
+ pmd_clear(pmd);
+ }
+ } while (pmd++, addr = next, addr != end);
+
+ pmd_clear_fixmap();
+
+ if (do_free) {
+ __free_page(pud_page(*pud));
+ pud_clear(pud);
+ }
+}
+
+static void free_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
+ bool dealloc_table)
+{
+ pud_t *pud;
+ unsigned long next;
+ bool do_free = (dealloc_table && ((end - addr) == PGDIR_SIZE));
+
+ BUG_ON(pgd_none(*pgd) || pgd_bad(*pgd));
+
+ pud = pud_set_fixmap_offset(pgd, addr);
+
+ do {
+ next = pud_addr_end(addr, end);
+
+ if (pud_table(*pud)) {
+ free_pmd(pud, addr, next, dealloc_table);
+ } else {
+ pud_clear(pud);
+ }
+ } while (pud++, addr = next, addr != end);
+
+ pud_clear_fixmap();
+
+ if (do_free) {
+ __free_page(pgd_page(*pgd));
+ pgd_clear(pgd);
+ }
+}
+
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long virt,
+ phys_addr_t size, bool dealloc_table)
+{
+ unsigned long addr, length, end, next;
+ pgd_t *pgd = pgd_offset_raw(pgdir, virt);
+
+ addr = virt & PAGE_MASK;
+ length = PAGE_ALIGN(size + (virt & ~PAGE_MASK));
+
+ end = addr + length;
+ do {
+ next = pgd_addr_end(addr, end);
+ free_pud(pgd, addr, next, dealloc_table);
+ } while (pgd++, addr = next, addr != end);
+}
+
static phys_addr_t pgd_pgtable_alloc(void)
{
void *ptr = (void *)__get_free_page(PGALLOC_GFP);
@@ -334,14 +429,15 @@ static void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL, false);
}

-void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
- unsigned long virt, phys_addr_t size,
- pgprot_t prot, bool page_mappings_only)
+void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
+ unsigned long virt, phys_addr_t size,
+ pgprot_t prot, bool page_mappings_only)
{
- BUG_ON(mm == &init_mm);
-
- __create_pgd_mapping(mm->pgd, phys, virt, size, prot,
- pgd_pgtable_alloc, page_mappings_only);
+ if (pgprot_val(prot))
+ __create_pgd_mapping(mm->pgd, phys, virt, size, prot,
+ pgd_pgtable_alloc, page_mappings_only);
+ else
+ __remove_pgd_mapping(mm->pgd, virt, size, true);
}

static void create_mapping_late(phys_addr_t phys, unsigned long virt,
===>8===
Post by Mark Rutland
Thanks,
Mark.
AKASHI Takahiro
2017-01-30 08:27:51 UTC
Permalink
James,
Post by AKASHI Takahiro
James,
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
Post by James Morse
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
- move copying kexec's control_code_page to machine_kexec_prepare()
so that the region will be completely read-only after loading.
Note that the region must reside in linear mapping and have corresponding
page structures in order to be potentially freed by shrinking it through
/sys/kernel/kexec_crash_size.
Nasty! Presumably you have to build the crash region out of individual page
mappings,
This might be an alternative, but
Post by James Morse
so that they can be returned to the slab-allocator one page at a time,
and still be able to set/clear the valid bits on the remaining chunk.
(I don't see how that happens in this patch)
As far as shrinking feature is concerned, I believe, crash_shrink_memory(),
which eventually calls free_reserved_page(), will take care of all the things
to do. I can see increased number of "MemFree" in /proc/meminfo.
Except for arch specific stuff like reformatting the page tables. Maybe I've
We boot with crashkernel=1G on the commandline.
Memblock_reserve allocates a naturally aligned 1GB block of memory for the crash
region.
Your code in __map_memblock() calls __create_pgd_mapping() ->
alloc_init_pud() which decides use_1G_block() looks like a good idea.
Some time later, the user decides to free half of this region,
free_reserved_page() does its thing and half of those struct page's now belong
to the memory allocator.
Now we load a kdump kernel, which causes arch_kexec_protect_crashkres() to be
called for the 512MB region that was left.
create_mapping_late() needs to split the 1GB mapping it originally made into a
smaller table, with the first half using PAGE_KERNEL_INVALID, and the second
half using PAGE_KERNEL. It can't do break-before-make because these pages may be
in-use by another CPU because we gave them back to the memory allocator. (in the
worst-possible world, that second half contains our stack!)
Yeah, this is a horrible case.
Now I understand why we should stick with page_mapping_only option.
Post by James Morse
Making this behave more like debug_pagealloc where the region is only built of
page-size mappings should avoid this. The smallest change to what you have is to
always pass page_mappings_only for the kdump region.
Ideally we just disable this resize feature for ARM64 and support it with some
later kernel version, but I can't see a way of doing this without adding Kconfig
symbols to other architectures.
Post by AKASHI Takahiro
(Please note that the region is memblock_reserve()'d at boot time.)
And free_reserved_page() does nothing to update memblock, so
memblock_is_reserved() says these pages are reserved, but in reality they
are in use by the memory allocator. This doesn't feel right.
Just FYI, no other architectures take care of this issue.
(and I don't know whether the memblock is reserved or not may have
any impact after booting.)
Post by James Morse
(Fortunately we can override crash_free_reserved_phys_range() so this can
probably be fixed)
Post by AKASHI Takahiro
Post by James Morse
This secretly-unmapped is the sort of thing that breaks hibernate, it blindly
assumes pfn_valid() means it can access the page if it wants to. Setting
PG_Reserved is a quick way to trick it out of doing this, but that would leave
the crash kernel region un-initialised after resume, while kexec_crash_image
still has a value.
Ouch, I didn't notice this issue.
Post by James Morse
I think the best fix for this is to forbid hibernate if kexec_crash_loaded()
arguing these are mutually-exclusive features, and the protect crash-dump
feature exists to prevent things like hibernate corrupting the crash region.
This restriction is really painful.
Is there any hibernation hook that will be invoked before suspending and
after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres()
will be able to be called.
Those calls could go in swsusp_arch_suspend() in /arch/arm64/kernel/hibernate.c,
I will give it a try next week.
I took the following test scenario:
- load the crash dump kernel
- echo shutdown > /sys/power/disk
- echo disk > /sys/power/state
- power off and on the board
- reboot with resume=...
- after hibernate done, echo c > /proc/sysrq-trigger
- after reboot, check /proc/vmcore

and everything looks to work fine.

If you think I'd better try more test cases, please let me know.

Thanks,
-Takahiro AKASHI
===8<===
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index fe301cbcb442..111a849333ee 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -16,6 +16,7 @@
*/
#define pr_fmt(x) "hibernate: " x
#include <linux/cpu.h>
+#include <linux/kexec.h>
#include <linux/kvm_host.h>
#include <linux/mm.h>
#include <linux/pm.h>
@@ -289,6 +290,12 @@ int swsusp_arch_suspend(void)
local_dbg_save(flags);

if (__cpu_suspend_enter(&state)) {
+#ifdef CONFIG_KEXEC_CORE
+ /* make the crash dump kernel region mapped */
+ if (kexec_crash_image)
+ arch_kexec_unprotect_crashkres();
+#endif
+
sleep_cpu = smp_processor_id();
ret = swsusp_save();
} else {
@@ -300,6 +307,12 @@ int swsusp_arch_suspend(void)
if (el2_reset_needed())
dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end);

+#ifdef CONFIG_KEXEC_CORE
+ /* make the crash dump kernel region unmapped */
+ if (kexec_crash_image)
+ arch_kexec_protect_crashkres();
+#endif
+
/*
* Tell the hibernation core that we've just restored
* the memory
James Morse
2017-01-27 13:59:05 UTC
Permalink
Hi Akashi,
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9c7adcce8e4e..2d4a0b68a852 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
@@ -367,6 +368,39 @@ static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end
unsigned long kernel_start = __pa(_text);
unsigned long kernel_end = __pa(__init_begin);
+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * While crash dump kernel memory is contained in a single memblock
+ * for now, it should appear in an isolated mapping so that we can
+ * independently unmap the region later.
+ */
+ if (crashk_res.end && crashk_res.start >= start &&
+ crashk_res.end <= end) {
+ if (crashk_res.start != start)
+ __create_pgd_mapping(pgd, start, __phys_to_virt(start),
+ crashk_res.start - start,
+ PAGE_KERNEL,
+ early_pgtable_alloc,
+ debug_pagealloc_enabled());
+
+ /* before kexec_load(), the region can be read-writable. */
+ __create_pgd_mapping(pgd, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ crashk_res.end - crashk_res.start + 1,
+ PAGE_KERNEL, early_pgtable_alloc,
+ debug_pagealloc_enabled());
+
+ if (crashk_res.end != end)
+ __create_pgd_mapping(pgd, crashk_res.end + 1,
+ __phys_to_virt(crashk_res.end + 1),
+ end - crashk_res.end - 1,
+ PAGE_KERNEL,
+ early_pgtable_alloc,
+ debug_pagealloc_enabled());
+ return;
Doesn't this mean we skip all the 'does this overlap with the kernel text' tests
that happen further down in this file?

(I think this can be fixed by replacing page_mappings_only with something
like needs_per_page_mapping(addr, size) called when we try to place a block
mapping. needs_per_page_mapping() can then take kdump and debug_pagealloc into
account.)


I see boot failures on v4.10-rc5, with this series (and your fixup diff for
patch 4). I'm using defconfig with 64K pages and 42bit VA on Juno. I pass
'crashkernel=1G':
[ 0.000000] efi: Getting EFI parameters from FDT:
[ 0.000000] efi: EFI v2.50 by ARM Juno EFI Nov 24 2015 12:36:35
[ 0.000000] efi: ACPI=0xf95b0000 ACPI 2.0=0xf95b0014 PROP=0xfe8db4d8
[ 0.000000] Reserving 1024MB of memory at 2560MB for crashkernel
[ 0.000000] cma: Failed to reserve 512 MiB
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] kernel BUG at ../arch/arm64/mm/mmu.c:118!
[ 0.000000] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00012-gdd6fe390c85d #6873
[ 0.000000] Hardware name: ARM Juno development board (r1) (DT)
[ 0.000000] task: fffffc0008e2c900 task.stack: fffffc0008df0000
[ 0.000000] PC is at __create_pgd_mapping+0x2c8/0x2e8
[ 0.000000] LR is at paging_init+0x2b0/0x6fc
[ 0.000000] pc : [<fffffc0008096b20>] lr : [<fffffc0008ce5a60>] pstate: 600000c5
[ 0.000000] sp : fffffc0008df3e20
[ 0.000000] x29: fffffc0008df3e20 x28: 00e8000000000713
[ 0.000000] x27: 0000000000000001 x26: 00000000e00f0000
[ 0.000000] x25: fffffe00794a0000 x24: fffffe00794a0000
[ 0.000000] x23: fffffe00600f0000 x22: 00e80000e0000711
[ 0.000000] x21: 0000000000000000 x20: fffffdff7e5e8018
[ 0.000000] x19: 000000000000e00f x18: 0000000000000000
[ 0.000000] x17: fffffc0008f61cd0 x16: 0000000000000005
[ 0.000000] x15: 0000000880000000 x14: 00000000001fffff
[ 0.000000] x13: 00f8000000000713 x12: fffffc0008e3d000
[ 0.000000] x11: 0000000000000801 x10: 00e8000000000713
[ 0.000000] x9 : 0000000000000000 x8 : 0000000000001003
[ 0.000000] x7 : 0000000000000000 x6 : 0000000000000000
[ 0.000000] x5 : fffffc0008ce5744 x4 : 00e8000000000713
[ 0.000000] x3 : fffffe007949ffff x2 : fffffe00e00f0000
[ 0.000000] x1 : 00e8000000000713 x0 : 0000000000000001
[ 0.000000]
[ 0.000000] Process swapper (pid: 0, stack limit = 0xfffffc0008df0000)
[ 0.000000] Stack: (0xfffffc0008df3e20 to 0xfffffc0008df4000)

[ 0.000000] Call trace:

[ 0.000000] [<fffffc0008096b20>] __create_pgd_mapping+0x2c8/0x2e8
[ 0.000000] [<fffffc0008ce5a60>] paging_init+0x2b0/0x6fc
[ 0.000000] [<fffffc0008ce27d0>] setup_arch+0x1c0/0x5ac
[ 0.000000] [<fffffc0008ce0838>] start_kernel+0x70/0x394
[ 0.000000] [<fffffc0008ce01e8>] __primary_switched+0x64/0x6c
[ 0.000000] Code: f29fefe1 ea01001f 54ffff00 d4210000 (d4210000)
[ 0.000000] ---[ end trace 0000000000000000 ]---
[ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!


Adding some debug shows the crash region was allocated as 0xa0000000:0xdfffffff,
and the first memblock to hit __map_memblock() was 0x80000000:0xe0000000.

This causes the end of the crash region to be padded, as 0xdfffffff!=0xe0000000,
but your 'crashk_res.end + 1' actually mapped 0xe0000000 to 0xe0000000 with 0
size. This causes __create_pgd_mapping() to choke when it next uses 0xe0000000
as a start address, as it evidently mapped something when given a 0 size.

You need to round up crashk_res.end to a page boundary if it isn't already
aligned.
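For example (untested; 'tail' is just an illustrative local), the tail
mapping could become:

	phys_addr_t tail = PAGE_ALIGN(crashk_res.end + 1);

	if (tail < end)
		__create_pgd_mapping(pgd, tail, __phys_to_virt(tail),
				     end - tail, PAGE_KERNEL,
				     early_pgtable_alloc,
				     debug_pagealloc_enabled());

which also avoids calling __create_pgd_mapping() with a zero size.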
Post by AKASHI Takahiro
+ }
+#endif
+
/*
* Take care not to create a writable alias for the
* read-only text and rodata sections of the kernel image.
Thanks,

James
AKASHI Takahiro
2017-01-27 15:42:20 UTC
Permalink
James,
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
To protect the memory reserved for crash dump kernel once after loaded,
arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with
permissions of the corresponding kernel mappings.
We also have to
- put the region in an isolated mapping, and
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9c7adcce8e4e..2d4a0b68a852 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
@@ -367,6 +368,39 @@ static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end
unsigned long kernel_start = __pa(_text);
unsigned long kernel_end = __pa(__init_begin);
+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * While crash dump kernel memory is contained in a single memblock
+ * for now, it should appear in an isolated mapping so that we can
+ * independently unmap the region later.
+ */
+ if (crashk_res.end && crashk_res.start >= start &&
+ crashk_res.end <= end) {
+ if (crashk_res.start != start)
+ __create_pgd_mapping(pgd, start, __phys_to_virt(start),
+ crashk_res.start - start,
+ PAGE_KERNEL,
+ early_pgtable_alloc,
+ debug_pagealloc_enabled());
+
+ /* before kexec_load(), the region can be read-writable. */
+ __create_pgd_mapping(pgd, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ crashk_res.end - crashk_res.start + 1,
+ PAGE_KERNEL, early_pgtable_alloc,
+ debug_pagealloc_enabled());
+
+ if (crashk_res.end != end)
+ __create_pgd_mapping(pgd, crashk_res.end + 1,
+ __phys_to_virt(crashk_res.end + 1),
+ end - crashk_res.end - 1,
+ PAGE_KERNEL,
+ early_pgtable_alloc,
+ debug_pagealloc_enabled());
+ return;
Doesn't this mean we skip all the 'does this overlap with the kernel text' tests
that happen further down in this file?
You're right. We should still check the overlap against
[start..crashk_res.start] and [crashk_res.end+1..end].
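
For illustration, one way to keep those checks would be to split the block
around the crash kernel region and feed each piece back through the normal
path; a rough sketch only (not code from this series), assuming it is
acceptable to re-enter __map_memblock() for the sub-ranges:

	if (crashk_res.end && crashk_res.start >= start &&
	    crashk_res.end < end) {
		/* head and tail still go through the text/rodata
		 * overlap handling further down in this function */
		if (crashk_res.start > start)
			__map_memblock(pgd, start, crashk_res.start);

		/* crash kernel region: read-write until kexec_load() */
		__create_pgd_mapping(pgd, crashk_res.start,
				     __phys_to_virt(crashk_res.start),
				     resource_size(&crashk_res), PAGE_KERNEL,
				     early_pgtable_alloc,
				     debug_pagealloc_enabled());

		if (crashk_res.end + 1 < end)
			__map_memblock(pgd, crashk_res.end + 1, end);
		return;
	}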

(Using memblock_isolate_range() could simplify the code.)
Post by James Morse
(I think this can be fixed by replacing page_mappings_only with something
like needs_per_page_mapping(addr, size) called when we try to place a block
mapping. needs_per_page_mapping() can then take kdump and debug_pagealloc into
account.)
I see boot failures on v4.10-rc5, with this series (and your fixup diff for
patch 4). I'm using defconfig with 64K pages and 42bit VA on Juno. I pass
[ 0.000000] efi: EFI v2.50 by ARM Juno EFI Nov 24 2015 12:36:35
[ 0.000000] efi: ACPI=0xf95b0000 ACPI 2.0=0xf95b0014 PROP=0xfe8db4d8
[ 0.000000] Reserving 1024MB of memory at 2560MB for crashkernel
[ 0.000000] cma: Failed to reserve 512 MiB
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] kernel BUG at ../arch/arm64/mm/mmu.c:118!
[ 0.000000] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00012-gdd6fe390c85d #6873
[ 0.000000] Hardware name: ARM Juno development board (r1) (DT)
[ 0.000000] task: fffffc0008e2c900 task.stack: fffffc0008df0000
[ 0.000000] PC is at __create_pgd_mapping+0x2c8/0x2e8
[ 0.000000] LR is at paging_init+0x2b0/0x6fc
[ 0.000000] pc : [<fffffc0008096b20>] lr : [<fffffc0008ce5a60>] pstate: 600000c5
[ 0.000000] sp : fffffc0008df3e20
[ 0.000000] x29: fffffc0008df3e20 x28: 00e8000000000713
[ 0.000000] x27: 0000000000000001 x26: 00000000e00f0000
[ 0.000000] x25: fffffe00794a0000 x24: fffffe00794a0000
[ 0.000000] x23: fffffe00600f0000 x22: 00e80000e0000711
[ 0.000000] x21: 0000000000000000 x20: fffffdff7e5e8018
[ 0.000000] x19: 000000000000e00f x18: 0000000000000000
[ 0.000000] x17: fffffc0008f61cd0 x16: 0000000000000005
[ 0.000000] x15: 0000000880000000 x14: 00000000001fffff
[ 0.000000] x13: 00f8000000000713 x12: fffffc0008e3d000
[ 0.000000] x11: 0000000000000801 x10: 00e8000000000713
[ 0.000000] x9 : 0000000000000000 x8 : 0000000000001003
[ 0.000000] x7 : 0000000000000000 x6 : 0000000000000000
[ 0.000000] x5 : fffffc0008ce5744 x4 : 00e8000000000713
[ 0.000000] x3 : fffffe007949ffff x2 : fffffe00e00f0000
[ 0.000000] x1 : 00e8000000000713 x0 : 0000000000000001
[ 0.000000]
[ 0.000000] Process swapper (pid: 0, stack limit = 0xfffffc0008df0000)
[ 0.000000] Stack: (0xfffffc0008df3e20 to 0xfffffc0008df4000)
[ 0.000000] [<fffffc0008096b20>] __create_pgd_mapping+0x2c8/0x2e8
[ 0.000000] [<fffffc0008ce5a60>] paging_init+0x2b0/0x6fc
[ 0.000000] [<fffffc0008ce27d0>] setup_arch+0x1c0/0x5ac
[ 0.000000] [<fffffc0008ce0838>] start_kernel+0x70/0x394
[ 0.000000] [<fffffc0008ce01e8>] __primary_switched+0x64/0x6c
[ 0.000000] Code: f29fefe1 ea01001f 54ffff00 d4210000 (d4210000)
[ 0.000000] ---[ end trace 0000000000000000 ]---
[ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
Adding some debug shows the crash region was allocated as 0xa0000000:0xdfffffff,
and the first memblock to hit __map_memblock() was 0x80000000:0xe0000000.
This causes the end of the crash region to be padded, as 0xdfffffff!=0xe0000000,
but your 'crashk_res.end + 1' actually mapped 0xe0000000 to 0xe0000000 with 0
size. This causes __create_pgd_mapping() to choke when it next uses 0xe0000000
as a start address, as it evidently mapped something when given a 0 size.
You need to round up crashk_res.end to a page boundary if it isn't already
aligned.
The start address is already enforced to be 2MB-aligned and the size of the
region (hence start + size) is also page-size aligned. (See patch #3.)
Since crashk_res.end is 'inclusive,' the fix would be
Post by James Morse
Post by AKASHI Takahiro
+ if (crashk_res.end != end)
+ __create_pgd_mapping(pgd, crashk_res.end + 1,
To
if ((crashk_res.end + 1) < end)
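
For a quick check with the Juno addresses from the log above (illustrative
arithmetic only):

	/* crashk_res.end == 0xdfffffff, end == 0xe0000000 (exclusive) */
	(crashk_res.end + 1) < end	/* 0xe0000000 < 0xe0000000 -> false */

so the zero-size tail mapping is no longer attempted, while a crash region
that ends below 'end' still gets its tail mapped as before.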

Thanks,
-Takahiro AKASHI
Post by James Morse
Post by AKASHI Takahiro
+ }
+#endif
+
/*
* Take care not to create a writable alias for the
* read-only text and rodata sections of the kernel image.
Thanks,
James
Mark Rutland
2017-01-27 19:41:14 UTC
Permalink
Post by AKASHI Takahiro
Post by James Morse
Post by AKASHI Takahiro
+ /*
+ * While crash dump kernel memory is contained in a single memblock
+ * for now, it should appear in an isolated mapping so that we can
+ * independently unmap the region later.
+ */
+ if (crashk_res.end && crashk_res.start >= start &&
+ crashk_res.end <= end) {
+ if (crashk_res.start != start)
+ __create_pgd_mapping(pgd, start, __phys_to_virt(start),
+ crashk_res.start - start,
+ PAGE_KERNEL,
+ early_pgtable_alloc,
+ debug_pagealloc_enabled());
+
+ /* before kexec_load(), the region can be read-writable. */
+ __create_pgd_mapping(pgd, crashk_res.start,
+ __phys_to_virt(crashk_res.start),
+ crashk_res.end - crashk_res.start + 1,
+ PAGE_KERNEL, early_pgtable_alloc,
+ debug_pagealloc_enabled());
+
+ if (crashk_res.end != end)
+ __create_pgd_mapping(pgd, crashk_res.end + 1,
+ __phys_to_virt(crashk_res.end + 1),
+ end - crashk_res.end - 1,
+ PAGE_KERNEL,
+ early_pgtable_alloc,
+ debug_pagealloc_enabled());
+ return;
Doesn't this mean we skip all the 'does this overlap with the kernel text' tests
that happen further down in this file?
You're right. We should still check the overlap against
[start..crashk_res.start] and [crashk_res.end+1..end].
(Using memblock_isolate_range() could simplify the code.)
My key concern here was that we handle both of these in the same way, so
using memblock_isolate_range() for both generally sounds fine to me.

One concern I had with using memblock_isolate_range() is that it does
not guarantee that the region remains isolated. So if there was a
subsequent memblock_add() call, memblock_merge_regions() might end up
merging the region with an adjacent region.

If we isolate the regions at the start of map_mem(), and have a comment
explaining that we must avoid subsequent memblock merging, then I think
this would be ok.
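
For illustration, a rough sketch of that shape (not code from this series;
note that memblock_isolate_range() is currently a static helper in
mm/memblock.c, so it would first have to be made visible to arch code):

	/* at the top of map_mem(), before the for_each_memblock() loop */
	#ifdef CONFIG_KEXEC_CORE
	if (crashk_res.end) {
		int start_rgn, end_rgn;

		/*
		 * Split the crash kernel range into its own memblock regions
		 * so the loop below maps it separately.  Nothing prevents a
		 * later memblock_add() from merging these regions back, so
		 * no memblock_add() may follow this point.
		 */
		memblock_isolate_range(&memblock.memory, crashk_res.start,
				       resource_size(&crashk_res),
				       &start_rgn, &end_rgn);
	}
	#endif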

Thanks,
Mark.