Discussion:
[PATCH v33 00/14] add kdump support
AKASHI Takahiro
2017-03-15 09:56:56 UTC
This patch series adds kdump support on arm64.

To load a crash-dump kernel into the system, a series of patches to
kexec-tools[1] is also needed. Please use the latest version, v6 [2].
For your convenience, you can pick them up from:
https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump

To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
- crash utility (v7.1.8 or later) [3]

I tested this patchset on fast model and hikey.

The previous versions were also:
Tested-by: Pratyush Anand <***@redhat.com> (v32, mustang and seattle)
Tested-by: James Morse <***@arm.com> (v27/v32?, Juno)
Tested-by: Sameer Goel (v32, QDT2400)

Changes for v33 (Mar 15, 2017)
o rebased to v4.11-rc2+
o arch_kexec_(un)protect_crashkres() now protects only the loaded data
segments, and the copying of control_code_page is moved back to
machine_kexec() (patch #6)
o reduce the size of the hibernation image when kdump and hibernation are
configured at the same time (patch #7)
o clarify that "linux,usable-memory-range" and "linux,elfcorehdr"
have values sized by the root node's "#address-cells" and "#size-cells"
(patch #13)
o add "efi/libstub/arm*: Set default address and size cells values for
an empty dtb" from Sameer Goel (patch #14)
(I didn't test the case though.)

Changes for v32 (Feb 7, 2017)
o isolate crash dump kernel memory as well as kernel text/data by using
the MEMBLOCK_NOMAP attribute, and then specifically map them in map_mem()
(patch #1,6)
o delete remove_pgd_mapping() and instead modify create_pgd_mapping() to
allow for unmapping a kernel mapping (patch #5)
o correct a commit message as well as a comment in the source (patch#10)
o other trivial changes after Mark's comments (patch#3,4)

Changes for v31 (Feb 1, 2017)
o add/use remove_pgd_mapping() instead of modifying (__)create_pgd_mapping()
to protect crash dump kernel memory (patch #4,5)
o fix an issue at the isolation of crash dump kernel memory in
map_mem()/__map_memblock(), adding map_crashkernel() (patch#5)
o preserve the contents of crash dump kernel memory around hibernation
(patch#6)

Changes for v30 (Jan 24, 2017)
o rebased to Linux-v4.10-rc5
o remove "linux,crashkernel-base/size" from exported device tree
o protect memory region for crash-dump kernel (adding patch#4,5)
o remove "in_crash_kexec" variable
o and other trivial changes

Changes for v29 (Dec 28, 2016)
o rebased to Linux-v4.10-rc1
o change asm constraints in crash_setup_regs() per Catalin

Changes for v28 (Nov 22, 2016)
o rebased to Linux-v4.9-rc6
o revamp patch #1 and merge memblock_cap_memory_range() with
memblock_mem_limit_remove_map()

Changes for v27 (Nov 1, 2016)
o rebased to Linux-v4.9-rc3
o revert v26 change, i.e. revive "linux,usable-memory-range" property
(patch #2/#3, updating patch #9)
o minor fixes per review comments (patch #3/#4/#6/#8)
o re-order patches and improve commit messages for readability

Changes for v26 (Sep 7, 2016):
o Use /reserved-memory instead of "linux,usable-memory-range" property
(dropping v25's patch#2 and #3, updating ex-patch#9.)

Changes for v25 (Aug 29, 2016):
o Rebase to Linux-4.8-rc4
o Use memremap() instead of ioremap_cache() [patch#5]

Changes for v24 (Aug 9, 2016):
o Rebase to Linux-4.8-rc1
o Update descriptions of the newly added DT properties

Changes for v23 (July 26, 2016):

o Move memblock_reserve() to a single place in reserve_crashkernel()
o Use cpu_park_loop() in ipi_cpu_crash_stop()
o Always enforce ARCH_LOW_ADDRESS_LIMIT on the memory range of the crash kernel
o Re-implement fdt_enforce_memory_region() to remove non-reserved regions
(for ACPI) from usable memory in the crash kernel

Changes for v22 (July 12, 2016):

o Export "crashkernel-base" and "crashkernel-size" via device-tree,
and add some descriptions about them in chosen.txt
o Rename "usable-memory" to "usable-memory-range" to avoid inconsistency
with powerpc's "usable-memory"
o Make cosmetic changes regarding "ifdef" usage
o Correct some wordings in kdump.txt

Changes for v21 (July 6, 2016):

o Remove kexec patches.
o Rebase to arm64's for-next/core (Linux-4.7-rc4 based).
o Clarify the description about kvm in kdump.txt.

See the link [4] for older changes.


[1] https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
[2] http://lists.infradead.org/pipermail/kexec/2017-March/018356.html
[3] https://github.com/crash-utility/crash.git
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-June/438780.html

AKASHI Takahiro (12):
memblock: add memblock_clear_nomap()
memblock: add memblock_cap_memory_range()
arm64: limit memory regions based on DT property, usable-memory-range
arm64: kdump: reserve memory for crash dump kernel
arm64: mm: allow for unmapping part of kernel mapping
arm64: kdump: protect crash dump kernel memory
arm64: hibernate: preserve kdump image around hibernation
arm64: kdump: implement machine_crash_shutdown()
arm64: kdump: add VMCOREINFO's for user-space tools
arm64: kdump: provide /proc/vmcore file
arm64: kdump: enable kdump in defconfig
Documentation: kdump: describe arm64 port

James Morse (1):
Documentation: dt: chosen properties for arm64 kdump

Sameer Goel (1):
efi/libstub/arm*: Set default address and size cells values for an
empty dtb

Documentation/devicetree/bindings/chosen.txt | 45 +++++++
Documentation/kdump/kdump.txt | 16 ++-
arch/arm64/Kconfig | 11 ++
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 52 +++++++-
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/include/asm/smp.h | 2 +
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 +++++++++++
arch/arm64/kernel/hibernate.c | 10 +-
arch/arm64/kernel/machine_kexec.c | 170 +++++++++++++++++++++++--
arch/arm64/kernel/setup.c | 7 +-
arch/arm64/kernel/smp.c | 63 ++++++++++
arch/arm64/mm/init.c | 181 +++++++++++++++++++++++++++
arch/arm64/mm/mmu.c | 107 ++++++++--------
drivers/firmware/efi/libstub/fdt.c | 28 ++++-
include/linux/memblock.h | 2 +
mm/memblock.c | 56 ++++++---
19 files changed, 745 insertions(+), 81 deletions(-)
create mode 100644 arch/arm64/kernel/crash_dump.c
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:00 UTC
This function, in combination with memblock_mark_nomap(), will be used
in a later kdump patch for arm64 when it temporarily isolates some range
of memory from the other memory blocks in order to create a specific
kernel mapping at boot time.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 12 ++++++++++++
2 files changed, 13 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bdfc65af4152..e82daffcfc44 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -93,6 +93,7 @@ int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
ulong choose_memblock_flags(void);

/* Low level functions */
diff --git a/mm/memblock.c b/mm/memblock.c
index 696f06d17c4e..2f4ca8104ea4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -805,6 +805,18 @@ int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
}

/**
+ * memblock_clear_nomap - Clear flag MEMBLOCK_NOMAP for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
+}
+
+/**
* __next_reserved_mem_region - next function for for_each_reserved_region()
* @idx: pointer to u64 loop variable
* @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:34 UTC
create_pgd_mapping() is enhanced here so that it will accept
PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.

The feature will be used in a later kdump patch to implement the protection
against possible corruption of crash dump kernel memory which is to be set
aside from ther other memory on primary kernel.

Note that, in this implementation, it assumes that all the range of memory
to be processed is mapped in page-level since the only current user is
kdump where page mappings are also required.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/mm/mmu.c | 17 +++++++++++------
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
#define PAGE_KERNEL_EXEC_CONT __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID __pgprot(0)

#define PAGE_HYP __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
#define PAGE_HYP_EXEC __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d28dbcf596b6..cb359a3927ef 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
do {
pte_t old_pte = *pte;

- set_pte(pte, pfn_pte(pfn, prot));
+ if (pgprot_val(prot))
+ set_pte(pte, pfn_pte(pfn, prot));
+ else
+ pte_clear(null, null, pte);
pfn++;

/*
@@ -309,12 +312,14 @@ static void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL, false);
}

-void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
- unsigned long virt, phys_addr_t size,
- pgprot_t prot, bool page_mappings_only)
+/*
+ * Note that PAGE_KERNEL_INVALID should be used with page_mappings_only
+ * true for now.
+ */
+void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
+ unsigned long virt, phys_addr_t size,
+ pgprot_t prot, bool page_mappings_only)
{
- BUG_ON(mm == &init_mm);
-
__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
pgd_pgtable_alloc, page_mappings_only);
}
--
2.11.1
James Morse
2017-03-21 10:35:53 UTC
Hi Akashi,
Post by AKASHI Takahiro
create_pgd_mapping() is enhanced here so that it will accept
PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.
The feature will be used in a later kdump patch to implement the protection
against possible corruption of crash dump kernel memory which is to be set
aside from ther other memory on primary kernel.
Nit: ther -> the
Post by AKASHI Takahiro
Note that, in this implementation, it assumes that all the range of memory
to be processed is mapped in page-level since the only current user is
kdump where page mappings are also required.
Using create_pgd_mapping() like this means the mappings will be updated via the
fixmap which is unnecessary as the page tables will be part of mapped memory. In
the worst case this adds an extra tlbi for every 2MB of crash image when we map
or unmap it. I don't think this matters.

This code used to be __init and it is the only user of FIX_PTE, so there won't
be any existing runtime users. The two arch_kexec_unprotect_crashkres() calls in
kexec are both protected by the kexec_mutex, and the call in hibernate happens
after disable_nonboot_cpus(), so these callers can't race with each other.

This looks safe to me.
Post by AKASHI Takahiro
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d28dbcf596b6..cb359a3927ef 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
do {
pte_t old_pte = *pte;
- set_pte(pte, pfn_pte(pfn, prot));
+ if (pgprot_val(prot))
+ set_pte(pte, pfn_pte(pfn, prot));
+ else
+ pte_clear(null, null, pte);
Lowercase NULLs? This relies on these values never being used... __set_fixmap()
in the same file passes &init_mm and the address, can we do the same to be
consistent?
Post by AKASHI Takahiro
pfn++;
/*
Reviewed-by: James Morse <***@arm.com>


Thanks,

James
AKASHI Takahiro
2017-03-23 11:43:10 UTC
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
create_pgd_mapping() is enhanced here so that it will accept
PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.
The feature will be used in a later kdump patch to implement the protection
against possible corruption of crash dump kernel memory which is to be set
aside from ther other memory on primary kernel.
Nit: ther -> the
Fix it.
Post by James Morse
Post by AKASHI Takahiro
Note that, in this implementation, it assumes that all the range of memory
to be processed is mapped in page-level since the only current user is
kdump where page mappings are also required.
Using create_pgd_mapping() like this means the mappings will be updated via the
fixmap which is unnecessary as the page tables will be part of mapped memory. In
This might be a reason that we would go for (un)map_kernel_range()
over create_pgd_mapping() (? not sure)
Post by James Morse
the worst case this adds an extra tlbi for every 2MB of crash image when we map
or unmap it. I don't think this matters.
This code used to be __init and it is the only user of FIX_PTE, so there won't
be any existing runtime users. The two arch_kexec_unprotect_crashkres() calls in
kexec are both protected by the kexec_mutex, and the call in hibernate happens
after disable_nonboot_cpus(), so these callers can't race with each other.
This looks safe to me.
Post by AKASHI Takahiro
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d28dbcf596b6..cb359a3927ef 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
do {
pte_t old_pte = *pte;
- set_pte(pte, pfn_pte(pfn, prot));
+ if (pgprot_val(prot))
+ set_pte(pte, pfn_pte(pfn, prot));
+ else
+ pte_clear(null, null, pte);
Lowercase NULLs? This relies on these values never being used... __set_fixmap()
in the same file passes &init_mm and the address, can we do the same to be
consistent?
OK.

Thanks,
-Takahiro AKASHI
Post by James Morse
Post by AKASHI Takahiro
pfn++;
/*
Thanks,
James
Ard Biesheuvel
2017-03-24 10:57:02 UTC
Post by AKASHI Takahiro
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
create_pgd_mapping() is enhanced here so that it will accept
PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.
The feature will be used in a later kdump patch to implement the protection
against possible corruption of crash dump kernel memory which is to be set
aside from ther other memory on primary kernel.
Nit: ther -> the
Fix it.
Post by James Morse
Post by AKASHI Takahiro
Note that, in this implementation, it assumes that all the range of memory
to be processed is mapped in page-level since the only current user is
kdump where page mappings are also required.
Using create_pgd_mapping() like this means the mappings will be updated via the
fixmap which is unnecessary as the page tables will be part of mapped memory. In
This might be a reason that we would go for (un)map_kernel_range()
over create_pgd_mapping() (? not sure)
Yes, that is why I suggested it. We already use it to unmap the init
segment at the end of boot, but I do take your point about it being
documented as operating on kernel VMAs only.

Looking at the code, it shouldn't matter (it does not touch or reason
about VMAs at all, it only walks the page tables and unmaps them), but
I agree it is better not to rely on that.

But instead of clearing all permissions, which apparently requires
changes to alloc_init_pte(), and introduces the restriction that the
region should be mapped down to pages, could we not simply clear
PTE_VALID on the region, like we do for debug_pagealloc()?
Post by AKASHI Takahiro
Post by James Morse
the worst case this adds an extra tlbi for every 2MB of crash image when we map
or unmap it. I don't think this matters.
This code used to be __init and it is the only user of FIX_PTE, so there won't
be any existing runtime users. The two arch_kexec_unprotect_crashkres() calls in
kexec are both protected by the kexec_mutex, and the call in hibernate happens
after disable_nonboot_cpus(), so these callers can't race with each other.
This looks safe to me.
Post by AKASHI Takahiro
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d28dbcf596b6..cb359a3927ef 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
do {
pte_t old_pte = *pte;
- set_pte(pte, pfn_pte(pfn, prot));
+ if (pgprot_val(prot))
+ set_pte(pte, pfn_pte(pfn, prot));
+ else
+ pte_clear(null, null, pte);
Lowercase NULLs? This relies on these values never being used... __set_fixmap()
in the same file passes &init_mm and the address, can we do the same to be
consistent?
OK.
Thanks,
-Takahiro AKASHI
Post by James Morse
Post by AKASHI Takahiro
pfn++;
/*
Thanks,
James
_______________________________________________
linux-arm-kernel mailing list
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
AKASHI Takahiro
2017-03-27 13:49:16 UTC
Ard,
Post by Ard Biesheuvel
Post by AKASHI Takahiro
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
create_pgd_mapping() is enhanced here so that it will accept
PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.
The feature will be used in a later kdump patch to implement the protection
against possible corruption of crash dump kernel memory which is to be set
aside from ther other memory on primary kernel.
Nit: ther -> the
Fix it.
Post by James Morse
Post by AKASHI Takahiro
Note that, in this implementation, it assumes that all the range of memory
to be processed is mapped in page-level since the only current user is
kdump where page mappings are also required.
Using create_pgd_mapping() like this means the mappings will be updated via the
fixmap which is unnecessary as the page tables will be part of mapped memory. In
This might be a reason that we would go for (un)map_kernel_range()
over create_pgd_mapping() (? not sure)
Yes, that is why I suggested it. We already use it to unmap the init
segment at the end of boot, but I do take your point about it being
documented as operating on kernel VMAs only.
Looking at the code, it shouldn't matter (it does not touch or reason
about VMAs at all, it only walks the page tables and unmaps them), but
I agree it is better not to rely on that.
OK
Post by Ard Biesheuvel
But instead of clearing all permissions, which apparently requires
changes to alloc_init_pte(), and introduces the restriction that the
region should be mapped down to pages, could we not simply clear
PTE_VALID on the region, like we do for debug_pagealloc()?
Now that we are only using page-level mappings for crash kernel memory,
__change_page_common() should work, and it even avoids the concerns that
James pointed out.
I will update my patch soon.

Thanks,
-Takahiro AKASHI
Post by Ard Biesheuvel
Post by AKASHI Takahiro
Post by James Morse
the worst case this adds an extra tlbi for every 2MB of crash image when we map
or unmap it. I don't think this matters.
This code used to be __init and it is the only user of FIX_PTE, so there won't
be any existing runtime users. The two arch_kexec_unprotect_crashkres() calls in
kexec are both protected by the kexec_mutex, and the call in hibernate happens
after disable_nonboot_cpus(), so these callers can't race with each other.
This looks safe to me.
Post by AKASHI Takahiro
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d28dbcf596b6..cb359a3927ef 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
do {
pte_t old_pte = *pte;
- set_pte(pte, pfn_pte(pfn, prot));
+ if (pgprot_val(prot))
+ set_pte(pte, pfn_pte(pfn, prot));
+ else
+ pte_clear(null, null, pte);
Lowercase NULLs? This relies on these values never being used... __set_fixmap()
in the same file passes &init_mm and the address, can we do the same to be
consistent?
OK.
Thanks,
-Takahiro AKASHI
Post by James Morse
Post by AKASHI Takahiro
pfn++;
/*
Thanks,
James
Ard Biesheuvel
2017-03-21 11:16:34 UTC
Post by AKASHI Takahiro
create_pgd_mapping() is enhanced here so that it will accept
PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.
The feature will be used in a later kdump patch to implement the protection
against possible corruption of crash dump kernel memory which is to be set
aside from ther other memory on primary kernel.
Note that, in this implementation, it assumes that all the range of memory
to be processed is mapped in page-level since the only current user is
kdump where page mappings are also required.
Couldn't we use unmap_kernel_range() for this?
Post by AKASHI Takahiro
---
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/mm/mmu.c | 17 +++++++++++------
2 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
#define PAGE_KERNEL_EXEC_CONT __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID __pgprot(0)
#define PAGE_HYP __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
#define PAGE_HYP_EXEC __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d28dbcf596b6..cb359a3927ef 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
do {
pte_t old_pte = *pte;
- set_pte(pte, pfn_pte(pfn, prot));
+ if (pgprot_val(prot))
+ set_pte(pte, pfn_pte(pfn, prot));
+ else
+ pte_clear(null, null, pte);
pfn++;
/*
@@ -309,12 +312,14 @@ static void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL, false);
}
-void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
- unsigned long virt, phys_addr_t size,
- pgprot_t prot, bool page_mappings_only)
+/*
+ * Note that PAGE_KERNEL_INVALID should be used with page_mappings_only
+ * true for now.
+ */
+void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
+ unsigned long virt, phys_addr_t size,
+ pgprot_t prot, bool page_mappings_only)
{
- BUG_ON(mm == &init_mm);
-
__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
pgd_pgtable_alloc, page_mappings_only);
}
--
2.11.1
AKASHI Takahiro
2017-03-23 10:56:19 UTC
Ard,
Post by Ard Biesheuvel
Post by AKASHI Takahiro
create_pgd_mapping() is enhanced here so that it will accept
PAGE_KERNEL_INVALID protection attribute and unmap a given range of memory.
The feature will be used in a later kdump patch to implement the protection
against possible corruption of crash dump kernel memory which is to be set
aside from ther other memory on primary kernel.
Note that, in this implementation, it assumes that all the range of memory
to be processed is mapped in page-level since the only current user is
kdump where page mappings are also required.
Couldn't we use unmap_kernel_range() for this?
I've almost forgotten this function, but my understanding is that
this function (and its map counterpart) is mainly for "VM area" (< PAGE_OFFSET).
While it seems to (and actually does) work in place of create_pgd_mapping() for now,
I'm not sure whether that is an expected usage (now and in the future).

So I think that it would be safe to keep using create_pgd_mapping().

Thanks,
-Takahiro AKASHI
Post by Ard Biesheuvel
Post by AKASHI Takahiro
---
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/mm/mmu.c | 17 +++++++++++------
2 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c7726e76..945d84cd5df7 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -54,6 +54,7 @@
#define PAGE_KERNEL_ROX __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_RDONLY)
#define PAGE_KERNEL_EXEC __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE)
#define PAGE_KERNEL_EXEC_CONT __pgprot(_PAGE_DEFAULT | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_CONT)
+#define PAGE_KERNEL_INVALID __pgprot(0)
#define PAGE_HYP __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_HYP_XN)
#define PAGE_HYP_EXEC __pgprot(_PAGE_DEFAULT | PTE_HYP | PTE_RDONLY)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d28dbcf596b6..cb359a3927ef 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -128,7 +128,10 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
do {
pte_t old_pte = *pte;
- set_pte(pte, pfn_pte(pfn, prot));
+ if (pgprot_val(prot))
+ set_pte(pte, pfn_pte(pfn, prot));
+ else
+ pte_clear(null, null, pte);
pfn++;
/*
@@ -309,12 +312,14 @@ static void __init create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
__create_pgd_mapping(init_mm.pgd, phys, virt, size, prot, NULL, false);
}
-void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
- unsigned long virt, phys_addr_t size,
- pgprot_t prot, bool page_mappings_only)
+/*
+ * Note that PAGE_KERNEL_INVALID should be used with page_mappings_only
+ * true for now.
+ */
+void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
+ unsigned long virt, phys_addr_t size,
+ pgprot_t prot, bool page_mappings_only)
{
- BUG_ON(mm == &init_mm);
-
__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
pgd_pgtable_alloc, page_mappings_only);
}
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:35 UTC
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres()
are meant to be called by kexec_load() in order to protect the memory
allocated for the crash dump kernel once the image is loaded.

The protection is implemented by unmapping the relevant segments in crash
dump kernel memory, rather than making them read-only as other arches do,
to prevent any corruption due to a potential cache-alias problem
(mappings with different attributes).

Page-level mappings are consistently used here so that we can change
the attributes of segments at page granularity, as well as shrink the region
at page granularity through /sys/kernel/kexec_crash_size, putting
the freed memory back into the buddy system.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
arch/arm64/kernel/machine_kexec.c | 35 ++++++++++++---
arch/arm64/mm/mmu.c | 90 ++++++++++++++++++++-------------------
2 files changed, 75 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index bc96c8a7fc79..02e4f929db3b 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -14,6 +14,7 @@

#include <asm/cacheflush.h>
#include <asm/cpu_ops.h>
+#include <asm/mmu.h>
#include <asm/mmu_context.h>

#include "cpu-reset.h"
@@ -22,8 +23,6 @@
extern const unsigned char arm64_relocate_new_kernel[];
extern const unsigned long arm64_relocate_new_kernel_size;

-static unsigned long kimage_start;
-
/**
* kexec_image_info - For debugging output.
*/
@@ -64,8 +63,6 @@ void machine_kexec_cleanup(struct kimage *kimage)
*/
int machine_kexec_prepare(struct kimage *kimage)
{
- kimage_start = kimage->start;
-
kexec_image_info(kimage);

if (kimage->type != KEXEC_TYPE_CRASH && cpus_are_stuck_in_kernel()) {
@@ -183,7 +180,7 @@ void machine_kexec(struct kimage *kimage)
kexec_list_flush(kimage);

/* Flush the new image if already in place. */
- if (kimage->head & IND_DONE)
+ if ((kimage != kexec_crash_image) && (kimage->head & IND_DONE))
kexec_segment_flush(kimage);

pr_info("Bye!\n");
@@ -201,7 +198,7 @@ void machine_kexec(struct kimage *kimage)
*/

cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
- kimage_start, 0);
+ kimage->start, 0);

BUG(); /* Should never get here. */
}
@@ -210,3 +207,29 @@ void machine_crash_shutdown(struct pt_regs *regs)
{
/* Empty routine needed to avoid build errors. */
}
+
+void arch_kexec_protect_crashkres(void)
+{
+ int i;
+
+ kexec_segment_flush(kexec_crash_image);
+
+ for (i = 0; i < kexec_crash_image->nr_segments; i++)
+ create_pgd_mapping(&init_mm, kexec_crash_image->segment[i].mem,
+ __phys_to_virt(kexec_crash_image->segment[i].mem),
+ kexec_crash_image->segment[i].memsz,
+ PAGE_KERNEL_INVALID, true);
+
+
+ flush_tlb_all();
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+ int i;
+
+ for (i = 0; i < kexec_crash_image->nr_segments; i++)
+ create_pgd_mapping(&init_mm, kexec_crash_image->segment[i].mem,
+ __phys_to_virt(kexec_crash_image->segment[i].mem),
+ kexec_crash_image->segment[i].memsz, PAGE_KERNEL, true);
+}
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index cb359a3927ef..51aca31cd9b7 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,8 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/ioport.h>
+#include <linux/kexec.h>
#include <linux/libfdt.h>
#include <linux/mman.h>
#include <linux/nodemask.h>
@@ -337,56 +339,31 @@ static void create_mapping_late(phys_addr_t phys, unsigned long virt,
NULL, debug_pagealloc_enabled());
}

-static void __init __map_memblock(pgd_t *pgd, phys_addr_t start, phys_addr_t end)
+static void __init __map_memblock(pgd_t *pgd, phys_addr_t start,
+ phys_addr_t end, pgprot_t prot,
+ bool page_mappings_only)
+{
+ __create_pgd_mapping(pgd, start, __phys_to_virt(start), end - start,
+ prot, early_pgtable_alloc,
+ page_mappings_only);
+}
+
+static void __init map_mem(pgd_t *pgd)
{
phys_addr_t kernel_start = __pa_symbol(_text);
phys_addr_t kernel_end = __pa_symbol(__init_begin);
+ struct memblock_region *reg;

/*
- * Take care not to create a writable alias for the
- * read-only text and rodata sections of the kernel image.
+ * Temporarily marked as NOMAP to skip mapping in the next for-loop
*/
+ memblock_mark_nomap(kernel_start, kernel_end - kernel_start);

- /* No overlap with the kernel text/rodata */
- if (end < kernel_start || start >= kernel_end) {
- __create_pgd_mapping(pgd, start, __phys_to_virt(start),
- end - start, PAGE_KERNEL,
- early_pgtable_alloc,
- debug_pagealloc_enabled());
- return;
- }
-
- /*
- * This block overlaps the kernel text/rodata mappings.
- * Map the portion(s) which don't overlap.
- */
- if (start < kernel_start)
- __create_pgd_mapping(pgd, start,
- __phys_to_virt(start),
- kernel_start - start, PAGE_KERNEL,
- early_pgtable_alloc,
- debug_pagealloc_enabled());
- if (kernel_end < end)
- __create_pgd_mapping(pgd, kernel_end,
- __phys_to_virt(kernel_end),
- end - kernel_end, PAGE_KERNEL,
- early_pgtable_alloc,
- debug_pagealloc_enabled());
-
- /*
- * Map the linear alias of the [_text, __init_begin) interval as
- * read-only/non-executable. This makes the contents of the
- * region accessible to subsystems such as hibernate, but
- * protects it from inadvertent modification or execution.
- */
- __create_pgd_mapping(pgd, kernel_start, __phys_to_virt(kernel_start),
- kernel_end - kernel_start, PAGE_KERNEL_RO,
- early_pgtable_alloc, debug_pagealloc_enabled());
-}
-
-static void __init map_mem(pgd_t *pgd)
-{
- struct memblock_region *reg;
+#ifdef CONFIG_KEXEC_CORE
+ if (crashk_res.end)
+ memblock_mark_nomap(crashk_res.start,
+ resource_size(&crashk_res));
+#endif

/* map all the memory banks */
for_each_memblock(memory, reg) {
@@ -398,8 +375,33 @@ static void __init map_mem(pgd_t *pgd)
if (memblock_is_nomap(reg))
continue;

- __map_memblock(pgd, start, end);
+ __map_memblock(pgd, start, end,
+ PAGE_KERNEL, debug_pagealloc_enabled());
+ }
+
+ /*
+ * Map the linear alias of the [_text, __init_begin) interval as
+ * read-only/non-executable. This makes the contents of the
+ * region accessible to subsystems such as hibernate, but
+ * protects it from inadvertent modification or execution.
+ */
+ __map_memblock(pgd, kernel_start, kernel_end,
+ PAGE_KERNEL_RO, debug_pagealloc_enabled());
+ memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
+
+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * Use page-level mappings here so that we can shrink the region
+ * in page granularity and put back unused memory to buddy system
+ * through /sys/kernel/kexec_crash_size interface.
+ */
+ if (crashk_res.end) {
+ __map_memblock(pgd, crashk_res.start, crashk_res.end + 1,
+ PAGE_KERNEL, true);
+ memblock_clear_nomap(crashk_res.start,
+ resource_size(&crashk_res));
}
+#endif
}

void mark_rodata_ro(void)
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:36 UTC
Since arch_kexec_protect_crashkres() removes a mapping for crash dump
kernel image, the loaded data won't be preserved around hibernation.

In this patch, helper functions, kexec_prepare_suspend()/
kexec_post_resume(), are additionally called before/after hibernation so
that the relevant memory segments will be mapped again and preserved just
as the others are.

In addition, to minimize the size of hibernation image,
kexec_is_chraskres_nosave() is added to pfn_is_nosave() in order to
recoginize only the pages that hold loaded crash dump kernel image as
saveable. Hibernation excludes any pages that are marked as Reserved and
yet "nosave."

Signed-off-by: AKASHI Takahiro <***@linaro.org>
---
arch/arm64/include/asm/kexec.h | 10 ++++++
arch/arm64/kernel/hibernate.c | 10 +++++-
arch/arm64/kernel/machine_kexec.c | 73 ++++++++++++++++++++++++++++++++++++++-
arch/arm64/mm/init.c | 27 +++++++++++++++
4 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 04744dc5fb61..b9b31fc781b9 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -43,6 +43,16 @@ static inline void crash_setup_regs(struct pt_regs *newregs,
/* Empty routine needed to avoid build errors. */
}

+#if defined(CONFIG_KEXEC_CORE) && defined(CONFIG_HIBERNATION)
+extern bool kexec_is_crashkres_nosave(unsigned long pfn);
+extern void kexec_prepare_suspend(void);
+extern void kexec_post_resume(void);
+#else
+static inline bool kexec_is_crashkres_nosave(unsigned long pfn) {return false; }
+static inline void kexec_prepare_suspend(void) {}
+static inline void kexec_post_resume(void) {}
+#endif
+
#endif /* __ASSEMBLY__ */

#endif
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 97a7384100f3..1e10fafa59bd 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -28,6 +28,7 @@
#include <asm/cacheflush.h>
#include <asm/cputype.h>
#include <asm/irqflags.h>
+#include <asm/kexec.h>
#include <asm/memory.h>
#include <asm/mmu_context.h>
#include <asm/pgalloc.h>
@@ -102,7 +103,8 @@ int pfn_is_nosave(unsigned long pfn)
unsigned long nosave_begin_pfn = sym_to_pfn(&__nosave_begin);
unsigned long nosave_end_pfn = sym_to_pfn(&__nosave_end - 1);

- return (pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn);
+ return ((pfn >= nosave_begin_pfn) && (pfn <= nosave_end_pfn)) ||
+ kexec_is_crashkres_nosave(pfn);
}

void notrace save_processor_state(void)
@@ -286,6 +288,9 @@ int swsusp_arch_suspend(void)
local_dbg_save(flags);

if (__cpu_suspend_enter(&state)) {
+ /* make the crash dump kernel image visible/saveable */
+ kexec_prepare_suspend();
+
sleep_cpu = smp_processor_id();
ret = swsusp_save();
} else {
@@ -297,6 +302,9 @@ int swsusp_arch_suspend(void)
if (el2_reset_needed())
dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end);

+ /* make the crash dump kernel image protected again */
+ kexec_post_resume();
+
/*
* Tell the hibernation core that we've just restored
* the memory
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 02e4f929db3b..82f48db589cf 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -10,6 +10,7 @@
*/

#include <linux/kexec.h>
+#include <linux/page-flags.h>
#include <linux/smp.h>

#include <asm/cacheflush.h>
@@ -220,7 +221,6 @@ void arch_kexec_protect_crashkres(void)
kexec_crash_image->segment[i].memsz,
PAGE_KERNEL_INVALID, true);

-
flush_tlb_all();
}

@@ -233,3 +233,74 @@ void arch_kexec_unprotect_crashkres(void)
__phys_to_virt(kexec_crash_image->segment[i].mem),
kexec_crash_image->segment[i].memsz, PAGE_KERNEL, true);
}
+
+#ifdef CONFIG_HIBERNATION
+/*
+ * To preserve the crash dump kernel image, the relevant memory segments
+ * should be mapped again around the hibernation.
+ */
+void kexec_prepare_suspend(void)
+{
+ if (kexec_crash_image)
+ arch_kexec_unprotect_crashkres();
+}
+
+void kexec_post_resume(void)
+{
+ if (kexec_crash_image)
+ arch_kexec_protect_crashkres();
+}
+
+/*
+ * kexec_is_crashkres_nosave
+ *
+ * Return true only if a page is part of reserved memory for crash dump kernel,
+ * but does not hold any data of loaded kernel image.
+ *
+ * Note that all the pages in crash dump kernel memory have been initially
+ * marked as Reserved in kexec_reserve_crashkres_pages().
+ *
+ * In hibernation, the pages which are Reserved and yet "nosave"
+ * are excluded from the hibernation image. kexec_is_crashkres_nosave()
+ * does this check for crash dump kernel and will reduce the total size
+ * of hibernation image.
+ */
+
+bool kexec_is_crashkres_nosave(unsigned long pfn)
+{
+ int i;
+ phys_addr_t addr;
+
+ /* in reserved memory? */
+ if (!crashk_res.end)
+ return false;
+
+ addr = __pfn_to_phys(pfn);
+ if ((addr < crashk_res.start) || (crashk_res.end < addr))
+ return false;
+
+ /* not part of loaded kernel image? */
+ if (!kexec_crash_image)
+ return true;
+
+ for (i = 0; i < kexec_crash_image->nr_segments; i++)
+ if (addr >= kexec_crash_image->segment[i].mem &&
+ addr < (kexec_crash_image->segment[i].mem +
+ kexec_crash_image->segment[i].memsz))
+ return false;
+
+ return true;
+}
+
+void crash_free_reserved_phys_range(unsigned long begin, unsigned long end)
+{
+ unsigned long addr;
+ struct page *page;
+
+ for (addr = begin; addr < end; addr += PAGE_SIZE) {
+ page = phys_to_page(addr);
+ ClearPageReserved(page);
+ free_reserved_page(page);
+ }
+}
+#endif /* CONFIG_HIBERNATION */
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 09d19207362d..89ba3cd0fe44 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -134,10 +134,35 @@ static void __init reserve_crashkernel(void)
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
}
+
+static void __init kexec_reserve_crashkres_pages(void)
+{
+#ifdef CONFIG_HIBERNATION
+ phys_addr_t addr;
+ struct page *page;
+
+ if (!crashk_res.end)
+ return;
+
+ /*
+ * To reduce the size of hibernation image, all the pages are
+ * marked as Reserved initially.
+ */
+ for (addr = crashk_res.start; addr < (crashk_res.end + 1);
+ addr += PAGE_SIZE) {
+ page = phys_to_page(addr);
+ SetPageReserved(page);
+ }
+#endif
+}
#else
static void __init reserve_crashkernel(void)
{
}
+
+static void __init kexec_reserve_crashkres_pages(void)
+{
+}
#endif /* CONFIG_KEXEC_CORE */

/*
@@ -517,6 +542,8 @@ void __init mem_init(void)
/* this will put all unused low memory onto the freelists */
free_all_bootmem();

+ kexec_reserve_crashkres_pages();
+
mem_init_print_info(NULL);

#define MLK(b, t) b, t, ((t) - (b)) >> 10
--
2.11.1
James Morse
2017-03-21 18:25:46 UTC
Hi Akashi,
Post by AKASHI Takahiro
Since arch_kexec_protect_crashkres() removes a mapping for crash dump
kernel image, the loaded data won't be preserved around hibernation.
In this patch, helper functions, kexec_prepare_suspend()/
kexec_post_resume(), are additionally called before/after hibernation so
that the relevant memory segments will be mapped again and preserved just
as the others are.
In addition, to minimize the size of hibernation image,
kexec_is_chraskres_nosave() is added to pfn_is_nosave() in order to
(crashkres)
Post by AKASHI Takahiro
recoginize only the pages that hold loaded crash dump kernel image as
(recognize)
Post by AKASHI Takahiro
saveable. Hibernation excludes any pages that are marked as Reserved and
yet "nosave."
Neat! I didn't think this would be possible without hacking kernel/power/snapshot.c.

I've given this a spin on Juno and Seattle, I even added debug_pagealloc, but
that doesn't trick it because your kexec_prepare_suspend() puts the mapping back.

Reviewed-by: James Morse <***@arm.com>


Thanks,

James
Post by AKASHI Takahiro
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 97a7384100f3..1e10fafa59bd 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -286,6 +288,9 @@ int swsusp_arch_suspend(void)
local_dbg_save(flags);
if (__cpu_suspend_enter(&state)) {
+ /* make the crash dump kernel image visible/saveable */
+ kexec_prepare_suspend();
Strictly this is kdump not kexec, but the comment makes that clear.
Post by AKASHI Takahiro
+
sleep_cpu = smp_processor_id();
ret = swsusp_save();
} else {
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 02e4f929db3b..82f48db589cf 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -220,7 +221,6 @@ void arch_kexec_protect_crashkres(void)
kexec_crash_image->segment[i].memsz,
PAGE_KERNEL_INVALID, true);
-
Stray whitespace change from a previous patch?
Post by AKASHI Takahiro
flush_tlb_all();
}
@@ -233,3 +233,74 @@ void arch_kexec_unprotect_crashkres(void)
+/*
+ * kexec_is_crashkres_nosave
+ *
+ * Return true only if a page is part of reserved memory for crash dump kernel,
+ * but does not hold any data of loaded kernel image.
+ *
+ * Note that all the pages in crash dump kernel memory have been initially
+ * marked as Reserved in kexec_reserve_crashkres_pages().
+ *
+ * In hibernation, the pages which are Reserved and yet "nosave"
+ * are excluded from the hibernation image. kexec_is_crashkres_nosave()
+ * does this check for crash dump kernel and will reduce the total size
+ * of hibernation image.
+ */
+
+bool kexec_is_crashkres_nosave(unsigned long pfn)
+{
+ int i;
+ phys_addr_t addr;
+
+ /* in reserved memory? */
Comment in the wrong place?
Post by AKASHI Takahiro
+ if (!crashk_res.end)
+ return false;
+
+ addr = __pfn_to_phys(pfn);
(makes more sense here)
Post by AKASHI Takahiro
+ if ((addr < crashk_res.start) || (crashk_res.end < addr))
+ return false;
+
+ /* not part of loaded kernel image? */
Comment in the wrong place?
Post by AKASHI Takahiro
+ if (!kexec_crash_image)
+ return true;
+
(makes more sense here)
Post by AKASHI Takahiro
+ for (i = 0; i < kexec_crash_image->nr_segments; i++)
+ if (addr >= kexec_crash_image->segment[i].mem &&
+ addr < (kexec_crash_image->segment[i].mem +
+ kexec_crash_image->segment[i].memsz))
+ return false;
+
+ return true;
+}
+
Thanks,

James
AKASHI Takahiro
2017-03-23 11:29:30 UTC
Post by James Morse
Hi Akashi,
Post by AKASHI Takahiro
Since arch_kexec_protect_crashkres() removes a mapping for crash dump
kernel image, the loaded data won't be preserved around hibernation.
In this patch, helper functions, kexec_prepare_suspend()/
kexec_post_resume(), are additionally called before/after hibernation so
that the relevant memory segments will be mapped again and preserved just
as the others are.
In addition, to minimize the size of hibernation image,
kexec_is_chraskres_nosave() is added to pfn_is_nosave() in order to
(crashkres)
Yes.
Post by James Morse
Post by AKASHI Takahiro
recoginize only the pages that hold loaded crash dump kernel image as
(recognize)
Ah, yes ...
Post by James Morse
Post by AKASHI Takahiro
saveable. Hibernation excludes any pages that are marked as Reserved and
yet "nosave."
Neat! I didn't think this would be possible without hacking kernel/power/snapshot.c.
I've given this a spin on Juno and Seattle, I even added debug_pagealloc, but
that doesn't trick it because your kexec_prepare_suspend() puts the mapping back.
Thanks,
James
Post by AKASHI Takahiro
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 97a7384100f3..1e10fafa59bd 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -286,6 +288,9 @@ int swsusp_arch_suspend(void)
local_dbg_save(flags);
if (__cpu_suspend_enter(&state)) {
+ /* make the crash dump kernel image visible/saveable */
+ kexec_prepare_suspend();
Strictly this is kdump not kexec, but the comment makes that clear.
I hesitated to use "kdump_" here as there are no functions with
such a prefix (say, even arch_kexec_protect_crashkres()).
So probably "crash_" or "crash_kexec_" would sound much better.
Post by James Morse
Post by AKASHI Takahiro
+
sleep_cpu = smp_processor_id();
ret = swsusp_save();
} else {
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 02e4f929db3b..82f48db589cf 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -220,7 +221,6 @@ void arch_kexec_protect_crashkres(void)
kexec_crash_image->segment[i].memsz,
PAGE_KERNEL_INVALID, true);
-
Stray whitespace change from a previous patch?
Oops, fix it.
Post by James Morse
Post by AKASHI Takahiro
flush_tlb_all();
}
@@ -233,3 +233,74 @@ void arch_kexec_unprotect_crashkres(void)
+/*
+ * kexec_is_crashkres_nosave
+ *
+ * Return true only if a page is part of reserved memory for crash dump kernel,
+ * but does not hold any data of loaded kernel image.
+ *
+ * Note that all the pages in crash dump kernel memory have been initially
+ * marked as Reserved in kexec_reserve_crashkres_pages().
+ *
+ * In hibernation, the pages which are Reserved and yet "nosave"
+ * are excluded from the hibernation image. kexec_is_crashkres_nosave()
+ * does this check for crash dump kernel and will reduce the total size
+ * of hibernation image.
+ */
+
+bool kexec_is_crashkres_nosave(unsigned long pfn)
+{
+ int i;
+ phys_addr_t addr;
+
+ /* in reserved memory? */
Comment in the wrong place?
This is after my deep thinking :) but
Post by James Morse
Post by AKASHI Takahiro
+ if (!crashk_res.end)
+ return false;
+
+ addr = __pfn_to_phys(pfn);
(makes more sense here)
if you think so, I will follow you.
Post by James Morse
Post by AKASHI Takahiro
+ if ((addr < crashk_res.start) || (crashk_res.end < addr))
+ return false;
+
+ /* not part of loaded kernel image? */
Comment in the wrong place?
Post by AKASHI Takahiro
+ if (!kexec_crash_image)
+ return true;
+
(makes more sense here)
ditto
Post by James Morse
Post by AKASHI Takahiro
+ for (i = 0; i < kexec_crash_image->nr_segments; i++)
+ if (addr >= kexec_crash_image->segment[i].mem &&
+ addr < (kexec_crash_image->segment[i].mem +
+ kexec_crash_image->segment[i].memsz))
+ return false;
+
+ return true;
+}
+
Thanks,
James
Thank you for your review!
-Takahiro AKASHI
AKASHI Takahiro
2017-03-15 09:59:38 UTC
In addition to common VMCOREINFO's defined in
crash_save_vmcoreinfo_init(), we need to know, for the crash utility,
- kimage_voffset
- PHYS_OFFSET
to examine the contents of a dump file (/proc/vmcore) correctly
due to the introduction of KASLR (CONFIG_RANDOMIZE_BASE) in v4.6.

- VA_BITS
is also required for makedumpfile command.

arch_crash_save_vmcoreinfo() appends them to the dump file.
More VMCOREINFO's may be added later.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/kernel/machine_kexec.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 2c1108c51d69..68b96ea13b4c 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -18,6 +18,7 @@

#include <asm/cacheflush.h>
#include <asm/cpu_ops.h>
+#include <asm/memory.h>
#include <asm/mmu.h>
#include <asm/mmu_context.h>

@@ -351,3 +352,13 @@ void crash_free_reserved_phys_range(unsigned long begin, unsigned long end)
}
}
#endif /* CONFIG_HIBERNATION */
+
+void arch_crash_save_vmcoreinfo(void)
+{
+ VMCOREINFO_NUMBER(VA_BITS);
+ /* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
+ vmcoreinfo_append_str("NUMBER(kimage_voffset)=0x%llx\n",
+ kimage_voffset);
+ vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
+ PHYS_OFFSET);
+}
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:37 UTC
Primary kernel calls machine_crash_shutdown() to shut down non-boot cpus
and save registers' status in per-cpu ELF notes before starting crash
dump kernel. See kernel_kexec().
Even if not all secondary cpus have shut down, we do kdump anyway.

As we don't have to make non-boot (crashed) cpus offline (to preserve
correct status of cpus at crash dump) before shutting down, this patch
also adds a variant of smp_send_stop().

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 42 +++++++++++++++++++++++++-
arch/arm64/include/asm/smp.h | 2 ++
arch/arm64/kernel/machine_kexec.c | 55 +++++++++++++++++++++++++++++++---
arch/arm64/kernel/smp.c | 63 +++++++++++++++++++++++++++++++++++++++
5 files changed, 158 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/hardirq.h b/arch/arm64/include/asm/hardirq.h
index 8740297dac77..1473fc2f7ab7 100644
--- a/arch/arm64/include/asm/hardirq.h
+++ b/arch/arm64/include/asm/hardirq.h
@@ -20,7 +20,7 @@
#include <linux/threads.h>
#include <asm/irq.h>

-#define NR_IPI 6
+#define NR_IPI 7

typedef struct {
unsigned int __softirq_pending;
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index b9b31fc781b9..35f1d24b323a 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -40,7 +40,47 @@
static inline void crash_setup_regs(struct pt_regs *newregs,
struct pt_regs *oldregs)
{
- /* Empty routine needed to avoid build errors. */
+ if (oldregs) {
+ memcpy(newregs, oldregs, sizeof(*newregs));
+ } else {
+ u64 tmp1, tmp2;
+
+ __asm__ __volatile__ (
+ "stp x0, x1, [%2, #16 * 0]\n"
+ "stp x2, x3, [%2, #16 * 1]\n"
+ "stp x4, x5, [%2, #16 * 2]\n"
+ "stp x6, x7, [%2, #16 * 3]\n"
+ "stp x8, x9, [%2, #16 * 4]\n"
+ "stp x10, x11, [%2, #16 * 5]\n"
+ "stp x12, x13, [%2, #16 * 6]\n"
+ "stp x14, x15, [%2, #16 * 7]\n"
+ "stp x16, x17, [%2, #16 * 8]\n"
+ "stp x18, x19, [%2, #16 * 9]\n"
+ "stp x20, x21, [%2, #16 * 10]\n"
+ "stp x22, x23, [%2, #16 * 11]\n"
+ "stp x24, x25, [%2, #16 * 12]\n"
+ "stp x26, x27, [%2, #16 * 13]\n"
+ "stp x28, x29, [%2, #16 * 14]\n"
+ "mov %0, sp\n"
+ "stp x30, %0, [%2, #16 * 15]\n"
+
+ "/* faked current PSTATE */\n"
+ "mrs %0, CurrentEL\n"
+ "mrs %1, SPSEL\n"
+ "orr %0, %0, %1\n"
+ "mrs %1, DAIF\n"
+ "orr %0, %0, %1\n"
+ "mrs %1, NZCV\n"
+ "orr %0, %0, %1\n"
+ /* pc */
+ "adr %1, 1f\n"
+ "1:\n"
+ "stp %1, %0, [%2, #16 * 16]\n"
+ : "=&r" (tmp1), "=&r" (tmp2)
+ : "r" (newregs)
+ : "memory"
+ );
+ }
}

#if defined(CONFIG_KEXEC_CORE) && defined(CONFIG_HIBERNATION)
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index d050d720a1b4..cea009f2657d 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -148,6 +148,8 @@ static inline void cpu_panic_kernel(void)
*/
bool cpus_are_stuck_in_kernel(void);

+extern void smp_send_crash_stop(void);
+
#endif /* ifndef __ASSEMBLY__ */

#endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 82f48db589cf..2c1108c51d69 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -9,6 +9,9 @@
* published by the Free Software Foundation.
*/

+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
#include <linux/kexec.h>
#include <linux/page-flags.h>
#include <linux/smp.h>
@@ -146,7 +149,8 @@ void machine_kexec(struct kimage *kimage)
/*
* New cpus may have become stuck_in_kernel after we loaded the image.
*/
- BUG_ON(cpus_are_stuck_in_kernel() || (num_online_cpus() > 1));
+ BUG_ON((cpus_are_stuck_in_kernel() || (num_online_cpus() > 1)) &&
+ !WARN_ON(kimage == kexec_crash_image));

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
@@ -198,15 +202,58 @@ void machine_kexec(struct kimage *kimage)
* relocation is complete.
*/

- cpu_soft_restart(1, reboot_code_buffer_phys, kimage->head,
- kimage->start, 0);
+ cpu_soft_restart(kimage != kexec_crash_image,
+ reboot_code_buffer_phys, kimage->head, kimage->start, 0);

BUG(); /* Should never get here. */
}

+static void machine_kexec_mask_interrupts(void)
+{
+ unsigned int i;
+ struct irq_desc *desc;
+
+ for_each_irq_desc(i, desc) {
+ struct irq_chip *chip;
+ int ret;
+
+ chip = irq_desc_get_chip(desc);
+ if (!chip)
+ continue;
+
+ /*
+ * First try to remove the active state. If this
+ * fails, try to EOI the interrupt.
+ */
+ ret = irq_set_irqchip_state(i, IRQCHIP_STATE_ACTIVE, false);
+
+ if (ret && irqd_irq_inprogress(&desc->irq_data) &&
+ chip->irq_eoi)
+ chip->irq_eoi(&desc->irq_data);
+
+ if (chip->irq_mask)
+ chip->irq_mask(&desc->irq_data);
+
+ if (chip->irq_disable && !irqd_irq_disabled(&desc->irq_data))
+ chip->irq_disable(&desc->irq_data);
+ }
+}
+
+/**
+ * machine_crash_shutdown - shutdown non-crashing cpus and save registers
+ */
void machine_crash_shutdown(struct pt_regs *regs)
{
- /* Empty routine needed to avoid build errors. */
+ local_irq_disable();
+
+ /* shutdown non-crashing cpus */
+ smp_send_crash_stop();
+
+ /* for crashing cpu */
+ crash_save_cpu(regs, smp_processor_id());
+ machine_kexec_mask_interrupts();
+
+ pr_info("Starting crashdump kernel...\n");
}

void arch_kexec_protect_crashkres(void)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index ef1caae02110..a7e2921143c4 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -39,6 +39,7 @@
#include <linux/completion.h>
#include <linux/of.h>
#include <linux/irq_work.h>
+#include <linux/kexec.h>

#include <asm/alternative.h>
#include <asm/atomic.h>
@@ -76,6 +77,7 @@ enum ipi_msg_type {
IPI_RESCHEDULE,
IPI_CALL_FUNC,
IPI_CPU_STOP,
+ IPI_CPU_CRASH_STOP,
IPI_TIMER,
IPI_IRQ_WORK,
IPI_WAKEUP
@@ -755,6 +757,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
S(IPI_RESCHEDULE, "Rescheduling interrupts"),
S(IPI_CALL_FUNC, "Function call interrupts"),
S(IPI_CPU_STOP, "CPU stop interrupts"),
+ S(IPI_CPU_CRASH_STOP, "CPU stop (for crash dump) interrupts"),
S(IPI_TIMER, "Timer broadcast interrupts"),
S(IPI_IRQ_WORK, "IRQ work interrupts"),
S(IPI_WAKEUP, "CPU wake-up interrupts"),
@@ -829,6 +832,29 @@ static void ipi_cpu_stop(unsigned int cpu)
cpu_relax();
}

+#ifdef CONFIG_KEXEC_CORE
+static atomic_t waiting_for_crash_ipi;
+#endif
+
+static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
+{
+#ifdef CONFIG_KEXEC_CORE
+ crash_save_cpu(regs, cpu);
+
+ atomic_dec(&waiting_for_crash_ipi);
+
+ local_irq_disable();
+
+#ifdef CONFIG_HOTPLUG_CPU
+ if (cpu_ops[cpu]->cpu_die)
+ cpu_ops[cpu]->cpu_die(cpu);
+#endif
+
+ /* just in case */
+ cpu_park_loop();
+#endif
+}
+
/*
* Main handler for inter-processor interrupts
*/
@@ -859,6 +885,15 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
irq_exit();
break;

+ case IPI_CPU_CRASH_STOP:
+ if (IS_ENABLED(CONFIG_KEXEC_CORE)) {
+ irq_enter();
+ ipi_cpu_crash_stop(cpu, regs);
+
+ unreachable();
+ }
+ break;
+
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
case IPI_TIMER:
irq_enter();
@@ -931,6 +966,34 @@ void smp_send_stop(void)
cpumask_pr_args(cpu_online_mask));
}

+#ifdef CONFIG_KEXEC_CORE
+void smp_send_crash_stop(void)
+{
+ cpumask_t mask;
+ unsigned long timeout;
+
+ if (num_online_cpus() == 1)
+ return;
+
+ cpumask_copy(&mask, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), &mask);
+
+ atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
+
+ pr_crit("SMP: stopping secondary CPUs\n");
+ smp_cross_call(&mask, IPI_CPU_CRASH_STOP);
+
+ /* Wait up to one second for other CPUs to stop */
+ timeout = USEC_PER_SEC;
+ while ((atomic_read(&waiting_for_crash_ipi) > 0) && timeout--)
+ udelay(1);
+
+ if (atomic_read(&waiting_for_crash_ipi) > 0)
+ pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
+ cpumask_pr_args(cpu_online_mask));
+}
+#endif
+
/*
* not supported here
*/
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:41 UTC
Add arch-specific descriptions about kdump usage on arm64 to kdump.txt.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Baoquan He <***@redhat.com>
Acked-by: Dave Young <***@redhat.com>
Acked-by: Catalin Marinas <***@arm.com>
---
Documentation/kdump/kdump.txt | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index b0eb27b956d9..615434d81108 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -18,7 +18,7 @@ memory image to a dump file on the local disk, or across the network to
a remote system.

Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
-s390x and arm architectures.
+s390x, arm and arm64 architectures.

When the system kernel boots, it reserves a small section of memory for
the dump-capture kernel. This ensures that ongoing Direct Memory Access
@@ -249,6 +249,13 @@ Dump-capture kernel config options (Arch Dependent, arm)

AUTO_ZRELADDR=y

+Dump-capture kernel config options (Arch Dependent, arm64)
+----------------------------------------------------------
+
+- Please note that kvm of the dump-capture kernel will not be enabled
+ on non-VHE systems even if it is configured. This is because the CPU
+ will not be reset to EL2 on panic.
+
Extended crashkernel syntax
===========================

@@ -305,6 +312,8 @@ Boot into System Kernel
kernel will automatically locate the crash kernel image within the
first 512MB of RAM if X is not given.

+ On arm64, use "crashkernel=Y[@X]". Note that the start address of
+ the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).

Load the Dump-capture Kernel
============================
@@ -327,6 +336,8 @@ For s390x:
- Use image or bzImage
For arm:
- Use zImage
+For arm64:
+ - Use vmlinux or Image

If you are using a uncompressed vmlinux image then use following command
to load dump-capture kernel.
@@ -370,6 +381,9 @@ For s390x:
For arm:
"1 maxcpus=1 reset_devices"

+For arm64:
+ "1 maxcpus=1 reset_devices"
+
Notes on loading the dump-capture kernel:

* By default, the ELF headers are stored in ELF64 format to support
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:01 UTC
Add memblock_cap_memory_range() which will remove all the memblock regions
except the memory range specified in the arguments. In addition, rework is
done on memblock_mem_limit_remove_map() to re-implement it using
memblock_cap_memory_range().

This function, like memblock_mem_limit_remove_map(), will not remove
memblocks with the MEMBLOCK_NOMAP attribute as they may be mapped and
accessed later as "device memory."
See the commit a571d4eb55d8 ("mm/memblock.c: add new infrastructure to
address the mem limit issue").

This function is used, in a succeeding patch in this series of arm64 kdump
support, to limit the range of usable memory, or System RAM, on the crash
dump kernel.
(Please note that "mem=" parameter is of little use for this purpose.)

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Will Deacon <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
Acked-by: Dennis Chen <***@arm.com>
Cc: linux-***@kvack.org
Cc: Andrew Morton <***@linux-foundation.org>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 44 +++++++++++++++++++++++++++++---------------
2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e82daffcfc44..4ce24a376262 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -336,6 +336,7 @@ phys_addr_t memblock_mem_size(unsigned long limit_pfn);
phys_addr_t memblock_start_of_DRAM(void);
phys_addr_t memblock_end_of_DRAM(void);
void memblock_enforce_memory_limit(phys_addr_t memory_limit);
+void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
void memblock_mem_limit_remove_map(phys_addr_t limit);
bool memblock_is_memory(phys_addr_t addr);
int memblock_is_map_memory(phys_addr_t addr);
diff --git a/mm/memblock.c b/mm/memblock.c
index 2f4ca8104ea4..b049c9b2dba8 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1543,11 +1543,37 @@ void __init memblock_enforce_memory_limit(phys_addr_t limit)
(phys_addr_t)ULLONG_MAX);
}

+void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size)
+{
+ int start_rgn, end_rgn;
+ int i, ret;
+
+ if (!size)
+ return;
+
+ ret = memblock_isolate_range(&memblock.memory, base, size,
+ &start_rgn, &end_rgn);
+ if (ret)
+ return;
+
+ /* remove all the MAP regions */
+ for (i = memblock.memory.cnt - 1; i >= end_rgn; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ for (i = start_rgn - 1; i >= 0; i--)
+ if (!memblock_is_nomap(&memblock.memory.regions[i]))
+ memblock_remove_region(&memblock.memory, i);
+
+ /* truncate the reserved regions */
+ memblock_remove_range(&memblock.reserved, 0, base);
+ memblock_remove_range(&memblock.reserved,
+ base + size, (phys_addr_t)ULLONG_MAX);
+}
+
void __init memblock_mem_limit_remove_map(phys_addr_t limit)
{
- struct memblock_type *type = &memblock.memory;
phys_addr_t max_addr;
- int i, ret, start_rgn, end_rgn;

if (!limit)
return;
@@ -1558,19 +1584,7 @@ void __init memblock_mem_limit_remove_map(phys_addr_t limit)
if (max_addr == (phys_addr_t)ULLONG_MAX)
return;

- ret = memblock_isolate_range(type, max_addr, (phys_addr_t)ULLONG_MAX,
- &start_rgn, &end_rgn);
- if (ret)
- return;
-
- /* remove all the MAP regions above the limit */
- for (i = end_rgn - 1; i >= start_rgn; i--) {
- if (!memblock_is_nomap(&type->regions[i]))
- memblock_remove_region(type, i);
- }
- /* truncate the reserved regions */
- memblock_remove_range(&memblock.reserved, max_addr,
- (phys_addr_t)ULLONG_MAX);
+ memblock_cap_memory_range(0, max_addr);
}

static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:32 UTC
Crash dump kernel uses only a limited range of available memory as System
RAM. On arm64 kdump, this memory range is advertised to the crash dump kernel
via a device-tree property under /chosen,
linux,usable-memory-range = <BASE SIZE>

The crash dump kernel reads this property at boot time and calls
memblock_cap_memory_range() to limit the usable memory, which is listed
either in the UEFI memory map table or in "memory" nodes of a device tree blob.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: Geoff Levand <***@infradead.org>
Acked-by: Catalin Marinas <***@arm.com>
Acked-by: Mark Rutland <***@arm.com>
---
arch/arm64/mm/init.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index e19e06593e37..290794b1a0f1 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -188,10 +188,45 @@ static int __init early_mem(char *p)
}
early_param("mem", early_mem);

+static int __init early_init_dt_scan_usablemem(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ struct memblock_region *usablemem = data;
+ const __be32 *reg;
+ int len;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,usable-memory-range", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+static void __init fdt_enforce_memory_region(void)
+{
+ struct memblock_region reg = {
+ .size = 0,
+ };
+
+ of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
+
+ if (reg.size)
+ memblock_cap_memory_range(reg.base, reg.size);
+}
+
void __init arm64_memblock_init(void)
{
const s64 linear_region_size = -(s64)PAGE_OFFSET;

+ /* Handle linux,usable-memory-range property */
+ fdt_enforce_memory_region();
+
/*
* Ensure that the linear region takes up exactly half of the kernel
* virtual address space. This way, we can distinguish a linear address
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:33 UTC
Permalink
The "crashkernel=" kernel parameter specifies the size (and optionally
the start address) of the system RAM to be used by the crash dump kernel.
reserve_crashkernel() will allocate and reserve that memory at boot time
of the primary kernel.

The memory range will be exposed to userspace as a resource named
"Crash kernel" in /proc/iomem.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Signed-off-by: Mark Salter <***@redhat.com>
Signed-off-by: Pratyush Anand <***@redhat.com>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/kernel/setup.c | 7 ++++-
arch/arm64/mm/init.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 42274bda0ccb..28855ec1be95 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -31,7 +31,6 @@
#include <linux/screen_info.h>
#include <linux/init.h>
#include <linux/kexec.h>
-#include <linux/crash_dump.h>
#include <linux/root_dev.h>
#include <linux/cpu.h>
#include <linux/interrupt.h>
@@ -226,6 +225,12 @@ static void __init request_standard_resources(void)
if (kernel_data.start >= res->start &&
kernel_data.end <= res->end)
request_resource(res, &kernel_data);
+#ifdef CONFIG_KEXEC_CORE
+ /* Userspace will find "Crash kernel" region in /proc/iomem. */
+ if (crashk_res.end && crashk_res.start >= res->start &&
+ crashk_res.end <= res->end)
+ request_resource(res, &crashk_res);
+#endif
}
}

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 290794b1a0f1..09d19207362d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -30,6 +30,7 @@
#include <linux/gfp.h>
#include <linux/memblock.h>
#include <linux/sort.h>
+#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/dma-mapping.h>
#include <linux/dma-contiguous.h>
@@ -37,6 +38,7 @@
#include <linux/swiotlb.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>
+#include <linux/kexec.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -77,6 +79,67 @@ static int __init early_initrd(char *p)
early_param("initrd", early_initrd);
#endif

+#ifdef CONFIG_KEXEC_CORE
+/*
+ * reserve_crashkernel() - reserves memory for crash kernel
+ *
+ * This function reserves the memory area given in the "crashkernel=" kernel
+ * command line parameter. The memory reserved is used by the dump capture
+ * kernel when the primary kernel is crashing.
+ */
+static void __init reserve_crashkernel(void)
+{
+ unsigned long long crash_base, crash_size;
+ int ret;
+
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base);
+ /* no crashkernel= or invalid value specified */
+ if (ret || !crash_size)
+ return;
+
+ crash_size = PAGE_ALIGN(crash_size);
+
+ if (crash_base == 0) {
+ /* Current arm64 boot protocol requires 2MB alignment */
+ crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
+ crash_size, SZ_2M);
+ if (crash_base == 0) {
+ pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
+ crash_size);
+ return;
+ }
+ } else {
+ /* User specifies base address explicitly. */
+ if (!memblock_is_region_memory(crash_base, crash_size)) {
+ pr_warn("cannot reserve crashkernel: region is not memory\n");
+ return;
+ }
+
+ if (memblock_is_region_reserved(crash_base, crash_size)) {
+ pr_warn("cannot reserve crashkernel: region overlaps reserved memory\n");
+ return;
+ }
+
+ if (!IS_ALIGNED(crash_base, SZ_2M)) {
+ pr_warn("cannot reserve crashkernel: base address is not 2MB aligned\n");
+ return;
+ }
+ }
+ memblock_reserve(crash_base, crash_size);
+
+ pr_info("crashkernel reserved: 0x%016llx - 0x%016llx (%lld MB)\n",
+ crash_base, crash_base + crash_size, crash_size >> 20);
+
+ crashk_res.start = crash_base;
+ crashk_res.end = crash_base + crash_size - 1;
+}
+#else
+static void __init reserve_crashkernel(void)
+{
+}
+#endif /* CONFIG_KEXEC_CORE */
+
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -332,6 +395,9 @@ void __init arm64_memblock_init(void)
arm64_dma_phys_limit = max_zone_dma_phys();
else
arm64_dma_phys_limit = PHYS_MASK + 1;
+
+ reserve_crashkernel();
+
dma_contiguous_reserve(arm64_dma_phys_limit);

memblock_allow_resize();
--
2.11.1
AKASHI Takahiro
2017-03-17 11:31:18 UTC
Permalink
+pr_info("crashkernel reserved: 0x%016llx - 0x%016llx (%lld MB)\n",
+crash_base, crash_base + crash_size, crash_size >> 20);
There's a typo there — it says MB but you mean MiB.
Unless you meant crash_size / 1000000 and not crash_size >> 20?
Yes and no.
This notation is consistent with other places like mem_init()
in mm/init.c.

Thanks,
-Takahiro AKASHI
David Woodhouse
2017-03-17 11:32:35 UTC
Permalink
Post by AKASHI Takahiro
Yes and no.
This notation is consistent with other places like mem_init()
in mm/init.c.
Well, perhaps we should fix those too then. But we certainly shouldn't
add *more* errors.
AKASHI Takahiro
2017-03-15 10:01:19 UTC
Permalink
From: Sameer Goel <***@codeaurora.org>

In cases where a device tree is not provided (i.e. an ACPI-based system), an
empty fdt is generated by efistub. #address-cells and #size-cells are not
set in the empty fdt, so they default to 1 (4 bytes wide). This can be an
issue on 64-bit systems where values representing addresses, etc. may be
8 bytes wide, as the default value does not align with the general
requirements for an empty DTB, and is fragile when passed to other agents,
as extra care is required to read the entire width of a value.

This issue is observed on Qualcomm Technologies QDF24XX platforms when
kexec-tools inserts 64-bit addresses into the "linux,elfcorehdr" and
"linux,usable-memory-range" properties of the fdt. When the values are
later consumed, they are truncated to 32 bits.

Setting #address-cells and #size-cells to 2 at creation of the empty fdt
resolves the observed issue, and makes the fdt less fragile.

Signed-off-by: Sameer Goel <***@codeaurora.org>
Signed-off-by: Jeffrey Hugo <***@codeaurora.org>
---
drivers/firmware/efi/libstub/fdt.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/libstub/fdt.c b/drivers/firmware/efi/libstub/fdt.c
index 260c4b4b492e..82973b86efe4 100644
--- a/drivers/firmware/efi/libstub/fdt.c
+++ b/drivers/firmware/efi/libstub/fdt.c
@@ -16,6 +16,22 @@

#include "efistub.h"

+#define EFI_DT_ADDR_CELLS_DEFAULT 2
+#define EFI_DT_SIZE_CELLS_DEFAULT 2
+
+static void fdt_update_cell_size(efi_system_table_t *sys_table, void *fdt)
+{
+ int offset;
+
+ offset = fdt_path_offset(fdt, "/");
+ /* Set the #address-cells and #size-cells values for an empty tree */
+
+ fdt_setprop_u32(fdt, offset, "#address-cells",
+ EFI_DT_ADDR_CELLS_DEFAULT);
+
+ fdt_setprop_u32(fdt, offset, "#size-cells", EFI_DT_SIZE_CELLS_DEFAULT);
+}
+
static efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
unsigned long orig_fdt_size,
void *fdt, int new_fdt_size, char *cmdline_ptr,
@@ -42,10 +58,18 @@ static efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
}
}

- if (orig_fdt)
+ if (orig_fdt) {
status = fdt_open_into(orig_fdt, fdt, new_fdt_size);
- else
+ } else {
status = fdt_create_empty_tree(fdt, new_fdt_size);
+ if (status == 0) {
+ /*
+ * Any failure from the following function is
+ * non-critical.
+ */
+ fdt_update_cell_size(sys_table, fdt);
+ }
+ }

if (status != 0)
goto fdt_set_fail;
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:40 UTC
Permalink
Kdump is enabled by default, as kexec is.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/configs/defconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 7c48028ec64a..927ee18bbdf2 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -82,6 +82,7 @@ CONFIG_CMA=y
CONFIG_SECCOMP=y
CONFIG_XEN=y
CONFIG_KEXEC=y
+CONFIG_CRASH_DUMP=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_COMPAT=y
CONFIG_CPU_IDLE=y
--
2.11.1
AKASHI Takahiro
2017-03-15 10:00:49 UTC
Permalink
From: James Morse <***@arm.com>

Add documentation for the DT properties:
linux,usable-memory-range
linux,elfcorehdr
used by arm64 kdump. Those are, respectively, a usable memory range
allocated to the crash dump kernel and the elfcorehdr's location within it.

Signed-off-by: James Morse <***@arm.com>
[***@linaro.org: update the text due to recent changes ]
Signed-off-by: AKASHI Takahiro <***@linaro.org>
Acked-by: Mark Rutland <***@arm.com>
Cc: ***@vger.kernel.org
Cc: Rob Herring <robh+***@kernel.org>
---
Documentation/devicetree/bindings/chosen.txt | 45 ++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)

diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt
index 6ae9d82d4c37..b5e39af4ddc0 100644
--- a/Documentation/devicetree/bindings/chosen.txt
+++ b/Documentation/devicetree/bindings/chosen.txt
@@ -52,3 +52,48 @@ This property is set (currently only on PowerPC, and only needed on
book3e) by some versions of kexec-tools to tell the new kernel that it
is being booted by kexec, as the booting environment may differ (e.g.
a different secondary CPU release mechanism)
+
+linux,usable-memory-range
+-------------------------
+
+This property (arm64 only) holds a base address and size, describing a
+limited region in which memory may be considered available for use by
+the kernel. Memory outside of this range is not available for use.
+
+This property describes a limitation: memory within this range is only
+valid when also described through another mechanism that the kernel
+would otherwise use to determine available memory (e.g. memory nodes
+or the EFI memory map). Valid memory may be sparse within the range.
+e.g.
+
+/ {
+ chosen {
+ linux,usable-memory-range = <0x9 0xf0000000 0x0 0x10000000>;
+ };
+};
+
+The main use is for a crash dump kernel to identify its own usable
+memory and exclude, at boot time, any other memory areas that are
+part of the panicked kernel's memory.
+
+While this property does not represent real hardware, the address
+and the size are expressed in #address-cells and #size-cells,
+respectively, of the root node.
+
+linux,elfcorehdr
+----------------
+
+This property (currently used only on arm64) holds the memory range,
+the address and the size, of the elf core header which mainly describes
+the panicked kernel's memory layout as PT_LOAD segments of elf format.
+e.g.
+
+/ {
+ chosen {
+ linux,elfcorehdr = <0x9 0xfffff000 0x0 0x800>;
+ };
+};
+
+While this property does not represent real hardware, the address
+and the size are expressed in #address-cells and #size-cells,
+respectively, of the root node.
--
2.11.1
AKASHI Takahiro
2017-03-15 09:59:39 UTC
Permalink
Arch-specific functions are added to allow for implementing a crash dump
file interface, /proc/vmcore, which can be viewed as an ELF file.

A user space tool, like kexec-tools, is responsible for allocating
a separate region for the core's ELF header within the crash dump kernel's
memory and filling it in when executing kexec_load().

Then, its location will be advertised to the crash dump kernel via a new
device-tree property, "linux,elfcorehdr", and the crash dump kernel preserves
the region for later use with reserve_elfcorehdr() at boot time.

On the crash dump kernel, /proc/vmcore will access the primary kernel's
memory with copy_oldmem_page(), which feeds the data page by page via
memremap(), since the old memory does not reside in the linear mapping on
the crash dump kernel.

Meanwhile, elfcorehdr_read() is simple, as the region is always mapped.

Signed-off-by: AKASHI Takahiro <***@linaro.org>
Reviewed-by: James Morse <***@arm.com>
Acked-by: Catalin Marinas <***@arm.com>
---
arch/arm64/Kconfig | 11 +++++++
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 ++++++++++++++++++++++++++++++++++++++++++
arch/arm64/mm/init.c | 53 +++++++++++++++++++++++++++++++
4 files changed, 136 insertions(+)
create mode 100644 arch/arm64/kernel/crash_dump.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8c7c244247b6..f23417fcfc8a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -736,6 +736,17 @@ config KEXEC
but it is independent of the system firmware. And like a reboot
you can start any kernel with it, not just Linux.

+config CRASH_DUMP
+ bool "Build kdump crash kernel"
+ help
+ Generate crash dump after being started by kexec. This should
+ normally only be set in special crash dump kernels which are
+ loaded in the main kernel with kexec-tools into a specially
+ reserved region and then later executed after a crash by
+ kdump/kexec.
+
+ For more details see Documentation/kdump/kdump.txt
+
config XEN_DOM0
def_bool y
depends on XEN
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 1606c6b2a280..0fe44a7513ff 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -50,6 +50,7 @@ arm64-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
arm64-obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
arm64-obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o \
cpu-reset.o
+arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o

obj-y += $(arm64-obj-y) vdso/ probes/
obj-m += $(arm64-obj-m)
diff --git a/arch/arm64/kernel/crash_dump.c b/arch/arm64/kernel/crash_dump.c
new file mode 100644
index 000000000000..f46d57c31443
--- /dev/null
+++ b/arch/arm64/kernel/crash_dump.c
@@ -0,0 +1,71 @@
+/*
+ * Routines for doing kexec-based kdump
+ *
+ * Copyright (C) 2017 Linaro Limited
+ * Author: AKASHI Takahiro <***@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/crash_dump.h>
+#include <linux/errno.h>
+#include <linux/io.h>
+#include <linux/memblock.h>
+#include <linux/uaccess.h>
+#include <asm/memory.h>
+
+/**
+ * copy_oldmem_page() - copy one page from old kernel memory
+ * @pfn: page frame number to be copied
+ * @buf: buffer where the copied page is placed
+ * @csize: number of bytes to copy
+ * @offset: offset in bytes into the page
+ * @userbuf: if set, @buf is in a user address space
+ *
+ * This function copies one page from old kernel memory into buffer pointed by
+ * @buf. If @buf is in userspace, set @userbuf to %1. Returns number of bytes
+ * copied or negative error in case of failure.
+ */
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
+ size_t csize, unsigned long offset,
+ int userbuf)
+{
+ void *vaddr;
+
+ if (!csize)
+ return 0;
+
+ vaddr = memremap(__pfn_to_phys(pfn), PAGE_SIZE, MEMREMAP_WB);
+ if (!vaddr)
+ return -ENOMEM;
+
+ if (userbuf) {
+ if (copy_to_user((char __user *)buf, vaddr + offset, csize)) {
+ memunmap(vaddr);
+ return -EFAULT;
+ }
+ } else {
+ memcpy(buf, vaddr + offset, csize);
+ }
+
+ memunmap(vaddr);
+
+ return csize;
+}
+
+/**
+ * elfcorehdr_read - read from ELF core header
+ * @buf: buffer where the data is placed
+ * @count: number of bytes to read
+ * @ppos: address in the memory
+ *
+ * This function reads @count bytes from the elf core header, which
+ * resides in the crash dump kernel's memory.
+ */
+ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+ memcpy(buf, phys_to_virt((phys_addr_t)*ppos), count);
+ return count;
+}
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 89ba3cd0fe44..5960bef0170d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -39,6 +39,7 @@
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/kexec.h>
+#include <linux/crash_dump.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -165,6 +166,56 @@ static void __init kexec_reserve_crashkres_pages(void)
}
#endif /* CONFIG_KEXEC_CORE */

+#ifdef CONFIG_CRASH_DUMP
+static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
+ const char *uname, int depth, void *data)
+{
+ const __be32 *reg;
+ int len;
+
+ if (depth != 1 || strcmp(uname, "chosen") != 0)
+ return 0;
+
+ reg = of_get_flat_dt_prop(node, "linux,elfcorehdr", &len);
+ if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
+ return 1;
+
+ elfcorehdr_addr = dt_mem_next_cell(dt_root_addr_cells, &reg);
+ elfcorehdr_size = dt_mem_next_cell(dt_root_size_cells, &reg);
+
+ return 1;
+}
+
+/*
+ * reserve_elfcorehdr() - reserves memory for elf core header
+ *
+ * This function reserves the memory occupied by an elf core header
+ * described in the device tree. This region contains all the
+ * information about the primary kernel's core image and is used by a
+ * dump capture kernel to access the system memory of the primary kernel.
+ */
+static void __init reserve_elfcorehdr(void)
+{
+ of_scan_flat_dt(early_init_dt_scan_elfcorehdr, NULL);
+
+ if (!elfcorehdr_size)
+ return;
+
+ if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size)) {
+ pr_warn("elfcorehdr is overlapped\n");
+ return;
+ }
+
+ memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
+
+ pr_info("Reserving %lldKB of memory at 0x%llx for elfcorehdr\n",
+ elfcorehdr_size >> 10, elfcorehdr_addr);
+}
+#else
+static void __init reserve_elfcorehdr(void)
+{
+}
+#endif /* CONFIG_CRASH_DUMP */
/*
* Return the maximum physical address for ZONE_DMA (DMA_BIT_MASK(32)). It
* currently assumes that for memory starting above 4G, 32-bit devices will
@@ -423,6 +474,8 @@ void __init arm64_memblock_init(void)

reserve_crashkernel();

+ reserve_elfcorehdr();
+
dma_contiguous_reserve(arm64_dma_phys_limit);

memblock_allow_resize();
--
2.11.1
David Woodhouse
2017-03-15 11:41:18 UTC
Permalink
Post by AKASHI Takahiro
This patch series adds kdump support on arm64.
To load a crash-dump kernel to the systems, a series of patches to
kexec-tools[1] are also needed. Please use the latest one, v6 [2].
   https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
   https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump
To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
  - crash utility (v7.1.8 or later) [3]
I tested this patchset on fast model and hikey.
Please build with CONFIG_DEBUG_SECTION_MISMATCH=y.

It might be because I'm using your v32 patchset, hacked late at night
to apply to a 4.9 kernel, but I see this:

WARNING: vmlinux.o(.text+0x10240): Section mismatch in reference from the function arch_kexec_protect_crashkres() to the function .init.text:create_pgd_mapping()
The function arch_kexec_protect_crashkres() references
the function __init create_pgd_mapping().
This is often because arch_kexec_protect_crashkres lacks a __init 
annotation or the annotation of create_pgd_mapping is wrong.

WARNING: vmlinux.o(.text+0x102b0): Section mismatch in reference from the function arch_kexec_unprotect_crashkres() to the function .init.text:create_pgd_mapping()
The function arch_kexec_unprotect_crashkres() references
the function __init create_pgd_mapping().
This is often because arch_kexec_unprotect_crashkres lacks a __init 
annotation or the annotation of create_pgd_mapping is wrong.
AKASHI Takahiro
2017-03-16 00:23:48 UTC
Permalink
Post by David Woodhouse
Post by AKASHI Takahiro
This patch series adds kdump support on arm64.
To load a crash-dump kernel to the systems, a series of patches to
kexec-tools[1] are also needed. Please use the latest one, v6 [2].
   https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
   https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump
To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
  - crash utility (v7.1.8 or later) [3]
I tested this patchset on fast model and hikey.
Please build with CONFIG_DEBUG_SECTION_MISMATCH=y.
It might be because I'm using your v32 patchset, hacked late at night
WARNING: vmlinux.o(.text+0x10240): Section mismatch in reference from the function arch_kexec_protect_crashkres() to the function .init.text:create_pgd_mapping()
The function arch_kexec_protect_crashkres() references
the function __init create_pgd_mapping().
This is often because arch_kexec_protect_crashkres lacks a __init 
annotation or the annotation of create_pgd_mapping is wrong.
WARNING: vmlinux.o(.text+0x102b0): Section mismatch in reference from the function arch_kexec_unprotect_crashkres() to the function .init.text:create_pgd_mapping()
The function arch_kexec_unprotect_crashkres() references
the function __init create_pgd_mapping().
This is often because arch_kexec_unprotect_crashkres lacks a __init 
annotation or the annotation of create_pgd_mapping is wrong.
I double-checked but saw no such warnings for either v32 or v33.
I'm afraid that you might have done something wrong in backporting,
particularly patch#5.

Thanks,
-Takahiro AKASHI
David Woodhouse
2017-03-16 10:29:45 UTC
Permalink
Post by AKASHI Takahiro
I double-checked but saw no warnings like these neither for v32 nor v33.
I'm afraid that you might have done something wrong in backporting,
particularly patch#5.
Then I apologise for the noise. Thanks for checking.
David Woodhouse
2017-03-17 14:02:53 UTC
Permalink
Is this one going to be my fault too?
Looks like it isn't my fault. In ipi_cpu_crash_stop() we don't modify
the online mask. Which is reasonable enough if we want to preserve its
original contents from before the crash, but it does make that
WARN_ON() in machine_kexec() a false positive.

Btw, why is this a normal IPI and not something... less maskable?
On x86 we use NMI for that...
Mark Rutland
2017-03-17 15:04:12 UTC
Permalink
Post by David Woodhouse
Is this one going to be be my fault too?
Looks like it isn't my fault. In ipi_cpu_crash_stop() we don't modify
the online mask. Which is reasonable enough if we want to preserve its
original contents from before the crash, but it does make that
WARN_ON() in machine_kexec() a false positive.
Btw, why is this a normal IPI and not something... less maskable?
On x86 we use NMI for that...
Architecturally, arm64 does not have an NMI.

There's been some work to try to get pseudo-NMIs using GICv3 priorities,
but that's about the closest we can get today.

Thanks,
Mark.
Mark Rutland
2017-03-17 15:33:58 UTC
Permalink
Post by David Woodhouse
Is this one going to be be my fault too?
Looks like it isn't my fault. In ipi_cpu_crash_stop() we don't modify
the online mask. Which is reasonable enough if we want to preserve its
original contents from before the crash, but it does make that
WARN_ON() in machine_kexec() a false positive.
I'd say it's not so much a false positive, but rather an uninformative
message.

Some warning here is completely appropriate. Even if the CPUs are
stopped in the kernel, there are a number of cases where the HW can
corrupt system state in the background.

We can certainly log a better message, e.g.

bool kdump = (image == kexec_crash_image);
bool stuck_cpus = cpus_are_stuck_in_kernel() ||
num_online_cpus() > 1;

BUG_ON(stuck_cpus && !kdump);
WARN(stuck_cpus, "Unable to offline CPUs, kdump will be unreliable.\n");

Thanks,
Mark.
Mark Rutland
2017-03-17 16:24:21 UTC
Permalink
No, in this case the CPUs *were* offlined correctly, or at least "as
designed", by smp_send_crash_stop(). And if that hadn't worked, as
verified by *its* synchronisation method based on the atomic_t
if (atomic_read(&waiting_for_crash_ipi) > 0)
pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
   cpumask_pr_args(cpu_online_mask));
It's just that smp_send_crash_stop() (or more specifically
ipi_cpu_crash_stop()) doesn't touch the online cpu mask. Unlike the
ARM32 equivalent function machine_crash_nonpanic_core(), which does.
It wasn't clear if that was *intentional*, to allow the original
contents of the online mask before the crash to be seen in the
resulting vmcore... or purely an accident. 
Looking at this, there's a larger mess.

The waiting_for_crash_ipi dance only tells us if CPUs have taken the
IPI, not whether they've been offlined (i.e. actually left the kernel).
We need something closer to the usual cpu_{disable,die,kill} dance,
clearing online as appropriate.

If CPUs haven't left the kernel, we still need to warn about that.
FWIW if I trigger a crash on CPU 1 my kdump (still 4.9.8+v32) doesn't work.
I end up booting the kdump kernel on CPU#1 and then it gets distinctly unhappy...
[    0.000000] Booting Linux on physical CPU 0x1
...
[    0.017125] Detected PIPT I-cache on CPU1
[    0.017138] GICv3: CPU1: found redistributor 0 region 0:0x00000000f0280000
[    0.017147] CPU1: Booted secondary processor [411fd073]
[    0.017339] Detected PIPT I-cache on CPU2
[    0.017347] GICv3: CPU2: found redistributor 2 region 0:0x00000000f02c0000
[    0.017354] CPU2: Booted secondary processor [411fd073]
[    0.017537] Detected PIPT I-cache on CPU3
[    0.017545] GICv3: CPU3: found redistributor 3 region 0:0x00000000f02e0000
[    0.017551] CPU3: Booted secondary processor [411fd073]
[    0.017576] Brought up 4 CPUs
[    0.017587] SMP: Total of 4 processors activated.
...
[   31.751299]  1-...: (30 GPs behind) idle=c90/0/0 softirq=0/0 fqs=0 
[   31.757557]  2-...: (30 GPs behind) idle=608/0/0 softirq=0/0 fqs=0 
[   31.763814]  3-...: (30 GPs behind) idle=604/0/0 softirq=0/0 fqs=0 
[   31.770069]  (detected by 0, t=5252 jiffies, g=-270, c=-271, q=0)
[   31.779381] swapper/1       R  running task        0     0      1 0x00000080
[   31.789666] swapper/2       R  running task        0     0      1 0x00000080
[   31.799945] swapper/3       R  running task        0     0      1 0x00000080
Is some of that platform-specific?
That sounds like timer interrupts aren't being taken.

Given that the CPUs have come up, my suspicion would be that the GIC's
been left in some odd state, that the kdump kernel hasn't managed to
recover from.

Marc may have an idea.

Thanks,
Mark.
Marc Zyngier
2017-03-17 16:59:28 UTC
Permalink
Post by Mark Rutland
No, in this case the CPUs *were* offlined correctly, or at least "as
designed", by smp_send_crash_stop(). And if that hadn't worked, as
verified by *its* synchronisation method based on the atomic_t
if (atomic_read(&waiting_for_crash_ipi) > 0)
pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
cpumask_pr_args(cpu_online_mask));
It's just that smp_send_crash_stop() (or more specifically
ipi_cpu_crash_stop()) doesn't touch the online cpu mask. Unlike the
ARM32 equivalent function machine_crash_nonpanic_core(), which does.
It wasn't clear if that was *intentional*, to allow the original
contents of the online mask before the crash to be seen in the
resulting vmcore... or purely an accident.
Looking at this, there's a larger mess.
The waiting_for_crash_ipi dance only tells us if CPUs have taken the
IPI, not whether they've been offlined (i.e. actually left the kernel).
We need something closer to the usual cpu_{disable,die,kill} dance,
clearing online as appropriate.
If CPUs haven't left the kernel, we still need to warn about that.
FWIW if I trigger a crash on CPU 1 my kdump (still 4.9.8+v32) doesn't work.
I end up booting the kdump kernel on CPU#1 and then it gets distinctly unhappy...
[ 0.000000] Booting Linux on physical CPU 0x1
...
[ 0.017125] Detected PIPT I-cache on CPU1
[ 0.017138] GICv3: CPU1: found redistributor 0 region 0:0x00000000f0280000
[ 0.017147] CPU1: Booted secondary processor [411fd073]
[ 0.017339] Detected PIPT I-cache on CPU2
[ 0.017347] GICv3: CPU2: found redistributor 2 region 0:0x00000000f02c0000
[ 0.017354] CPU2: Booted secondary processor [411fd073]
[ 0.017537] Detected PIPT I-cache on CPU3
[ 0.017545] GICv3: CPU3: found redistributor 3 region 0:0x00000000f02e0000
[ 0.017551] CPU3: Booted secondary processor [411fd073]
[ 0.017576] Brought up 4 CPUs
[ 0.017587] SMP: Total of 4 processors activated.
...
[ 31.751299] 1-...: (30 GPs behind) idle=c90/0/0 softirq=0/0 fqs=0
[ 31.757557] 2-...: (30 GPs behind) idle=608/0/0 softirq=0/0 fqs=0
[ 31.763814] 3-...: (30 GPs behind) idle=604/0/0 softirq=0/0 fqs=0
[ 31.770069] (detected by 0, t=5252 jiffies, g=-270, c=-271, q=0)
[ 31.779381] swapper/1 R running task 0 0 1 0x00000080
[ 31.789666] swapper/2 R running task 0 0 1 0x00000080
[ 31.799945] swapper/3 R running task 0 0 1 0x00000080
Is some of that platform-specific?
That sounds like timer interrupts aren't being taken.
Given that the CPUs have come up, my suspicion would be that the GIC's
been left in some odd state, that the kdump kernel hasn't managed to
recover from.
Marc may have an idea.
I thought kdump was UP only? Anyway, this doesn't look too good.

It would be interesting to find out whether we're still taking
interrupts. Also, being able to reproduce this on mainline would be useful.

I wonder if we don't have a bug when booting on something other than
CPU#0, possibly on a GICv3 platform... I'll give it a go.

Thanks,

M.
--
Jazz is not dead. It just smells funny...
Marc Zyngier
2017-03-17 17:10:26 UTC
Permalink
Post by Marc Zyngier
Post by Mark Rutland
No, in this case the CPUs *were* offlined correctly, or at least "as
designed", by smp_send_crash_stop(). And if that hadn't worked, as
verified by *its* synchronisation method based on the atomic_t
if (atomic_read(&waiting_for_crash_ipi) > 0)
pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
cpumask_pr_args(cpu_online_mask));
It's just that smp_send_crash_stop() (or more specifically
ipi_cpu_crash_stop()) doesn't touch the online cpu mask. Unlike the
ARM32 equivalent function machine_crash_nonpanic_core(), which does.
It wasn't clear if that was *intentional*, to allow the original
contents of the online mask before the crash to be seen in the
resulting vmcore... or purely an accident.
Looking at this, there's a larger mess.
The waiting_for_crash_ipi dance only tells us if CPUs have taken the
IPI, not whether they've been offlined (i.e. actually left the kernel).
We need something closer to the usual cpu_{disable,die,kill} dance,
clearing online as appropriate.
If CPUs haven't left the kernel, we still need to warn about that.
FWIW if I trigger a crash on CPU 1 my kdump (still 4.9.8+v32) doesn't work.
I end up booting the kdump kernel on CPU#1 and then it gets distinctly unhappy...
[ 0.000000] Booting Linux on physical CPU 0x1
...
[ 0.017125] Detected PIPT I-cache on CPU1
[ 0.017138] GICv3: CPU1: found redistributor 0 region 0:0x00000000f0280000
[ 0.017147] CPU1: Booted secondary processor [411fd073]
[ 0.017339] Detected PIPT I-cache on CPU2
[ 0.017347] GICv3: CPU2: found redistributor 2 region 0:0x00000000f02c0000
[ 0.017354] CPU2: Booted secondary processor [411fd073]
[ 0.017537] Detected PIPT I-cache on CPU3
[ 0.017545] GICv3: CPU3: found redistributor 3 region 0:0x00000000f02e0000
[ 0.017551] CPU3: Booted secondary processor [411fd073]
[ 0.017576] Brought up 4 CPUs
[ 0.017587] SMP: Total of 4 processors activated.
...
[ 31.751299] 1-...: (30 GPs behind) idle=c90/0/0 softirq=0/0 fqs=0
[ 31.757557] 2-...: (30 GPs behind) idle=608/0/0 softirq=0/0 fqs=0
[ 31.763814] 3-...: (30 GPs behind) idle=604/0/0 softirq=0/0 fqs=0
[ 31.770069] (detected by 0, t=5252 jiffies, g=-270, c=-271, q=0)
[ 31.779381] swapper/1 R running task 0 0 1 0x00000080
[ 31.789666] swapper/2 R running task 0 0 1 0x00000080
[ 31.799945] swapper/3 R running task 0 0 1 0x00000080
Is some of that platform-specific?
That sounds like timer interrupts aren't being taken.
Given that the CPUs have come up, my suspicion would be that the GIC's
been left in some odd state, that the kdump kernel hasn't managed to
recover from.
Marc may have an idea.
I thought kdump was UP only? Anyway, this doesn't look too good.
It would be interesting to find out whether we're still taking
interrupts. Also, being able to reproduce this on mainline would be useful.
I wonder if we don't have a bug when booting on something other than
CPU#0, possibly on a GICv3 platform... I'll give it a go.
Went ahead and tried a couple of kexecs with various CPUs disabled in
order to force kexec not to boot on CPU#0, and the VM did boot just fine.

So I'd really appreciate a mainline reproducer.

Thanks,

M.
--
Jazz is not dead. It just smells funny...
David Woodhouse
2017-03-17 20:03:11 UTC
Permalink
Post by Marc Zyngier
Post by Marc Zyngier
Post by Mark Rutland
FWIW if I trigger a crash on CPU 1 my kdump (still 4.9.8+v32) doesn't work.
I end up booting the kdump kernel on CPU#1 and then it gets distinctly unhappy...
[    0.000000] Booting Linux on physical CPU 0x1
...
[    0.017125] Detected PIPT I-cache on CPU1
[    0.017138] GICv3: CPU1: found redistributor 0 region 0:0x00000000f0280000
[    0.017147] CPU1: Booted secondary processor [411fd073]
[    0.017339] Detected PIPT I-cache on CPU2
[    0.017347] GICv3: CPU2: found redistributor 2 region 0:0x00000000f02c0000
[    0.017354] CPU2: Booted secondary processor [411fd073]
[    0.017537] Detected PIPT I-cache on CPU3
[    0.017545] GICv3: CPU3: found redistributor 3 region 0:0x00000000f02e0000
[    0.017551] CPU3: Booted secondary processor [411fd073]
[    0.017576] Brought up 4 CPUs
[    0.017587] SMP: Total of 4 processors activated.
...
[   31.751299]  1-...: (30 GPs behind) idle=c90/0/0 softirq=0/0 fqs=0 
[   31.757557]  2-...: (30 GPs behind) idle=608/0/0 softirq=0/0 fqs=0 
[   31.763814]  3-...: (30 GPs behind) idle=604/0/0 softirq=0/0 fqs=0 
[   31.770069]  (detected by 0, t=5252 jiffies, g=-270, c=-271, q=0)
[   31.779381] swapper/1       R  running task        0     0      1 0x00000080
[   31.789666] swapper/2       R  running task        0     0      1 0x00000080
[   31.799945] swapper/3       R  running task        0     0      1 0x00000080
Is some of that platform-specific?
That sounds like timer interrupts aren't being taken.
Given that the CPUs have come up, my suspicion would be that the GIC's
been left in some odd state, that the kdump kernel hasn't managed to
recover from.
Marc may have an idea.
I thought kdump was UP only? Anyway, this doesn't look too good.
It would be interesting to find out whether we're still taking
interrupts. Also, being able to reproduce this on mainline would be useful.
I wonder if we don't have a bug when booting on something other than
CPU#0, possibly on a GICv3 platform... I'll give it a go.
Went ahead and tried a couple of kexecs with various CPUs disabled in
order to force kexec not to boot on CPU#0, and the VM did boot just fine.
So I'd really appreciate a mainline reproducer.
I booted an up-to-date 4.11-rc2 kernel with the v33 patch set. I cannot
reproduce.

But then again, I can't reproduce it on 4.9 *either* any more. And that
is precisely the same kernel image I uploaded earlier. So it appears to
be sporadic, and just *happened* to hit me the first time I tried...
which is probably just as well or I'd never have tried that again :)

I'll keep trying.
AKASHI Takahiro
2017-03-21 07:34:53 UTC
Permalink
Post by Mark Rutland
No, in this case the CPUs *were* offlined correctly, or at least "as
designed", by smp_send_crash_stop(). And if that hadn't worked, as
verified by *its* synchronisation method based on the atomic_t
if (atomic_read(&waiting_for_crash_ipi) > 0)
pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
   cpumask_pr_args(cpu_online_mask));
It's just that smp_send_crash_stop() (or more specifically
ipi_cpu_crash_stop()) doesn't touch the online cpu mask. Unlike the
ARM32 equivalent function machine_crash_nonpanic_core(), which does.
It wasn't clear if that was *intentional*, to allow the original
contents of the online mask before the crash to be seen in the
resulting vmcore... or purely an accident. 
Yes, it is intentional. I removed 'offline' code in my v14 (2016/3/4).
As you assumed, I'd expect 'online' status of all CPUs to be kept
unchanged in the core dump.

If you agree, I would like to modify this disputed warning code as follows:

===8<===
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index cea009f2657d..55f08c5acfad 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -149,6 +149,7 @@ static inline void cpu_panic_kernel(void)
bool cpus_are_stuck_in_kernel(void);

extern void smp_send_crash_stop(void);
+extern bool smp_crash_stop_failed(void);

#endif /* ifndef __ASSEMBLY__ */

diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 68b96ea13b4c..29e1cf8cca95 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -146,12 +146,15 @@ void machine_kexec(struct kimage *kimage)
{
phys_addr_t reboot_code_buffer_phys;
void *reboot_code_buffer;
+ bool in_kexec_crash = (kimage == kexec_crash_image);
+ bool stuck_cpus = cpus_are_stuck_in_kernel();

/*
* New cpus may have become stuck_in_kernel after we loaded the image.
*/
- BUG_ON((cpus_are_stuck_in_kernel() || (num_online_cpus() > 1)) &&
- !WARN_ON(kimage == kexec_crash_image));
+ BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1)));
+ WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
+ "Some CPUs may be stale, kdump will be unreliable.\n");

reboot_code_buffer_phys = page_to_phys(kimage->control_code_page);
reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index a7e2921143c4..8016914591d2 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -833,7 +833,7 @@ static void ipi_cpu_stop(unsigned int cpu)
}

#ifdef CONFIG_KEXEC_CORE
-static atomic_t waiting_for_crash_ipi;
+static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0);
#endif

static void ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
@@ -990,7 +990,12 @@ void smp_send_crash_stop(void)

if (atomic_read(&waiting_for_crash_ipi) > 0)
pr_warning("SMP: failed to stop secondary CPUs %*pbl\n",
- cpumask_pr_args(cpu_online_mask));
+ cpumask_pr_args(&mask));
+}
+
+bool smp_crash_stop_failed(void)
+{
+ return (atomic_read(&waiting_for_crash_ipi) > 0);
}
#endif
===>8===



In case of a failure to offline the CPUs, this can generate a message like:

SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0,2-7
Starting crashdump kernel...
Some CPUs may be stale, kdump will be unreliable.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1141 at /home/akashi/arm/armv8/linaro/linux-aarch64/arch/arm64/kernel/machine_kexec.c:157 machine_kexec+0x44/0x280
Post by Mark Rutland
Looking at this, there's a larger mess.
The waiting_for_crash_ipi dance only tells us if CPUs have taken the
IPI, not whether they've been offlined (i.e. actually left the kernel).
We need something closer to the usual cpu_{disable,die,kill} dance,
clearing online as appropriate.
First, I don't think there is any sure way to confirm whether CPUs
have successfully left the kernel.
Even if we do something like this in ipi_cpu_crash_stop():
atomic_dec(&waiting_for_crash_ipi);
cpu_die(cpu);
atomic_inc(&waiting_for_crash_ipi);
there is no guarantee that we reach the second update_cpu_boot_status()
if cpu_die() fails.

Second, while a "graceful" cpu shutdown would be fine, the basic idea
in the kdump design, I believe, is to do the minimum necessary and tear
down all the cpus as quickly as possible, not only to make the reboot more
likely to succeed but also to keep the kernel state (memory contents)
as close as possible to the moment of the panic. (The latter is arguable.)

That said, I would appreciate any suggestions on what could be
added here for a safer shutdown.


Thanks,
-Takahiro AKASHI
Post by Mark Rutland
If CPUs haven't left the kernel, we still need to warn about that.
FWIW if I trigger a crash on CPU 1 my kdump (still 4.9.8+v32) doesn't work.
I end up booting the kdump kernel on CPU#1 and then it gets distinctly unhappy...
[    0.000000] Booting Linux on physical CPU 0x1
...
[    0.017125] Detected PIPT I-cache on CPU1
[    0.017138] GICv3: CPU1: found redistributor 0 region 0:0x00000000f0280000
[    0.017147] CPU1: Booted secondary processor [411fd073]
[    0.017339] Detected PIPT I-cache on CPU2
[    0.017347] GICv3: CPU2: found redistributor 2 region 0:0x00000000f02c0000
[    0.017354] CPU2: Booted secondary processor [411fd073]
[    0.017537] Detected PIPT I-cache on CPU3
[    0.017545] GICv3: CPU3: found redistributor 3 region 0:0x00000000f02e0000
[    0.017551] CPU3: Booted secondary processor [411fd073]
[    0.017576] Brought up 4 CPUs
[    0.017587] SMP: Total of 4 processors activated.
...
[   31.751299]  1-...: (30 GPs behind) idle=c90/0/0 softirq=0/0 fqs=0 
[   31.757557]  2-...: (30 GPs behind) idle=608/0/0 softirq=0/0 fqs=0 
[   31.763814]  3-...: (30 GPs behind) idle=604/0/0 softirq=0/0 fqs=0 
[   31.770069]  (detected by 0, t=5252 jiffies, g=-270, c=-271, q=0)
[   31.779381] swapper/1       R  running task        0     0      1 0x00000080
[   31.789666] swapper/2       R  running task        0     0      1 0x00000080
[   31.799945] swapper/3       R  running task        0     0      1 0x00000080
Is some of that platform-specific?
That sounds like timer interrupts aren't being taken.
Given that the CPUs have come up, my suspicion would be that the GIC's
been left in some odd state, that the kdump kernel hasn't managed to
recover from.
Marc may have an idea.
Thanks,
Mark.
David Woodhouse
2017-03-21 09:42:23 UTC
Permalink
Post by AKASHI Takahiro
Yes, it is intentional. I removed 'offline' code in my v14 (2016/3/4).
As you assumed, I'd expect 'online' status of all CPUs to be kept
unchanged in the core dump.
I wonder if it would be better to take a *copy* of it and put it back
after we're done taking the CPUs down? As things stand, we now have
*three* different methods of taking down all the CPUs... and *none* of
them allow a platform to override it with an NMI-based or STONITH-based
method, which seems like something of an oversight.
Post by AKASHI Takahiro
 
+ BUG_ON(!in_kexec_crash && (stuck_cpus || (num_online_cpus() > 1)));
+ WARN(in_kexec_crash && (stuck_cpus || smp_crash_stop_failed()),
+ "Some CPUs may be stale, kdump will be unreliable.\n");
That works; thanks.

FWIW I'm currently blaming my platform's firmware for my sporadic
crash-on-CPU#1 failures. If your testing includes crashes on non-boot
CPUs (perhaps using the sysrq hack I posted) and it reliably passes for
you, then let's ignore that for now.
Pratyush Anand
2017-03-20 12:42:48 UTC
Permalink
Post by AKASHI Takahiro
This patch series adds kdump support on arm64.
To load a crash-dump kernel to the systems, a series of patches to
kexec-tools[1] are also needed. Please use the latest one, v6 [2].
https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump
To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
- crash utility (v7.1.8 or later) [3]
I tested this patchset on fast model and hikey.
You have it for this version as well.

Thanks for the patchset.

~Pratyush
Goel, Sameer
2017-03-22 16:55:55 UTC
Permalink
I tested this patch set on a QDT2400 device with 4k 4-level setup. This patchset worked fine with maxcpus=1 on the first kernel.

I don't think this is an issue with the patchset; I am still triaging it further.
Post by AKASHI Takahiro
This patch series adds kdump support on arm64.
To load a crash-dump kernel to the systems, a series of patches to
kexec-tools[1] are also needed. Please use the latest one, v6 [2].
https://git.linaro.org/people/takahiro.akashi/linux-aarch64.git arm64/kdump
https://git.linaro.org/people/takahiro.akashi/kexec-tools.git arm64/kdump
To examine vmcore (/proc/vmcore) on a crash-dump kernel, you can use
- crash utility (v7.1.8 or later) [3]
I tested this patchset on fast model and hikey.
Tested-by: Sameer Goel (v32, QDT2400)
Changes for v33 (Mar 15, 2017)
o rebased to v4.11-rc2+
o arch_kexec_(un)protect_crashkres() now protects loaded data segments
only along with moving copying of control_code_page back to machine_kexec()
(patch #6)
o reduce the size of hibernation image when kdump and hibernation are
configured at the same time (patch #7)
o clarify that "linux,usable-memory-range" and "linux,elfcorehdr"
have values of the size of root node's "#address-cells" and "#size-cells"
(patch #13)
o add "efi/libstub/arm*: Set default address and size cells values for
an empty dtb" from Sameer Goel (patch #14)
(I didn't test the case though.)
Changes for v32 (Feb 7, 2017)
o isolate crash dump kernel memory as well as kernel text/data by using
MEMBLOCK_NOMAP attribute and then specifically map them in map_mem()
(patch #1,6)
o delete remove_pgd_mapping() and instead modify create_pgd_mapping() to
allow unmapping a kernel mapping (patch #5)
o correct a commit message as well as a comment in the source (patch#10)
o other trivial changes after Mark's comments (patch#3,4)
Changes for v31 (Feb 1, 2017)
o add/use remove_pgd_mapping() instead of modifying (__)create_pgd_mapping()
to protect crash dump kernel memory (patch #4,5)
o fix an issue at the isolation of crash dump kernel memory in
map_mem()/__map_memblock(), adding map_crashkernel() (patch#5)
o preserve the contents of crash dump kernel memory around hibernation
(patch#6)
Changes for v30 (Jan 24, 2017)
o rebased to Linux-v4.10-rc5
o remove "linux,crashkernel-base/size" from exported device tree
o protect memory region for crash-dump kernel (adding patch#4,5)
o remove "in_crash_kexec" variable
o and other trivial changes
Changes for v29 (Dec 28, 2016)
o rebased to Linux-v4.10-rc1
o change asm constraints in crash_setup_regs() per Catalin
Changes for v28 (Nov 22, 2016)
o rebased to Linux-v4.9-rc6
o revamp patch #1 and merge memblock_cap_memory_range() with
memblock_mem_limit_remove_map()
Changes for v27 (Nov 1, 2016)
o rebased to Linux-v4.9-rc3
o revert v26 change, i.e. revive "linux,usable-memory-range" property
(patch #2/#3, updating patch #9)
o minor fixes per review comments (patch #3/#4/#6/#8)
o re-order patches and improve commit messages for readability
o Use /reserved-memory instead of "linux,usable-memory-range" property
(dropping v25's patch#2 and #3, updating ex-patch#9.)
o Rebase to Linux-4.8-rc4
o Use memremap() instead of ioremap_cache() [patch#5]
o Rebase to Linux-4.8-rc1
o Update descriptions about newly added DT properties
o Move memblock_reserve() to a single place in reserve_crashkernel()
o Use cpu_park_loop() in ipi_cpu_crash_stop()
o Always enforce ARCH_LOW_ADDRESS_LIMIT to the memory range of crash kernel
o Re-implement fdt_enforce_memory_region() to remove non-reserve regions
(for ACPI) from usable memory at crash kernel
o Export "crashkernel-base" and "crashkernel-size" via device-tree,
and add some descriptions about them in chosen.txt
o Rename "usable-memory" to "usable-memory-range" to avoid inconsistency
with powerpc's "usable-memory"
o Make cosmetic changes regarding "ifdef" usage
o Correct some wordings in kdump.txt
o Remove kexec patches.
o Rebase to arm64's for-next/core (Linux-4.7-rc4 based).
o Clarify the description about kvm in kdump.txt.
See the link [4] for older changes.
[1] https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
[2] http://lists.infradead.org/pipermail/kexec/2017-March/018356.html
[3] https://github.com/crash-utility/crash.git
[4] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-June/438780.html
memblock: add memblock_clear_nomap()
memblock: add memblock_cap_memory_range()
arm64: limit memory regions based on DT property, usable-memory-range
arm64: kdump: reserve memory for crash dump kernel
arm64: mm: allow for unmapping part of kernel mapping
arm64: kdump: protect crash dump kernel memory
arm64: hibernate: preserve kdump image around hibernation
arm64: kdump: implement machine_crash_shutdown()
arm64: kdump: add VMCOREINFO's for user-space tools
arm64: kdump: provide /proc/vmcore file
arm64: kdump: enable kdump in defconfig
Documentation: kdump: describe arm64 port
Documentation: dt: chosen properties for arm64 kdump
efi/libstub/arm*: Set default address and size cells values for an
empty dtb
Documentation/devicetree/bindings/chosen.txt | 45 +++++++
Documentation/kdump/kdump.txt | 16 ++-
arch/arm64/Kconfig | 11 ++
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/hardirq.h | 2 +-
arch/arm64/include/asm/kexec.h | 52 +++++++-
arch/arm64/include/asm/pgtable-prot.h | 1 +
arch/arm64/include/asm/smp.h | 2 +
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/crash_dump.c | 71 +++++++++++
arch/arm64/kernel/hibernate.c | 10 +-
arch/arm64/kernel/machine_kexec.c | 170 +++++++++++++++++++++++--
arch/arm64/kernel/setup.c | 7 +-
arch/arm64/kernel/smp.c | 63 ++++++++++
arch/arm64/mm/init.c | 181 +++++++++++++++++++++++++++
arch/arm64/mm/mmu.c | 107 ++++++++--------
drivers/firmware/efi/libstub/fdt.c | 28 ++++-
include/linux/memblock.h | 2 +
mm/memblock.c | 56 ++++++---
19 files changed, 745 insertions(+), 81 deletions(-)
create mode 100644 arch/arm64/kernel/crash_dump.c
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.