Quick notes on KERNSEAL
The mysterious unreadable kernseal.txt file on PaX' documentation page has been
sitting there since 2003, described as "sealed kernel storage design &
implementation." In 2006, it was described as:
the problem KERNSEAL sets out to solve is kernel self-protection, that is, assuming arbitrary read/write access to kernel memory (by some bug, but for all i care, it could even be a mode 777 /dev/mem as well), the goal is to prevent privilege elevation (vs. privilege abuse which is an even harder problem to solve).
After many years of KERNSEAL ETA WEN jokes on #grsecurity, it was finally made
available to grsecurity beta customers in August 2023 and to LTS ones in January
2024. I was eagerly expecting minipli's blogpost on the topic, but since none
has been published so far, I endeavoured to read an old patch a friend of mine
was kind enough to sling my way, and to take/publish some high-level notes while
waiting for it, as apparently the diff for
pax-linux-6.2.13-test6-kernseal-only.patch is "only" 66 files changed, 1118
insertions(+), 361 deletions(-). Odds are that most of my understanding is
completely wrong nonetheless, so take everything written here with a mountain
of salt.
The main idea behind PAX_KERNSEAL seems to be the constification of
dynamically allocated objects, a bit like what PAX_CONSTIFY_PLUGIN does for
static ones, along with completely hiding some of them.
It depends on a couple of things to enforce its security invariants:
- PaX' RAP, to prevent out-of-(intended)-order execution of existing code, otherwise an attacker could simply ROP their way around KERNSEAL.
- PAX_PRIVATE_KSTACKS, to defend against kthreads manipulating each other's return addresses after RAP checks.
- CONFIG_PAX_PER_CPU_PGD, to prevent other kthreads from accessing temporarily unsealed pages on a given CPU.
- CONFIG_PAGE_TABLE_ISOLATION, of course.
- Not having hibernation nor kexec support.
It introduces two new page states via GFP flags, and stores those properties in
the struct page:
- __GFP_SEALED/PG_sealed: The page is mapped read-only in the direct map (the linear mapping of all physical memory).
- __GFP_HIDDEN/PG_hidden: The page is mapped invalid (completely unmapped) in the direct map, so contents can't be read or written through the normal direct-map address.
New corresponding migrate types (MIGRATE_SEALED and MIGRATE_HIDDEN) are
added to the buddy allocator, ensuring that sealed and hidden pages are grouped
together in dedicated pageblocks. Of course those types are non-mergeable,
preventing the allocator from stealing sealed/hidden blocks for normal
allocations.
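To make the "no stealing" rule concrete, here is a small user-space model. It
is only a sketch of the behaviour described above, not code from the patch:
the enum, can_steal() and pick_fallback() are hypothetical names, and the rule
that a sealed request may not borrow from a hidden block (and vice versa) is
my own assumption derived from "dedicated pageblocks".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of migrate types; names follow the text, values are made up. */
enum migratetype_model {
    MT_UNMOVABLE,
    MT_MOVABLE,
    MT_RECLAIMABLE,
    MT_SEALED,
    MT_HIDDEN,
};

static bool is_protected_type(enum migratetype_model mt)
{
    return mt == MT_SEALED || mt == MT_HIDDEN;
}

/* May a request of type `want` be served from a free pageblock of type `have`? */
bool can_steal(enum migratetype_model want, enum migratetype_model have)
{
    if (want == have)
        return true;
    /* Sealed and hidden pageblocks only ever serve their own type. */
    if (is_protected_type(have))
        return false;
    return true;
}

/* Walk the free pageblocks and return the index of the first usable one, or -1. */
int pick_fallback(enum migratetype_model want,
                  const enum migratetype_model *free_blocks, int n)
{
    for (int i = 0; i < n; i++) {
        enum migratetype_model have = free_blocks[i];

        if (can_steal(want, have)) {
            /* Invariant from the text: a protected block is never repurposed. */
            assert(!is_protected_type(have) || have == want);
            return i;
        }
    }
    return -1;
}
```

With free blocks of type sealed, hidden and unmovable, a movable request skips
the first two and lands on the unmovable one.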
When a pageblock is set up for sealed or hidden use, pax_setup_pageblock()
walks the PMD entries in the direct map and applies
pmd_wrprotect() (for sealed) or pmd_mkinvalid() (for hidden), followed by a
TLB flush. After allocation, post_alloc_hook() verifies that page flags match
the requested GFP flags (sealed pages must have PG_sealed, etc.), and updates
per-node statistics (NR_SEALED, NR_HIDDEN).
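The post-allocation check can be pictured with the following user-space
sketch. Everything in it is a stand-in: the bit values, struct page_model,
struct node_stats_model and post_alloc_model() are hypothetical, and I am
assuming that the check is symmetric (a normal request must not receive a
sealed or hidden page either), which the patch may or may not enforce.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real ones are not discussed in these notes. */
#define GFP_SEALED_MODEL (1u << 0)
#define GFP_HIDDEN_MODEL (1u << 1)
#define PG_SEALED_MODEL  (1u << 0)
#define PG_HIDDEN_MODEL  (1u << 1)

struct page_model {
    unsigned int flags;
};

/* Stand-in for the per-node NR_SEALED / NR_HIDDEN counters. */
struct node_stats_model {
    long nr_sealed;
    long nr_hidden;
};

/* True when the page state agrees with what the caller asked for. */
bool flags_match(unsigned int gfp, const struct page_model *page)
{
    bool want_sealed = gfp & GFP_SEALED_MODEL;
    bool want_hidden = gfp & GFP_HIDDEN_MODEL;
    bool is_sealed = page->flags & PG_SEALED_MODEL;
    bool is_hidden = page->flags & PG_HIDDEN_MODEL;

    return want_sealed == is_sealed && want_hidden == is_hidden;
}

/* Verify the page, then account for it in the node statistics. */
void post_alloc_model(unsigned int gfp, const struct page_model *page,
                      struct node_stats_model *stats)
{
    assert(flags_match(gfp, page)); /* the kernel would complain loudly here */

    if (page->flags & PG_SEALED_MODEL)
        stats->nr_sealed++;
    if (page->flags & PG_HIDDEN_MODEL)
        stats->nr_hidden++;
}
```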
As hidden pages have no valid direct-map mapping, the kernel needs a way to
temporarily access them, which is done via the pax_expose_page/pax_hide_page
pair, a bit like what KERNEXEC's pax_open_kernel/pax_close_kernel do for the
otherwise read-only kernel code.
A dedicated KM_USER_SLOT is reserved for KERNSEAL kmap operations, and every
kmap-related call is hooked: if the page is hidden, it goes through
pax_expose_page() to create a temporary per-CPU mapping; if sealed, access is
blocked entirely with VM_BUG_ON_PAGE_ALWAYS, dumping the page and calling
BUG().
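The kmap policy above boils down to a three-way dispatch on the page state.
Here is a toy user-space model of it, under loud assumptions: all names
(kpage_model, cpu_model, kmap_model(), kunmap_model()) are invented for
illustration, returning NULL stands in for BUG(), and the "one page per slot"
rule is my guess at how a single reserved slot would behave.

```c
#include <assert.h>
#include <stddef.h>

enum page_state { PAGE_NORMAL, PAGE_SEALED, PAGE_HIDDEN };
enum kmap_action { KMAP_DIRECT, KMAP_EXPOSE, KMAP_BUG };

struct kpage_model {
    enum page_state state;
    char data[16];
};

/* One reserved slot per CPU, standing in for the dedicated kmap slot. */
struct cpu_model {
    const struct kpage_model *slot;
};

/* The decision described in the text: expose hidden pages, refuse sealed ones. */
enum kmap_action kmap_policy(const struct kpage_model *page)
{
    switch (page->state) {
    case PAGE_HIDDEN:
        return KMAP_EXPOSE;
    case PAGE_SEALED:
        return KMAP_BUG;
    default:
        return KMAP_DIRECT;
    }
}

/* Returns a usable pointer, or NULL where the real code would BUG(). */
const char *kmap_model(struct cpu_model *cpu, const struct kpage_model *page)
{
    switch (kmap_policy(page)) {
    case KMAP_EXPOSE:
        assert(cpu->slot == NULL); /* assumption: the slot holds one page at a time */
        cpu->slot = page;          /* stands in for pax_expose_page() */
        return page->data;
    case KMAP_BUG:
        return NULL;
    default:
        return page->data;         /* normal pages use the direct map */
    }
}

/* Drop the temporary mapping again; stands in for pax_hide_page(). */
void kunmap_model(struct cpu_model *cpu, const struct kpage_model *page)
{
    if (cpu->slot == page)
        cpu->slot = NULL;
}
```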
A new kmalloc cache type (KMALLOC_SEALED) is added, to allow the kernel to
allocate sealed data at a finer granularity than page level. Temporary
unsealing (for initialization, for example) is provided by
pax_open_seal()/pax_close_seal(), which are simply wrappers around
pax_open_kernel and pax_close_kernel.
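The open/write/close pattern is easy to try out in user space. The sketch
below is an analogy only: mprotect() is the closest user-space equivalent I
could think of and is not how the kernel side works, and sealed_alloc(),
open_seal(), close_seal() and sealed_init() are hypothetical names that do not
appear in the patch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size(void)
{
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* Allocate one page that starts out read-only ("sealed"); NULL on failure. */
void *sealed_alloc(void)
{
    void *p = mmap(NULL, page_size(), PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Temporarily allow writes to the page. */
int open_seal(void *page)
{
    return mprotect(page, page_size(), PROT_READ | PROT_WRITE);
}

/* Make the page read-only again. */
int close_seal(void *page)
{
    return mprotect(page, page_size(), PROT_READ);
}

/* Initialise sealed data inside the shortest possible writable window. */
int sealed_init(void *page, const void *src, size_t len)
{
    assert(len <= page_size());

    if (open_seal(page) != 0)
        return -1;
    memcpy(page, src, len);
    return close_seal(page);
}
```

Once sealed_init() returns, reads still work but any stray write to the page
faults, which is the property the sealed kmalloc cache is after.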
The most obvious use of KERNSEAL in the patch I have is on struct cred: the
mutable fields (usage, rcu, non_rcu) are split into a separate
struct cred_rw, while the cred structure itself is marked
__mutable_const, with the rw portion actually being a pointer to
separately allocated mutable memory:
*/
struct cred {
-    atomic_long_t usage;
+    struct cred_rw {
+        /* RCU deletion */
+        union {
+            int non_rcu; /* Can we skip RCU deletion? */
+            struct rcu_head rcu; /* RCU deletion hook */
+        };
+#ifdef CONFIG_PAX_KERNSEAL
+        struct cred *cred;
+#endif
+        atomic_long_t usage;
+    }
+#ifdef CONFIG_PAX_KERNSEAL
+    *rw;
+#else
+    _rw;
+#endif
// [...]
+#ifdef CONFIG_PAX_KERNSEAL
+} __randomize_layout __mutable_const;
+#else
} __randomize_layout;
+#endif
and used like this:
+#ifdef CONFIG_PAX_KERNSEAL
+#define to_cred_rw(cred) (cred->rw)
+#define to_cred(cred_rw) (cred_rw->cred)
+#else
+#define to_cred_rw(cred) (&cred->_rw)
+#define to_cred(cred_rw) (container_of(cred_rw, struct cred, _rw))
+#endif
+
This means the credential's security-sensitive fields (UIDs, GIDs, capabilities) live on sealed pages and cannot be tampered with, while the reference count and RCU linkage live on normal writable memory.
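A stripped-down user-space model of the KERNSEAL layout shows why the split
matters: a reference can be taken through a const pointer, because only the
separately allocated rw part is written. This is a sketch, not the kernel's
code; cred_model, cred_rw_model, cred_model_init() and get_cred_model() are
hypothetical, and a plain long stands in for atomic_long_t.

```c
#include <assert.h>
#include <stddef.h>

struct cred_model;

/* Mutable part: would live in ordinary writable memory. */
struct cred_rw_model {
    long usage;              /* reference count */
    struct cred_model *cred; /* back-pointer, as in the KERNSEAL variant */
};

/* Immutable part: would live on a sealed page. */
struct cred_model {
    struct cred_rw_model *rw;
    unsigned int uid;
    unsigned int gid;
};

/* Set up both halves; in the kernel this would happen while the seal is open. */
void cred_model_init(struct cred_model *cred, struct cred_rw_model *rw,
                     unsigned int uid, unsigned int gid)
{
    rw->usage = 1;
    rw->cred = cred;
    cred->rw = rw;
    cred->uid = uid;
    cred->gid = gid;
}

/* Take a reference through a const pointer: only the rw half is modified. */
const struct cred_model *get_cred_model(const struct cred_model *cred)
{
    assert(cred->rw->cred == cred); /* the two halves must point at each other */
    cred->rw->usage++;
    return cred;
}
```

The back-pointer is what replaces container_of() once the two halves no longer
share an allocation.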
Debugging-wise, when a page fault occurs on a direct-mapped address, the fault handler checks whether the page is sealed or hidden and provides a clear diagnostic:
+    if (is_direct_mapped_addr((void *)address)) {
+        struct page *page = virt_to_page((void *)address);
+        if (PageSealed(page) || PageHidden(page)) {
+            pr_alert("BUG: unable to handle page fault for %s page at %pS\n",
+                     PageSealed(page) ? "sealed" : "hidden", (void *)address);
Moreover, sealed and hidden page counts are exposed via /proc/meminfo and
per-node meminfo, plus per-process stats in /proc/<pid>/smaps:
- Sealed: total sealed pages in kB
- Hidden: total hidden pages in kB
New kpageflags bits (KPF_SEALED = 62, KPF_HIDDEN = 63) are also exported.
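Assuming those bits are exported like the existing ones, i.e. as bit positions
in the 64-bit per-page flag words that /proc/kpageflags returns, decoding them
is a simple shift and mask. The helper names below are hypothetical; only the
two bit numbers come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KPF_SEALED 62
#define KPF_HIDDEN 63

/* Test a single bit of a 64-bit kpageflags word. */
bool kpf_test(uint64_t flags, unsigned int bit)
{
    assert(bit < 64);
    return (flags >> bit) & 1;
}

/* Summarise the KERNSEAL state encoded in a kpageflags word. */
const char *kpf_kernseal_state(uint64_t flags)
{
    if (kpf_test(flags, KPF_HIDDEN))
        return "hidden";
    if (kpf_test(flags, KPF_SEALED))
        return "sealed";
    return "normal";
}
```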
As for PAX_PRIVATE_KSTACKS (in the context of PAX_KERNSEAL), it creates a
per-CPU page table where each task gets a dedicated slot with guard pages. Only
the current task's stack is accessible, via dynamic PTE-level (un)mapping
magic. Underlying physical pages are allocated with __GFP_HIDDEN of course.
For stack variables that require DMA/async access, an ad-hoc GCC plugin
identifies them and stores them in a per-task dedicated page.
Even though PAX_PRIVATE_KSTACKS and PAX_KERNSEAL are conceptually simple
mitigations, they are likely super-tedious to apply to the Linux kernel code
behemoth. Tackling data-only attacks is hard, and the only other party
seriously trying to address them is Apple, with its hardware-based
KTRR/CTRR/GXF/APRR/PPL/SPTM/TXM mitigations. This makes KERNSEAL all the more
remarkable since, like everything produced by the PaX Team, it doesn't require
special hardware support.
Another nice property of KERNSEAL is that it can serve as a basis for other interesting things, such as ensuring that no guest pages are available at the hypervisor level in KVM. I can't wait to see what will be built on top next.
All in all, unsurprisingly, KERNSEAL is yet another all-around tour de force from the PaX Team, who have been consistently producing stellar software-only mitigations before everyone else for almost 25 years.