Quick notes on KERNSEAL

The mysterious unreadable kernseal.txt file on PaX' documentation page has been sitting there since 2003, described as "sealed kernel storage design & implementation." In 2006, it was described as:

the problem KERNSEAL sets out to solve is kernel self-protection, that is, assuming arbitrary read/write access to kernel memory (by some bug, but for all i care, it could even be a mode 777 /dev/mem as well), the goal is to prevent privilege elevation (vs. privilege abuse which is an even harder problem to solve).

After many years of KERNSEAL ETA WEN jokes on #grsecurity, it was finally made available to grsecurity beta customers in August 2023 and to LTS ones in January 2024. I was eagerly awaiting minipli's blog post on the topic, but since none has been published so far, I set out to read an old patch a friend of mine was kind enough to sling my way, and to publish some high-level notes in the meantime, since the diff for pax-linux-6.2.13-test6-kernseal-only.patch is apparently "only" 66 files changed, 1118 insertions(+), 361 deletions(-). Odds are that most of my understanding is completely wrong nonetheless, so take everything written here with a mountain of salt.

The main idea behind PAX_KERNSEAL seems to be the constification of dynamically allocated objects, a bit like what PAX_CONSTIFY_PLUGIN does for static ones, as well as completely hiding some of them. It depends on a couple of things to enforce its security invariants:

PaX' RAP, to prevent out-of-(intended)-order execution of existing code, otherwise an attacker could simply ROP their way around KERNSEAL.
PAX_PRIVATE_KSTACKS, to defend against kthreads manipulating each other's return addresses after RAP checks.
CONFIG_PAX_PER_CPU_PGD, to prevent other kthreads from accessing temporarily unsealed pages on a given CPU.
CONFIG_PAGE_TABLE_ISOLATION, of course.
Having neither hibernation nor kexec support.

It introduces two new page states via GFP flags, and stores those properties in the struct page:

__GFP_SEALED/PG_sealed: The page is mapped read-only in the direct map (the linear mapping of all physical memory).
__GFP_HIDDEN/PG_hidden: The page is mapped invalid (completely unmapped) in the direct map, so contents can't be read or written through the normal direct-map address.

New corresponding migrate types (MIGRATE_SEALED and MIGRATE_HIDDEN) are added to the buddy allocator, ensuring that sealed and hidden pages are grouped together in dedicated pageblocks. Of course those types are non-mergeable, preventing the allocator from stealing sealed/hidden blocks for normal allocations.

When a pageblock is set up for sealed or hidden use, pax_setup_pageblock() walks the PMD entries in the direct map and applies pmd_wrprotect() (for sealed) or pmd_mkinvalid() (for hidden), followed by a TLB flush. After allocation, post_alloc_hook() verifies that page flags match the requested GFP flags (sealed pages must have PG_sealed, etc.), and updates per-node statistics (NR_SEALED, NR_HIDDEN).
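The post-allocation check can be pictured with a tiny standalone model. This is only a sketch of the idea, not the patch's code: the SKETCH_* bit values and the function name are made up for illustration, and I'm assuming the check is symmetric (a sealed page handed out for a non-sealed request is just as wrong as the reverse).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this illustration. */
#define SKETCH_GFP_SEALED (1u << 0)
#define SKETCH_GFP_HIDDEN (1u << 1)
#define SKETCH_PG_SEALED  (1u << 0)
#define SKETCH_PG_HIDDEN  (1u << 1)

/*
 * Return true when the state recorded on the page is exactly the one the
 * allocation request asked for (assumption: the check is symmetric).
 */
static bool sketch_page_matches_gfp(uint32_t gfp, uint32_t page_flags)
{
	bool want_sealed = gfp & SKETCH_GFP_SEALED;
	bool want_hidden = gfp & SKETCH_GFP_HIDDEN;
	bool is_sealed = page_flags & SKETCH_PG_SEALED;
	bool is_hidden = page_flags & SKETCH_PG_HIDDEN;

	return want_sealed == is_sealed && want_hidden == is_hidden;
}
```

In the real kernel a mismatch would presumably be treated as a bug rather than returned as a boolean; the model only captures the comparison itself.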

As hidden pages have no valid direct-map mapping, the kernel needs a way to temporarily access them, which is done via the pax_expose_page()/pax_hide_page() pair, a bit like what KERNEXEC's pax_open_kernel()/pax_close_kernel() do to keep the kernel code read-only.

A dedicated KM_USER_SLOT is reserved for KERNSEAL kmap operations, and every kmap-related call is hooked: if the page is hidden, it goes through pax_expose_page() to create a temporary per-CPU mapping; if sealed, access is blocked entirely with VM_BUG_ON_PAGE_ALWAYS, dumping the page and calling BUG().

A new kmalloc cache type (KMALLOC_SEALED) is added, to allow the kernel to allocate sealed data at a finer granularity than a whole page. Temporary unsealing (for initialization, for example) is provided by pax_open_seal()/pax_close_seal(), which are simply wrappers around pax_open_kernel() and pax_close_kernel().

The most obvious usage of KERNSEAL in the patch I have is struct cred: the mutable fields (usage, rcu, non_rcu) are split into a separate struct cred_rw, while the cred structure itself is marked __mutable_const, with the rw portion actually being a pointer to separately allocated mutable memory:

 */
 struct cred {                                                                       
-       atomic_long_t   usage;
+       struct cred_rw {
+               /* RCU deletion */
+               union {
+                       int non_rcu;            /* Can we skip RCU deletion? */
+                       struct rcu_head rcu;    /* RCU deletion hook */
+               };
+#ifdef CONFIG_PAX_KERNSEAL      
+               struct cred     *cred;
+#endif
+               atomic_long_t   usage;   
+       }
+#ifdef CONFIG_PAX_KERNSEAL
+       *rw;
+#else
+       _rw;
+#endif

// [...]

+#ifdef CONFIG_PAX_KERNSEAL
+} __randomize_layout __mutable_const;
+#else
 } __randomize_layout;
+#endif

and used like this:

+#ifdef CONFIG_PAX_KERNSEAL
+#define to_cred_rw(cred)       (cred->rw)
+#define to_cred(cred_rw)       (cred_rw->cred)                          
+#else
+#define to_cred_rw(cred)       (&cred->_rw)
+#define to_cred(cred_rw)       (container_of(cred_rw, struct cred, _rw))
+#endif
+

This means the credential's security-sensitive fields (UIDs, GIDs, capabilities) live on sealed pages and cannot be tampered with, while the reference count and RCU linkage live on normal writable memory.
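The split can be mimicked in plain userspace C to see why it works. This is a simplified model under loud assumptions: the sketch_* names are mine, the fields are trimmed down, the counter is a plain long instead of an atomic, and malloc() obviously doesn't seal anything. What it does show is that taking and dropping references only ever writes through the rw pointer, so the cred itself can be handled as const everywhere.

```c
#include <stdlib.h>

struct sketch_cred;

/* Mutable part: would live in normal writable memory. */
struct sketch_cred_rw {
	struct sketch_cred *cred;	/* back-pointer to the sealed part */
	long usage;			/* reference count (not atomic here) */
};

/* Immutable part: would live on a sealed page. */
struct sketch_cred {
	struct sketch_cred_rw *rw;
	unsigned int uid;
	unsigned int gid;
};

#define sketch_to_cred_rw(c)	((c)->rw)
#define sketch_to_cred(r)	((r)->cred)

static const struct sketch_cred *sketch_cred_alloc(unsigned int uid,
						   unsigned int gid)
{
	struct sketch_cred *c = malloc(sizeof(*c));
	struct sketch_cred_rw *rw = malloc(sizeof(*rw));

	if (!c || !rw) {
		free(c);
		free(rw);
		return NULL;
	}
	/* Initialization is the only time the "sealed" part is written. */
	rw->cred = c;
	rw->usage = 1;
	c->rw = rw;
	c->uid = uid;
	c->gid = gid;
	return c;
}

/* Taking a reference writes to the rw part only; the cred stays const. */
static const struct sketch_cred *sketch_get_cred(const struct sketch_cred *c)
{
	sketch_to_cred_rw(c)->usage++;
	return c;
}

/* Dropping the last reference frees both halves via the back-pointer. */
static void sketch_put_cred(const struct sketch_cred *c)
{
	struct sketch_cred_rw *rw = sketch_to_cred_rw(c);

	if (--rw->usage == 0) {
		free(sketch_to_cred(rw));
		free(rw);
	}
}
```

Note how the back-pointer replaces container_of(): once the two halves are no longer contiguous in memory, pointer arithmetic can't get you from one to the other anymore.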

Debugging-wise, when a page fault occurs on a direct-mapped address, the fault handler checks whether the page is sealed or hidden and provides a clear diagnostic:

+   if (is_direct_mapped_addr((void *)address)) {
+       struct page *page = virt_to_page((void *)address);
+       if (PageSealed(page) || PageHidden(page)) {
+           pr_alert("BUG: unable to handle page fault for %s page at %pS\n",
+               PageSealed(page) ? "sealed" : "hidden", (void *)address);

Moreover, sealed and hidden page counts are exposed via /proc/meminfo and per-node meminfo, plus per-process stats in /proc/<pid>/smaps:

Sealed: — total sealed pages in kB
Hidden: — total hidden pages in kB

New kpageflags bits (KPF_SEALED = 62, KPF_HIDDEN = 63) are also exported.
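Decoding those bits from userspace is straightforward. The sketch below only covers the decoding step; the bit numbers are the ones from the patch, while the enum and function names are hypothetical, as is the choice to report "hidden" first should both bits ever be set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as exported by the patch. */
#define KPF_SEALED 62
#define KPF_HIDDEN 63

enum sketch_kernseal_state {
	SKETCH_NORMAL,
	SKETCH_SEALED,
	SKETCH_HIDDEN,
};

/* Test a single bit of a 64-bit kpageflags entry. */
static bool sketch_kpf_test(uint64_t flags, unsigned int bit)
{
	return (flags >> bit) & 1;
}

/* Classify one kpageflags entry (assumption: hidden wins if both are set). */
static enum sketch_kernseal_state sketch_kernseal_state(uint64_t flags)
{
	if (sketch_kpf_test(flags, KPF_HIDDEN))
		return SKETCH_HIDDEN;
	if (sketch_kpf_test(flags, KPF_SEALED))
		return SKETCH_SEALED;
	return SKETCH_NORMAL;
}
```

The input would be one of the 64-bit entries of /proc/kpageflags, which is indexed by PFN (so the entry for a given page sits at byte offset pfn * 8) and requires root to read.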

As for PAX_PRIVATE_KSTACKS (in the context of PAX_KERNSEAL), it creates a per-CPU page table where each task gets a dedicated slot with guard pages. Only the current task's stack is accessible, via dynamic PTE-level (un)mapping magic. Underlying physical pages are allocated with __GFP_HIDDEN of course. For stack variables that require DMA/async access, an ad-hoc GCC plugin identifies them and stores them in a per-task dedicated page.

Even though PAX_PRIVATE_KSTACKS and PAX_KERNSEAL are conceptually simple mitigations, they are likely super-tedious to apply to the Linux kernel code behemoth. Tackling data-only attacks is hard, and the only other people seriously trying to address them are at Apple, with their hardware-based KTRR/CTRR/GXF/APRR/PPL/SPTM/TXM mitigations. This makes KERNSEAL all the more remarkable, as like everything produced by the PaX Team, it doesn't require special hardware support.

Another interesting property of KERNSEAL is that it can serve as a basis for other things, like ensuring that no guest pages are available at the hypervisor level in KVM. I can't wait to see what will be built on top next.

All in all, unsurprisingly, KERNSEAL is yet another all-around tour de force from the PaX Team, who have been consistently producing stellar software-only mitigations before everyone else for almost 25 years.