This post discusses a use-after-free vulnerability, CVE-2024-0582, in io_uring in the Linux kernel. Despite the vulnerability being patched in the stable kernel in December 2023, it wasn’t ported to Ubuntu kernels for over two months, making it an easy 0day vector in Ubuntu during that time.
In early January 2024, a Project Zero issue for a recently fixed io_uring
use-after-free (UAF) vulnerability (CVE-2024-0582) was made public. It was apparent that the vulnerability allowed an attacker to obtain read and write access to a number of previously freed pages. This seemed to be a very powerful primitive: usually a UAF gets you access to a freed kernel object, not a whole page – or even better, multiple pages. As the Project Zero issue also described, it was clear that this vulnerability should be easily exploitable: if an attacker has total access to free pages, once these pages are returned to a slab cache to be reused, they will be able to modify any contents of any object allocated within these pages. In the more common situation, the attacker can modify only a certain type of object, and possibly only at certain offsets or with certain values.
Moreover, this fact also suggests that a data-only exploit should be possible. In general terms, such an exploit does not rely on modifying the code execution flow, by building for instance a ROP chain or using similar techniques. Instead, it focuses on modifying certain data that ultimately grants the attacker root privileges, such as making read-only files writable by the attacker. This approach makes exploitation more reliable, stable, and allows bypassing some exploit mitigations such as Control-Flow Integrity (CFI), as the instructions executed by the kernel are not altered in any way.
Finally, according to the Project Zero issue, this vulnerability was present in the Linux kernel from versions starting at 6.4 and prior to 6.7. At that moment, Ubuntu 23.10 was running a vulnerable verison of 6.5 (and somewhat later so was Ubuntu 22.04 LTS), so it was a good opportunity to exploit the patch gap, understand how easy it would be for an attacker to do that, and how long they might possess an 0day exploit based on an Nday.
More precisely:
This post describes the data-only exploit strategy that we implemented, allowing a non-privileged user (and without the need of unprivileged user namespaces) to achieve root privileges on affected systems. First, a general overview of the io_uring
interface is given, as well as some more specific details of the interface relevant to this vulnerability. Next, an analysis of the vulnerability is provided. Finally, a strategy for a data-only exploit is presented.
The io_uring
interface is an asynchronous I/O API for Linux created by Jens Axboe and introduced in the Linux kernel version 5.1. Its goal is to improve performance of applications with a high number of I/O operations. It provides interfaces similar to functions like read()
and write()
, for example, but requests are satisfied in an asynchronous manner to avoid the context switching overhead caused by blocking system calls.
The io_uring
interface has been a bountiful target for a lot of vulnerability research; it was disabled in ChromeOS, production Google servers, and restricted in Android. As such, there are many blog posts that explain it with a lot of detail. Some relevant references are the following:
io_uring
operation that provides the same functionality (IORING_OP_PROVIDE_BUFFERS
) as the vulnerability discussed here (IORING_REGISTER_PBUF_RING
), and that has also a broad overview of this subsystem.io_uring
vulnerability is described.In the next subsections we give an overview of the io_uring
interface. We pay special attention to the Provided Buffer Ring functionality, which is relevant to the vulnerability discussed in this post. The reader can also check “What is io_uring?”, as well as the above references for alternative overviews of this subsystem.
The basis of io_uring
is a set of two ring buffers used for communication between user and kernel space. These are:
This model allows executing a number of I/O requests to be performed asynchronously using a single system call, while in a synchronous manner each request would have typically corresponded to a single system call. This reduces the overhead caused by blocking system calls, thus improving performance. Moreover, the use of shared buffers also reduces the overhead as no data between user and kernelspace has to be transferred.
The io_uring
API consists of three system calls:
io_uring_setup()
io_uring_register()
io_uring_enter()
The io_uring_setup()
system call sets up a context for an io_uring
instance, that is, a submission and a completion queue with the indicated number of entries each one. Its prototype is the following:
int io_uring_setup(u32 entries, struct io_uring_params *p);
Its arguments are:
entries
: It determines how many elements the SQ and CQ must have at the minimum.params
: It can be used by the application to pass options to the kernel, and by the kernel to pass information to the application about the ring buffers.On success, the return value of this system call is a file descriptor that can be later used to perform operation on the io_uring
instance.
The io_uring_register()
system call allows registering resources, such as user buffers, files, etc., for use in an io_uring
instance. Registering such resources makes the kernel map them, avoiding future copies to and from userspace, thus improving performance. Its prototype is the following:
int io_uring_register(unsigned int fd, unsigned int opcode, void *arg, unsigned int nr_args);
Its arguments are:
fd
: The file io_uring
file descriptor returned by the io_uring_setup()
system call.opcode
: The specific operation to be executed. It can have certain values such as IORING_REGISTER_BUFFERS
, to register user buffers, or IORING_UNREGISTER_BUFFERS
, to release the previously registered buffers.arg
: Arguments passed to the operation being executed. Their type depends on the specific opcode
being passed.nr_args
: Number of arguments in arg
being passed.On success, the return value of this system call is either zero or a positive value, depending on the opcode
used.
An application might need to have different types of registered buffers for different I/O requests. Since kernel version 5.7, to facilitate managing these different sets of buffers, io_uring
allows the application to register a pool of buffers that are identified by a group ID. This is done using the IORING_REGISTER_PBUF_RING
opcode in the io_uring_register()
system call.
More precisely, the application starts by allocating a set of buffers that it wants to register. Then, it makes the io_uring_register()
system call with opcode IORING_REGISTER_PBUF_RING
, specifying a group ID with which these buffers should be associated, a start address of the buffers, the length of each buffer, the number of buffers, and a starting buffer ID. This can be done for multiple sets of buffers, each one having a different group ID.
Finally, when submitting a request, the application can use the IOSQE_BUFFER_SELECT
flag and provide the desired group ID to indicate that a provided buffer ring from the corresponding set should be used. When the operation has been completed, the buffer ID of the buffer used for the operation is passed to the application via the corresponding CQE.
Provided buffer rings can be unregistered via the io_uring_register()
system call using the IORING_UNREGISTER_PBUF_RING
opcode.
In addition to the buffers allocated by the application, since kernel version 6.4, io_uring
allows a user to delegate the allocation of provided buffer rings to the kernel. This is done using the IOU_PBUF_RING_MMAP
flag passed as an argument to io_uring_register()
. In this case, the application does not need to previously allocate these buffers, and therefore the start address of the buffers does not have to be passed to the system call. Then, after io_uring_register()
returns, the application can mmap()
the buffers into userspace with the offset set as:
IORING_OFF_PBUF_RING | (bgid >> IORING_OFF_PBUF_SHIFT)
where bgid
is the corresponding group ID. These offsets, as well as others used to mmap()
the io_uring
data, are defined in include/uapi/linux/io_uring.h
:
/*
* Magic offsets for the application to mmap the data it needs
*/
#define IORING_OFF_SQ_RING 0ULL
#define IORING_OFF_CQ_RING 0x8000000ULL
#define IORING_OFF_SQES 0x10000000ULL
#define IORING_OFF_PBUF_RING 0x80000000ULL
#define IORING_OFF_PBUF_SHIFT 16
#define IORING_OFF_MMAP_MASK 0xf8000000ULL
The function that handles such an mmap()
call is io_uring_mmap()
:
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/io_uring.c#L3439
static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
{
size_t sz = vma->vm_end - vma->vm_start;
unsigned long pfn;
void *ptr;
ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz);
if (IS_ERR(ptr))
return PTR_ERR(ptr);
pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
}
Note that remap_pfn_range()
ultimately creates a mapping with the VM_PFNMAP
flag set, which means that the MM subsystem will treat the base pages as raw page frame number mappings wihout an associated page
structure. In particular, the core kernel will not keep reference counts of these pages, and keeping track of it is the responsability of the calling code (in this case, the io_uring
subsystem).
The io_uring_enter()
system call is used to initiate and complete I/O using the SQ and CQ that have been previously set up via the io_uring_setup()
system call. Its prototype is the following:
int io_uring_enter(unsigned int fd, unsigned int to_submit, unsigned int min_complete, unsigned int flags, sigset_t *sig);
Its arguments are:
fd
: The io_uring
file descriptor returned by the io_uring_setup()
system call.to_submit
: Specifies the number of I/Os to submit from the SQ.flags
: A bitmask value that allows specifying certain options, such as IORING_ENTER_GETEVENTS
, IORING_ENTER_SQ_WAKEUP
, IORING_ENTER_SQ_WAIT
, etc.sig
: A pointer to a signal mask. If it is not NULL
, the system call replaces the current signal mask by the one pointed to by sig
, and when events become available in the CQ restores the original signal mask.The vulnerability can be triggered when an application registers a provided buffer ring with the IOU_PBUF_RING_MMAP
flag. In this case, the kernel allocates the memory for the provided buffer ring, instead of it being done by the application. To access the buffers, the application has to mmap()
them to get a virtual mapping. If the application later unregisters the provided buffer ring using the IORING_UNREGISTER_PBUF_RING
opcode, the kernel frees this memory and returns it to the page allocator. However, it does not have any mechanism to check whether the memory has been previously unmapped in userspace. If this has not been done, the application has a valid memory mapping to freed pages that can be reallocated by the kernel for other purposes. From this point, reading or writing to these pages will trigger a use-after-free.
The following code blocks show the affected parts of functions relevant to this vulnerability. Code snippets are demarcated by reference markers denoted by [N]
. Lines not relevant to this vulnerability are replaced by a [Truncated]
marker. The code corresponds to the Linux kernel version 6.5.3, which corresponds to the version used in the Ubuntu kernel 6.5.0-15-generic
.
The handler of the IORING_REGISTER_PBUF_RING
opcode for the io_uring_register()
system call is the io_register_pbuf_ring()
function, shown in the next listing.
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L537
int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
{
struct io_uring_buf_reg reg;
struct io_buffer_list *bl, *free_bl = NULL;
int ret;
[1]
if (copy_from_user(&reg, arg, sizeof(reg)))
return -EFAULT;
[Truncated]
if (!is_power_of_2(reg.ring_entries))
return -EINVAL;
[2]
/* cannot disambiguate full vs empty due to head/tail size */
if (reg.ring_entries >= 65536)
return -EINVAL;
if (unlikely(reg.bgid io_bl)) {
int ret = io_init_bl_list(ctx);
if (ret)
return ret;
}
bl = io_buffer_get_list(ctx, reg.bgid);
if (bl) {
/* if mapped buffer ring OR classic exists, don't allow */
if (bl->is_mapped || !list_empty(&bl->buf_list))
return -EEXIST;
} else {
[3]
free_bl = bl = kzalloc(sizeof(*bl), GFP_KERNEL);
if (!bl)
return -ENOMEM;
}
[4]
if (!(reg.flags & IOU_PBUF_RING_MMAP))
ret = io_pin_pbuf_ring(&reg, bl);
else
ret = io_alloc_pbuf_ring(&reg, bl);
[Truncated]
return ret;
}
The function starts by copying the provided arguments into an io_uring_buf_reg
structure reg
[1]. Then, it checks that the desired number of entries is a power of two and is strictly less than 65536 [2]. Note that this implies that the maximum number of allowed entries is 32768.
Next, it checks whether a provided buffer list with the specified group ID reg.bgid
exists and, in case it does not, an io_buffer_list
structure is allocated and its address is stored in the variable bl
[3]. Finally, if the provided arguments have the flag IOU_PBUF_RING_MMAP
set, the io_alloc_pbuf_ring()
function is called [4], passing in the address of the structure reg
, which contains the arguments passed to the system call, and the pointer to the allocated buffer list structure bl
.
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L519
static int io_alloc_pbuf_ring(struct io_uring_buf_reg *reg,
struct io_buffer_list *bl)
{
gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
size_t ring_size;
void *ptr;
[5]
ring_size = reg->ring_entries * sizeof(struct io_uring_buf_ring);
[6]
ptr = (void *) __get_free_pages(gfp, get_order(ring_size));
if (!ptr)
return -ENOMEM;
[7]
bl->buf_ring = ptr;
bl->is_mapped = 1;
bl->is_mmap = 1;
return 0;
}
The io_alloc_pbuf_ring()
function takes the number of ring entries specified in reg->ring_entries
and computes the resulting size ring_size
by multiplying it by the size of the io_uring_buf_ring
structure [5], which is 16 bytes. Then, it requests a number of pages from the page allocator that can fit this size via a call to __get_free_pages()
[6]. Note that for the maximum number of allowed ring entries, 32768, ring_size
is 524288 and thus the maximum number of 4096-byte pages that can be retrieved is 128. The address of the first page is then stored in the io_buffer_list
structure, more precisely in bl->buf_ring
[7]. Also, bl->is_mapped
and bl->is_mmap
are set to 1.
The handler of the IORING_UNREGISTER_PBUF_RING
opcode for the io_uring_register()
system call is the io_unregister_pbuf_ring()
function, shown in the next listing.
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L601
int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
{
struct io_uring_buf_reg reg;
struct io_buffer_list *bl;
[8]
if (copy_from_user(&reg, arg, sizeof(reg)))
return -EFAULT;
if (reg.resv[0] || reg.resv[1] || reg.resv[2])
return -EINVAL;
if (reg.flags)
return -EINVAL;
[9]
bl = io_buffer_get_list(ctx, reg.bgid);
if (!bl)
return -ENOENT;
if (!bl->is_mapped)
return -EINVAL;
[10]
__io_remove_buffers(ctx, bl, -1U);
if (bl->bgid >= BGID_ARRAY) {
xa_erase(&ctx->io_bl_xa, bl->bgid);
kfree(bl);
}
return 0;
}
Again, the function starts by copying the provided arguments into a io_uring_buf_reg
structure reg
[8]. Then, it retrieves the provided buffer list corresponding to the group ID specified in reg.bgid
and stores its address in the variable bl
[9]. Finally, it passes bl
to the function __io_remove_buffers()
[10].
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/io_uring/kbuf.c#L209
static int __io_remove_buffers(struct io_ring_ctx *ctx,
struct io_buffer_list *bl, unsigned nbufs)
{
unsigned i = 0;
/* shouldn't happen */
if (!nbufs)
return 0;
if (bl->is_mapped) {
i = bl->buf_ring->tail - bl->head;
if (bl->is_mmap) {
struct page *page;
[11]
page = virt_to_head_page(bl->buf_ring);
[12]
if (put_page_testzero(page))
free_compound_page(page);
bl->buf_ring = NULL;
bl->is_mmap = 0;
} else if (bl->buf_nr_pages) {
[Truncated]
In case the buffer list structure has the is_mapped
and is_mmap
flags set, which is the case when the buffer ring was registered with the IOU_PBUF_RING_MMAP
flag [7], the function reaches [11]. Then, the page
structure of the head page corresponding to the virtual address of the buffer ring bl->buf_ring
is obtained. Finally, all the pages forming the compound page with head page
are freed at [12], thus returning them to the page allocator.
Note that if the provided buffer ring is set up with IOU_PBUF_RING_MMAP
, that is, it has been allocated by the kernel and not the application, the userspace application is expected to have previously mmap()
ed this memory. Moreover, recall that since the memory mapping was created with the VM_PFNMAP
flag, the reference count of the page
structure was not modified during this operation. In other words, in the code above there is no way for the kernel to know whether the application has unmapped the memory before freeing it via the call to free_compound_page()
. If this has not happened, a use-after-free can be triggered by the application by just reading or writing to this memory.
The exploitation mechanism presented in this post relies on how memory allocation works on Linux, so the reader is expected to have some familiarity with it. As a refresher, we highlight the following facts:
One of such cache slabs is the filp
, which contains file
structures. A file
structure, shown in the next listing, represents an open file.
// Source: https://elixir.bootlin.com/linux/v6.5.3/source/include/linux/fs.h#L961
struct file {
union {
struct llist_node f_llist;
struct rcu_head f_rcuhead;
unsigned int f_iocb_flags;
};
/*
* Protects f_ep, f_flags.
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
fmode_t f_mode;
atomic_long_t f_count;
struct mutex f_pos_lock;
loff_t f_pos;
unsigned int f_flags;
struct fown_struct f_owner;
const struct cred *f_cred;
struct file_ra_state f_ra;
struct path f_path;
struct inode *f_inode; /* cached value */
const struct file_operations *f_op;
u64 f_version;
#ifdef CONFIG_SECURITY
void *f_security;
#endif
/* needed for tty driver, and maybe others */
void *private_data;
#ifdef CONFIG_EPOLL
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct hlist_head *f_ep;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
errseq_t f_wb_err;
errseq_t f_sb_err; /* for syncfs */
} __randomize_layout
__attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
The most relevant fields for this exploit are the following:
f_mode
: Determines whether the file is readable or writable.f_pos
: Determines the current reading or writing position.f_op
: The operations associated with the file. It determines the functions to be executed when certain system calls such as read()
, write()
, etc., are issued on the file. For files in ext4
filesystems, this is equal to the ext4_file_operations
variable.The exploit primitive provides an attacker with read and write access to a certain number of free pages that have been returned to the page allocator. By opening a file a large number of times, the attacker can force the exhaustion of all the slabs in the filp
cache, so that free pages are requested to the page allocator to create a new slab in this cache. In this case, further allocations of file
structures will happen in the pages on which the attacker has read and write access, thus being able to modify them. In particular, for example, by modifying the f_mode
field, the attacker can make a file that has been opened with read-only permissions to be writable.
This strategy was implemented to successfully exploit the following versions of Ubuntu:
6.5.0-15-generic
.6.5.0-17-generic
.6.5.0-15-generic
.6.5.0-17-generic
.The next subsections give more details on how this strategy can be carried out.
The strategy begins by triggering the vulnerability to obtain read and write access to freed pages. This can be done by executing the following steps:
io_uring_setup()
system call to set up the io_uring
instance.io_uring_register()
system call with opcode IORING_REGISTER_PBUF_RING
and the IOU_PBUF_RING_MMAP
flag, so that the kernel itself allocates the memory for the provided buffer ring.
mmap()
ing the memory of the provided buffer ring with read and write permissions, using the io_uring
file descriptor and the offset IORING_OFF_PBUF_RING
.io_uring_register()
system call with opcode IORING_UNREGISTER_PBUF_RING
. At this point, the pages corresponding to the provided buffer ring have been returned to the page allocator, while the attacker still has a valid reference to them.
The next step is spawning a large number of child processes, each one opening the file /etc/passwd
many times with read-only permissions. This forces the allocation of corresponding file
structures in the kernel.
By opening a large number of files, the attacker can force the exhaustion of the slabs in the filp
cache. After that, new slabs will be allocated by requesting free pages from the page allocator. At some point, the pages that previously corresponded to the provided buffer ring, and to which the attacker still has read and write access, will be returned by the page allocator.
Hence, all of the file
structures created after this point will be allocated in the attacker-controlled memory region, giving them the possibility to modify the structures.
Note that these child processes have to wait until indicated to proceed in the last stage of the exploit, so that the files are kept open and their corresponding structures are not freed.
Although the attacker may have access to some slabs belonging to the filp
cache, they don’t know where they are within the memory region. To identify these slabs, however, the attacker can search for the ext4_file_operations
address at the offset of the file.f_op
field within the file
structure. When one is found, it can be safely assumed that it corresponds to the file
structure of one instance of the previously opened /etc/passwd
file.
Note that even when Kernel Address Space Layout Randomization (KASLR) is enabled, to identify the ext4_file_operations
address in memory it is only necessary to know the offset of this symbol with respect to the _text
symbol, so there is no need for a KASLR bypass. Indeed, given a value val
of an unsigned integer found in memory at the corresponding offset, one can safely assume that it is the address of ext4_file_operations
if:
(val >> 32 & 0xffffffff) == 0xffffffff
, i.e. the 32 most significant bits are all 1.(val & 0xfffff) == (ext4_fops_offset & 0xfffff)
, i.e. the 20 least significant bits of val
and ext4_fops_offset
, the offset of ext4_file_operations
with respect to _text
, are the same.Once a file
structure corresponding to the /etc/passwd
file is located in the memory region accessible by the attacker, it can be modified at will. In particular, setting the FMODE_WRITE
and FMODE_CAN_WRITE
flags in the file.f_mode
field of the found structure will make the /etc/passwd
file writable when using the corresponding file descriptor.
Moreover, setting the file.f_pos
field of the found file
structure to the current size of the /etc/passwd/
file, the attacker can ensure that any data written to it is appended at the end of the file.
To finish, the attacker can signal all the child processes spawned in the second stage to try to write to the opened /etc/passwd
file. While most of all of such attempts will fail, as the file was opened with read-only permissions, the one corresponding to the modified file
structure, which has write permissions enabled due to the modification of the file->f_mode
field, will succeed.
To sum up, in this post we described a use-after-free vulnerability that was recently disclosed in the io_uring
subsystem of the Linux kernel, and a data-only exploit strategy was presented. This strategy proved to be realitvely simple to implement. During our tests it proved to be very reliable and, when it failed, it did not affect the stability of the system. This strategy allowed us to exploit up-to-date versions of Ubuntu during the patch gap window of about two months.
Our world class team of vulnerability researchers discover hundreds of exclusive Zero-Day vulnerabilities, providing our clients with proprietary knowledge before the adversaries find them. We also conduct N-Day research, where we select critical N-Day vulnerabilities and complete research to prove whether these vulnerabilities are truly exploitable in the wild.
For more information on our products and how we can help your vulnerability efforts, visit www.exodusintel.com or contact [email protected] for further discussion.