11/17/2023
Translating source code written in a high-level programming language into an executable binary typically involves a series of steps, namely compiling and assembling the code into object files, and then linking those object files into the final executable. However, there are certain scenarios where it can be useful to apply an alternate approach that involves executing object files directly, bypassing the linker. For example, we might use it for malware analysis or when part of the code requires an incompatible compiler. We’ll be focusing on the latter scenario: when one of our libraries needed to be compiled differently from the rest of the code. Learning how to execute an object file directly will give you a much better sense of how code is compiled and linked together.
To demonstrate how this was done, we have previously published a series of posts on executing an object file:
- How to execute an object file: Part 1
- How to execute an object file: Part 2
- How to execute an object file: Part 3
The initial posts are dedicated to the x86 architecture. Since then the fleet of our working machines has expanded to include a large and growing number of ARM CPUs. This time we’ll repeat this exercise for the aarch64 architecture. You can pause here to read the previous blog posts before proceeding with this one, or read through the brief summary below and reference the earlier posts for more detail. We might reiterate some theory as working with ELF files can be daunting, if it’s not your day-to-day routine. Also, please be mindful that for simplicity, these examples omit bounds and integrity checks. Let the journey begin!
Introduction
In order to obtain an object file or an executable binary from a high-level compiled programming language, the code needs to be processed by three components: the compiler, the assembler and the linker. The compiler generates an assembly listing. This assembly listing is picked up by the assembler and translated into an object file. If a program contains multiple source files, each of them goes through these two steps, generating one object file per source file. At the final step the linker unites all object files into one binary, additionally resolving references to shared libraries (e.g. we don’t implement the printf function each time, rather we take it from a system library). Even though the approach is platform independent, the compiler output varies by platform as the assembly listing is closely tied to the CPU architecture.
GCC (GNU Compiler Collection) can run each step: compiler, assembler and linker separately for us:
main.c:
#include <stdio.h>
int main(void)
{
puts("Hello, world!");
return 0;
}
Compiler (output main.s - assembly listing):
$ gcc -S main.c
$ ls
main.c main.s
Assembler (output main.o - an object file):
$ gcc -c main.s -o main.o
$ ls
main.c main.o main.s
Linker (main - an object file):
$ gcc main.o -o main
$ ls
main main.c main.o main.s
$ ./main
Hello, world!
All the examples assume gcc is running on a native aarch64 architecture or include a cross compilation flag for those who want to reproduce and have no aarch64.
We have two object files in the output above: main.o and main. Object files are files encoded with the ELF (Executable and Linkable Format) standard. Although main.o is an ELF file, it doesn’t contain all the information needed to be fully executable.
$ file main.o
main.o: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), not stripped
$ file main
main: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV), dynamically
linked, interpreter /lib/ld-linux-aarch64.so.1,
BuildID[sha1]=d3ecd2f8ac3b2dec11ed4cc424f15b3e1f130dd4, for GNU/Linux 3.7.0, not stripped
The ELF File
The central idea of this series of blog posts is to understand how to resolve dependencies from object files without directly involving the linker. For illustrative purposes we generated an object file based on some C-code and used it as a library for our main program. Before switching to the code, we need to understand the basics of the ELF structure.
Each ELF file is made up of one ELF header, followed by file data. The data can include: a program header table, a section header table, and the data which is referred to by the program or section header tables.
The ELF header provides some basic information about the file: what architecture the file is compiled for, the program entry point and the references to other tables.
The ELF Header:
$ readelf -h main
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Position-Independent Executable file)
Machine: AArch64
Version: 0x1
Entry point address: 0x640
Start of program headers: 64 (bytes into file)
Start of section headers: 68576 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 9
Size of section headers: 64 (bytes)
Number of section headers: 29
Section header string table index: 28
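As a rough illustration of how these header fields can be consumed programmatically, here is a minimal sketch that checks whether a buffer starts with a 64-bit little-endian AArch64 ELF header of a given type. The function name is_aarch64_elf is hypothetical (it is not part of the loader from this series), and the sketch assumes the Linux <elf.h> definitions and a little-endian host.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `buf` starts with a 64-bit little-endian
 * AArch64 ELF header of the given e_type (ET_REL for main.o, ET_DYN for the
 * PIE executable main), and 0 otherwise. */
static int is_aarch64_elf(const void *buf, size_t len, unsigned type)
{
    Elf64_Ehdr hdr;

    if (len < sizeof(hdr))
        return 0;
    /* copy the header out so that `buf` needs no particular alignment */
    memcpy(&hdr, buf, sizeof(hdr));

    return memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0 && /* 7f 45 4c 46 */
           hdr.e_ident[EI_CLASS] == ELFCLASS64 &&       /* Class: ELF64 */
           hdr.e_ident[EI_DATA] == ELFDATA2LSB &&       /* little endian */
           hdr.e_machine == EM_AARCH64 &&               /* Machine: AArch64 */
           hdr.e_type == type;
}
```

Unlike the rest of the examples, this one does check the buffer length, simply because the check is a single line here.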
The execution process of almost every program starts from an auxiliary program, called the loader, which arranges the memory and calls the program’s entry point. In the following output the loader is marked with the line “Requesting program interpreter: /lib/ld-linux-aarch64.so.1”. The whole program memory is split into different segments with associated size, permissions and type (which instructs the loader on how to interpret this block of memory). Because the execution process should be performed in the shortest possible time, the sections with the same characteristics and located nearby are grouped into bigger blocks, called segments, and placed in the program header. We can say that the program header summarizes the types of data that appear in the section header.
The ELF Program Header:
$ readelf -Wl main
Elf file type is DYN (Position-Independent Executable file)
Entry point 0x640
There are 9 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R 0x8
INTERP 0x000238 0x0000000000000238 0x0000000000000238 0x00001b 0x00001b R 0x1
[Requesting program interpreter: /lib/ld-linux-aarch64.so.1]
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x00088c 0x00088c R E 0x10000
LOAD 0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000270 0x000278 RW 0x10000
DYNAMIC 0x00fdd8 0x000000000001fdd8 0x000000000001fdd8 0x0001e0 0x0001e0 RW 0x8
NOTE 0x000254 0x0000000000000254 0x0000000000000254 0x000044 0x000044 R 0x4
GNU_EH_FRAME 0x0007a0 0x00000000000007a0 0x00000000000007a0 0x00003c 0x00003c R 0x4
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW 0x10
GNU_RELRO 0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000238 0x000238 R 0x1
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame
03 .init_array .fini_array .dynamic .got .got.plt .data .bss
04 .dynamic
05 .note.gnu.build-id .note.ABI-tag
06 .eh_frame_hdr
07
08 .init_array .fini_array .dynamic .got
In the source code of high-level languages, variables, functions, and constants are mixed together. However, in assembly you might see that the data and instructions are separated into different blocks. The ELF file content is divided in an even more granular way. For example, variables with initial values are placed into different sections than the uninitialized ones. This approach optimizes for space, otherwise the values for uninitialized variables would be filled with zeros. Along with the space efficiency, there are security reasons for stratification — executable instructions can’t have writable permissions, while memory containing variables can't be executable. The section header describes each of these sections.
The ELF Section Header:
$ readelf -SW main
There are 29 section headers, starting at offset 0x10be0:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .interp PROGBITS 0000000000000238 000238 00001b 00 A 0 0 1
[ 2] .note.gnu.build-id NOTE 0000000000000254 000254 000024 00 A 0 0 4
[ 3] .note.ABI-tag NOTE 0000000000000278 000278 000020 00 A 0 0 4
[ 4] .gnu.hash GNU_HASH 0000000000000298 000298 00001c 00 A 5 0 8
[ 5] .dynsym DYNSYM 00000000000002b8 0002b8 0000f0 18 A 6 3 8
[ 6] .dynstr STRTAB 00000000000003a8 0003a8 000092 00 A 0 0 1
[ 7] .gnu.version VERSYM 000000000000043a 00043a 000014 02 A 5 0 2
[ 8] .gnu.version_r VERNEED 0000000000000450 000450 000030 00 A 6 1 8
[ 9] .rela.dyn RELA 0000000000000480 000480 0000c0 18 A 5 0 8
[10] .rela.plt RELA 0000000000000540 000540 000078 18 AI 5 22 8
[11] .init PROGBITS 00000000000005b8 0005b8 000018 00 AX 0 0 4
[12] .plt PROGBITS 00000000000005d0 0005d0 000070 00 AX 0 0 16
[13] .text PROGBITS 0000000000000640 000640 000134 00 AX 0 0 64
[14] .fini PROGBITS 0000000000000774 000774 000014 00 AX 0 0 4
[15] .rodata PROGBITS 0000000000000788 000788 000016 00 A 0 0 8
[16] .eh_frame_hdr PROGBITS 00000000000007a0 0007a0 00003c 00 A 0 0 4
[17] .eh_frame PROGBITS 00000000000007e0 0007e0 0000ac 00 A 0 0 8
[18] .init_array INIT_ARRAY 000000000001fdc8 00fdc8 000008 08 WA 0 0 8
[19] .fini_array FINI_ARRAY 000000000001fdd0 00fdd0 000008 08 WA 0 0 8
[20] .dynamic DYNAMIC 000000000001fdd8 00fdd8 0001e0 10 WA 6 0 8
[21] .got PROGBITS 000000000001ffb8 00ffb8 000030 08 WA 0 0 8
[22] .got.plt PROGBITS 000000000001ffe8 00ffe8 000040 08 WA 0 0 8
[23] .data PROGBITS 0000000000020028 010028 000010 00 WA 0 0 8
[24] .bss NOBITS 0000000000020038 010038 000008 00 WA 0 0 1
[25] .comment PROGBITS 0000000000000000 010038 00001f 01 MS 0 0 1
[26] .symtab SYMTAB 0000000000000000 010058 000858 18 27 66 8
[27] .strtab STRTAB 0000000000000000 0108b0 00022c 00 0 0 1
[28] .shstrtab STRTAB 0000000000000000 010adc 000103 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
D (mbind), p (processor specific)
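To connect the Flg column with memory permissions, here is a minimal sketch of how section flags could be translated into mmap/mprotect protection bits. The helper name section_prot is an assumption made for illustration; it only relies on the SHF_* constants from <elf.h> and the PROT_* constants from <sys/mman.h>.

```c
#include <elf.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: map ELF section flags to memory protection bits.
 * Sections without SHF_ALLOC (.comment, .symtab, ...) are not loaded at all. */
static int section_prot(uint64_t sh_flags)
{
    int prot;

    if (!(sh_flags & SHF_ALLOC))
        return PROT_NONE;

    prot = PROT_READ;              /* A: every loaded section is readable */
    if (sh_flags & SHF_WRITE)
        prot |= PROT_WRITE;        /* W: .data, .bss, .got, ... */
    if (sh_flags & SHF_EXECINSTR)
        prot |= PROT_EXEC;         /* X: .text, .plt, .init, .fini */
    return prot;
}
```

With the table above, .text (AX) would map to read and execute, while .data (WA) would map to read and write, which matches the separation between code and data described earlier.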
Executing example from Part 1 on aarch64
Actually, our initial code from Part 1 works on aarch64 as is!
Let’s have a quick summary about what was done in the code:
- We need to find the code of two functions (add5 and add10) in the .text section of our object file (obj.o)
- Load the functions in the executable memory
- Return the memory locations of the functions to the main program
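The second step, loading code into executable memory, could look roughly like the sketch below. It is not the loader from Part 1, only a condensed illustration of the same idea: the function name load_code is hypothetical, and the sketch assumes Linux with anonymous mmap available.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: copy `len` bytes of machine code into fresh memory
 * and make the copy read-only and executable. Returns NULL on failure. */
static void *load_code(const void *code, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = (len + page - 1) & ~(page - 1); /* round up to whole pages */
    void *mem;

    /* map the memory writable first, so that the code can be copied in */
    mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    memcpy(mem, code, len);

    /* then drop the write permission and add the execute permission */
    if (mprotect(mem, size, PROT_READ | PROT_EXEC)) {
        munmap(mem, size);
        return NULL;
    }
    return mem;
}
```

The returned pointer plus a function’s offset inside .text is what the main program would then cast to a function pointer and call.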
There is one nuance: even though all the sections are listed in the section header, none of them has a string name. Without the names we can’t identify them. However, having an additional character field for each section in the ELF structure would waste space: it would have to be limited to some maximum length, and names shorter than that would leave the space unfilled. Instead, ELF provides an additional section, .shstrtab. This string table concatenates all the names, each terminated by a null byte. Every section header holds an offset into this table, so we can iterate over the sections and follow that offset to get each name. But how do we find .shstrtab itself if we don’t have a name? To solve this chicken and egg problem, the ELF header provides a direct pointer to .shstrtab (the “Section header string table index” field in the readelf output above). A similar approach is applied to two other sections: .symtab and .strtab, where .symtab contains all the information about the symbols and .strtab holds the list of symbol names. In the code we work with these tables to resolve all their dependencies and find our functions.
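The name lookup described here can be sketched in a few lines. The function name find_section_by_name is an assumption made for this illustration and may differ from the helper used in the actual loader; like the other examples, it omits bounds checks and assumes the ELF image is mapped at a suitably aligned address.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: find a section header by name in an ELF64 image
 * mapped at `base`. Returns NULL if no section has that name. */
static const Elf64_Shdr *find_section_by_name(const uint8_t *base, const char *name)
{
    const Elf64_Ehdr *hdr = (const Elf64_Ehdr *)base;
    const Elf64_Shdr *sections = (const Elf64_Shdr *)(base + hdr->e_shoff);
    /* e_shstrndx is the index of .shstrtab in the section header table */
    const char *shstrtab = (const char *)(base + sections[hdr->e_shstrndx].sh_offset);

    for (Elf64_Half i = 0; i < hdr->e_shnum; i++) {
        /* sh_name is an offset into .shstrtab, not a string */
        if (strcmp(shstrtab + sections[i].sh_name, name) == 0)
            return &sections[i];
    }
    return NULL;
}
```

Calling it with ".text", ".symtab" or ".strtab" would return the corresponding headers, whose sh_offset and sh_size fields locate the section contents in the file.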
Executing example from Part 2 on aarch64
At the beginning of the second blog post we made the function add10 depend on add5 instead of being self-contained. This is the first time we faced relocations. Relocation is the process of resolving references to symbols defined outside the current scope. The relocated symbols can represent global or thread-local variables, constants, functions, etc. We’ll start by checking the assembly instructions which trigger relocations and uncovering how the ELF format handles them in a more general way.
After making add10 depend on add5 our aarch64 version stopped working as well, similarly to the x86 one. Let’s take a look at the assembly listing:
$ objdump --disassemble --section=.text obj.o
obj.o: file format elf64-littleaarch64
Disassembly of section .text:
0000000000000000 <add5>:
0: d10043ff sub sp, sp, #0x10
4: b9000fe0 str w0, [sp, #12]
8: b9400fe0 ldr w0, [sp, #12]
c: 11001400 add w0, w0, #0x5
10: 910043ff add sp, sp, #0x10
14: d65f03c0 ret
0000000000000018 <add10>:
18: a9be7bfd stp x29, x30, [sp, #-32]!
1c: 910003fd mov x29, sp
20: b9001fe0 str w0, [sp, #28]
24: b9401fe0 ldr w0, [sp, #28]
28: 94000000 bl 0 <add5>
2c: b9001fe0 str w0, [sp, #28]
30: b9401fe0 ldr w0, [sp, #28]
34: 94000000 bl 0 <add5>
38: a8c27bfd ldp x29, x30, [sp], #32
3c: d65f03c0 ret
Have you noticed that all the hex values in the second column are exactly the same length, in contrast with the instruction lengths seen for x86 in Part 2 of our series? This is because all Armv8-A instructions are encoded in 32 bits. Since it is impossible to fit every immediate value into fewer than 32 bits, some operations require more than one instruction, as we’ll see later. For now, we’re interested in one instruction - bl (branch with link) at offsets 28 and 34. The bl is a “jump” instruction, but before the jump it saves the address of the instruction following it in the link register (lr). When the callee finishes execution, the return address is recovered from lr. Usually, aarch64 instructions reserve the top 6 bits [31:26] for the opcode and some auxiliary fields such as the operand width (32 or 64 bits), condition flags and others. The remaining bits are shared between arguments like the source register, destination register and immediate value. Since the bl instruction does not require a source or destination register, the full 26 bits can be used to encode the immediate offset instead. On its own, a signed 26-bit value can only encode a small range (+/-32 MB). However, because the jump can only target the beginning of an instruction, the target is always aligned to 4 bytes, so the immediate counts instructions rather than bytes, which increases the effective range fourfold, to +/-128 MB.
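The range limit can be expressed as a small check. This is only a sketch under the assumptions stated in the paragraph above (a signed 26-bit immediate counted in 4-byte units); the name bl_can_reach is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if a `bl` can reach a target that is
 * `offset` bytes away from the instruction itself, and 0 otherwise. */
static int bl_can_reach(int64_t offset)
{
    /* signed 26-bit immediate counted in 4-byte instructions: +/-128 MB */
    const int64_t limit = (int64_t)1 << 27;

    if (offset % 4 != 0)
        return 0; /* branch targets are always instruction-aligned */
    return offset >= -limit && offset < limit;
}
```

A check like this matters whenever the branch target is not part of the same .text copy, since memory obtained from separate allocations is not guaranteed to be within 128 MB of the caller.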
Similarly to what we did in Part 2 we’re going to resolve our relocations - first by manually calculating the correct addresses and then by using an approach similar to what the linker does. The current value of our bl instruction is 94000000 or in binary representation 10010100000000000000000000000000. All 26 immediate bits are zeros, so we don’t jump anywhere. The address is calculated as an offset from the current program counter (pc), which can be positive or negative. In our case we expect the offsets to be -0x28 and -0x34. As described above, each offset should be divided by 4 and taken as two’s complement: -0x28 / 4 = -0xA == 0xFFFFFFF6 and -0x34 / 4 = -0xD == 0xFFFFFFF3. From these values we need to take the lower 26 bits and concatenate them with the initial 6 opcode bits to get the final instruction. So, the final ones will be: 10010111111111111111111111110110 == 0x97FFFFF6 and 10010111111111111111111111110011 == 0x97FFFFF3. Have you noticed that all the distance calculations are done relative to the bl itself (the current pc), not the next instruction as in x86?
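The manual calculation can be double-checked with a few lines of C. The helper names encode_bl and decode_bl_offset are made up for this sketch; the constants follow directly from the bit layout described above.

```c
#include <stdint.h>

#define BL_OPCODE     0x94000000u /* top 6 bits: 100101 */
#define BL_IMM26_MASK 0x03ffffffu /* low 26 bits: the immediate */

/* Hypothetical helper: build a `bl` that jumps `offset` bytes relative to
 * the instruction itself. `offset` must be a multiple of 4. */
static uint32_t encode_bl(int64_t offset)
{
    /* divide by 4, then keep the low 26 bits of the two's complement value */
    uint32_t imm26 = (uint32_t)(offset / 4) & BL_IMM26_MASK;

    return BL_OPCODE | imm26;
}

/* The reverse operation: recover the signed byte offset from a `bl`. */
static int64_t decode_bl_offset(uint32_t insn)
{
    int64_t imm26 = insn & BL_IMM26_MASK;

    if (imm26 & 0x02000000) /* sign bit of the 26-bit field */
        imm26 -= 0x04000000;
    return imm26 * 4;
}
```

Here encode_bl(-0x28) yields 0x97FFFFF6 and encode_bl(-0x34) yields 0x97FFFFF3, matching the values calculated by hand.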
Let’s add to the code and execute:
...
static void parse_obj(void)
{
...
/* copy the contents of `.text` section from the ELF file */
memcpy(text_runtime_base, obj.base + text_hdr->sh_offset, text_hdr->sh_size);
*((uint32_t *)(text_runtime_base + 0x28)) = 0x97FFFFF6;
*((uint32_t *)(text_runtime_base + 0x34)) = 0x97FFFFF3;
/* make the `.text` copy readonly and executable */
if (mprotect(text_runtime_base, page_align(text_hdr->sh_size), PROT_READ | PROT_EXEC)) {
...
Compile and run:
$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
It works! But this is not how the linker handles the relocations. The linker resolves relocation based on the type and formula assigned to this type. In our Part 2 we investigated it quite well. Here again we need to find the type and check the formula for this type:
$ readelf --relocs obj.o
Relocation section '.rela.text' at offset 0x228 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000028 000a0000011b R_AARCH64_CALL26 0000000000000000 add5 + 0
000000000034 000a0000011b R_AARCH64_CALL26 0000000000000000 add5 + 0
Relocation section '.rela.eh_frame' at offset 0x258 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000001c 000200000105 R_AARCH64_PREL32 0000000000000000 .text + 0
000000000034 000200000105 R_AARCH64_PREL32 0000000000000000 .text + 18
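The Info column packs two values into one 64-bit field: the symbol index in the upper 32 bits and the relocation type in the lower 32 bits. The sketch below unpacks it with the ELF64_R_SYM and ELF64_R_TYPE macros from <elf.h>; the wrapper function names are hypothetical and exist only to make the example easy to call.

```c
#include <elf.h>
#include <stdint.h>

/* Hypothetical wrappers around the standard <elf.h> macros. */

/* lower 32 bits of r_info: the relocation type */
static uint32_t reloc_type(uint64_t r_info)
{
    return (uint32_t)ELF64_R_TYPE(r_info);
}

/* upper 32 bits of r_info: index into the symbol table (.symtab) */
static uint32_t reloc_symbol_index(uint64_t r_info)
{
    return (uint32_t)ELF64_R_SYM(r_info);
}
```

For the first entry above, 0x000a0000011b splits into symbol index 0xa and type 0x11b, which is 283 in decimal.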
Our Type is R_AARCH64_CALL26 and the formula for it is:
ELF64 Code | Name | Operation
283 | R_<CLS>_CALL26 | S + A - P
where:
- S (when used on its own) is the address of the symbol
- A is the addend for the relocation
- P is the address of the place being relocated (derived from r_offset)
Here are the relevant changes to loader.c:
/* Replace `#define R_X86_64_PLT32 4` with our Type */
#define R_AARCH64_CALL26 283
...
static void do_text_relocations(void)
{
...
uint32_t val;
switch (type)
{
case R_AARCH64_CALL26:
/* The mask separates opcode (6 bits) and the immediate value */
uint32_t mask_bl = (0xffffffff << 26);
/* S+A-P, divided by 4 */
val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
/* Concatenate opcode and value to get final instruction */
*((uint32_t *)patch_offset) &= mask_bl;
val &= ~mask_bl;
*((uint32_t *)patch_offset) |= val;
break;
}
...
}
Compile and run:
$ gcc -o loader loader.c
$ ./loader
Calculated relocation: 0x97fffff6
Calculated relocation: 0x97fffff3
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
So far so good. The next challenge is to add constant data and global variables to our object file and check relocations again:
$ readelf --relocs --wide obj.o
Relocation section '.rela.text' at offset 0x388 contains 8 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000500000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .rodata + 0
0000000000000004 0000000500000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .rodata + 0
000000000000000c 0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000010 0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000024 0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000028 0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000068 000000110000011b R_AARCH64_CALL26 0000000000000040 add5 + 0
0000000000000074 000000110000011b R_AARCH64_CALL26 0000000000000040 add5 + 0
...
We even have two new relocation types: R_AARCH64_ADD_ABS_LO12_NC and R_AARCH64_ADR_PREL_PG_HI21. Their formulas are:
ELF64 Code | Name | Operation
275 | R_<CLS>_ADR_PREL_PG_HI21 | Page(S+A) - Page(P)
277 | R_<CLS>_ADD_ABS_LO12_NC | S + A
where:
- Page(expr) is the page address of the expression expr, defined as (expr & ~0xFFF). (This applies even if the machine page size supported by the platform has a different value.)
It’s a bit unclear why we have two new types, while in x86 we had only one. Let’s investigate the assembly code:
$ objdump --disassemble --section=.text obj.o
obj.o: file format elf64-littleaarch64
Disassembly of section .text:
0000000000000000 <get_hello>:
0: 90000000 adrp x0, 0 <get_hello>
4: 91000000 add x0, x0, #0x0
8: d65f03c0 ret
000000000000000c <get_var>:
c: 90000000 adrp x0, 0 <get_hello>
10: 91000000 add x0, x0, #0x0
14: b9400000 ldr w0, [x0]
18: d65f03c0 ret
000000000000001c <set_var>:
1c: d10043ff sub sp, sp, #0x10
20: b9000fe0 str w0, [sp, #12]
24: 90000000 adrp x0, 0 <get_hello>
28: 91000000 add x0, x0, #0x0
2c: b9400fe1 ldr w1, [sp, #12]
30: b9000001 str w1, [x0]
34: d503201f nop
38: 910043ff add sp, sp, #0x10
3c: d65f03c0 ret
We see that all adrp instructions are followed by add instructions. The add instruction adds an immediate value to the source register and writes the result to the destination register. The source and destination registers can be the same, and the immediate value is 12 bits. The adrp instruction generates a pc-relative (program counter) address and writes the result to the destination register. It takes the pc of the instruction itself, with the low 12 bits cleared, and adds a 21-bit immediate value shifted left by 12 bits. If the immediate value weren’t shifted it would cover a range of only +/-1 MB, which isn’t enough. The left shift increases the range up to +/-4 GB. However, because the low 12 bits are masked out by the shift, we need to store them somewhere and restore them later. That’s why we see an add instruction following adrp and two relocation types instead of one. Also, it’s a bit tricky to encode adrp: the 2 low bits of the immediate value are placed in positions 30:29 and the rest in positions 23:5. Due to size limitations, the aarch64 instructions try to make the most out of 32 bits.
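To make the split between the two instructions concrete, here is a standalone sketch that patches an adrp/add pair for a given target address and then simulates what the CPU would compute. All function names are hypothetical, and the addresses in the usage note are arbitrary example values, not taken from a real run.

```c
#include <stdint.h>

#define PAGE(x) ((x) & ~(uint64_t)0xFFF)

/* Patch an `adrp` so that it yields Page(target) when executed at `place`. */
static uint32_t patch_adrp(uint32_t insn, uint64_t target, uint64_t place)
{
    /* page distance in 4 KB units, truncated to the 21-bit immediate */
    uint32_t imm21 = (uint32_t)((PAGE(target) - PAGE(place)) >> 12) & 0x1FFFFF;
    uint32_t immlo = (imm21 & 0x3) << 29; /* low 2 bits go to [30:29] */
    uint32_t immhi = (imm21 >> 2) << 5;   /* upper 19 bits go to [23:5] */

    insn &= ~((0x3u << 29) | (0x7FFFFu << 5)); /* clear the old immediate */
    return insn | immlo | immhi;
}

/* Patch an `add` (immediate) with the low 12 bits of target: imm12 at [21:10]. */
static uint32_t patch_add_lo12(uint32_t insn, uint64_t target)
{
    insn &= ~(0xFFFu << 10);
    return insn | ((uint32_t)(target & 0xFFF) << 10);
}

/* What the patched adrp+add pair computes when the adrp sits at `place`. */
static uint64_t simulate_adrp_add(uint32_t adrp, uint32_t add, uint64_t place)
{
    uint64_t imm21 = (((adrp >> 5) & 0x7FFFF) << 2) | ((adrp >> 29) & 0x3);
    int64_t pages = (int64_t)imm21;

    if (pages & 0x100000) /* sign bit of the 21-bit field */
        pages -= 0x200000;
    return PAGE(place) + (uint64_t)pages * 4096 + ((add >> 10) & 0xFFF);
}
```

For example, with target 0x412345 and place 0x400010, patching 0x90000000 (adrp x0) gives 0xD0000080 and patching 0x91000000 (add x0, x0, #0x0) gives 0x910D1400; the simulated pair then evaluates back to 0x412345.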
In the code we are going to use the formulas to calculate the values, and the descriptions of the adrp and add instructions to obtain the final opcode:
#define R_AARCH64_CALL26 283
#define R_AARCH64_ADD_ABS_LO12_NC 277
#define R_AARCH64_ADR_PREL_PG_HI21 275
...
{
case R_AARCH64_CALL26:
/* The mask separates opcode (6 bits) and the immediate value */
uint32_t mask_bl = (0xffffffff << 26);
/* S+A-P, divided by 4 */
val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
/* Concatenate opcode and value to get final instruction */
*((uint32_t *)patch_offset) &= mask_bl;
val &= ~mask_bl;
*((uint32_t *)patch_offset) |= val;
break;
case R_AARCH64_ADD_ABS_LO12_NC:
/* The mask of `add` instruction to separate
* opcode, registers and calculated value
*/
uint32_t mask_add = 0b11111111110000000000001111111111;
/* S + A: only the low 12 bits are used, placed in the imm12 field [21:10] */
val = (((uint64_t)(symbol_address + relocations[i].r_addend)) & 0xFFF) << 10;
*((uint32_t *)patch_offset) &= mask_add;
/* Final instruction */
*((uint32_t *)patch_offset) |= val;
break;
case R_AARCH64_ADR_PREL_PG_HI21:
/* Page(S+A)-Page(P), Page(expr) is defined as (expr & ~0xFFF).
* Kept as a signed 64-bit value so that negative page distances survive.
*/
int64_t page_delta = (int64_t)((((uint64_t)(symbol_address + relocations[i].r_addend)) & ~0xFFF) - (((uint64_t)patch_offset) & ~0xFFF));
/* Shift right the calculated value by 12 bits.
* During decoding it will be shifted left as described above,
* so we do the opposite.
*/
page_delta >>= 12;
/* Separate the lower 2 bits (immlo, [30:29]) and the upper 19 bits (immhi, [23:5]) */
uint32_t immlo = ((uint32_t)page_delta & 0x3) << 29;
uint32_t immhi = (((uint32_t)page_delta >> 2) & 0x7ffff) << 5;
*((uint32_t *)patch_offset) |= immlo;
*((uint32_t *)patch_offset) |= immhi;
break;
}
Compile and run:
$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42
It works! The final code is here.
Executing example from Part 3 on aarch64
Our Part 3 is about resolving external dependencies. When we write code we don’t think much about how to allocate memory or print debug information to the console. Instead, we call functions from the system libraries. But the code of system libraries needs to be made available to our programs somehow. Additionally, for optimization purposes, it would be nice if this code were stored in one place and shared between all programs. And another wish: we don’t want to resolve all the functions and global variables from the libraries, only those which we need and at those times when we need them. To solve these problems, ELF introduced two sections: PLT (the procedure linkage table) and GOT (the global offset table). The dynamic loader creates a list which contains all external functions and variables from the shared library, but doesn’t resolve them immediately; instead they are placed in the PLT section. Each external symbol is represented by a small function, a stub, e.g. puts@plt. When an external symbol is requested, the stub checks if it was resolved previously. If not, the stub searches for the absolute address of the symbol, writes it in the GOT and returns it to the requester. The next time, the address is returned directly from the GOT.
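The resolve-once-then-cache behaviour can be modelled in plain C. This is a deliberately simplified toy, not how the real dynamic loader or a real PLT stub is implemented: got_puts stands in for a GOT slot, puts_via_plt for the stub, and resolve_puts for the symbol lookup. All of these names are invented for the illustration.

```c
#include <stddef.h>
#include <stdio.h>

typedef int (*puts_fn)(const char *);

static puts_fn got_puts;  /* toy "GOT" slot, empty until first use */
static int resolve_count; /* how many times the slow path has run */

/* Stand-in for the symbol lookup. Assumption: a real loader would search
 * the loaded shared libraries by name instead of returning a constant. */
static puts_fn resolve_puts(void)
{
    resolve_count++;
    return puts;
}

/* Toy "PLT stub": resolve on the first call, then reuse the cached address. */
static int puts_via_plt(const char *s)
{
    if (got_puts == NULL)
        got_puts = resolve_puts();
    return got_puts(s);
}
```

Calling puts_via_plt twice prints both strings but runs resolve_puts only once; the second call goes straight through the cached pointer.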
In Part 3 we implemented a simplified PLT/GOT resolution. Firstly, we added a new function say_hello in obj.c, which calls the unresolved system library function puts. Then we added an optional wrapper my_puts in loader.c. The wrapper isn’t required, we could’ve resolved directly to the standard function, but it’s a good example of how the implementation of some functions can be overwritten with custom code. In the next steps we added our PLT/GOT resolution:
- PLT section we replaced with a jumptable
- GOT we replaced with assembly instructions
Basically, we created a small stub with assembly code (our jumptable) to resolve the global address of our my_puts wrapper and jump to it.
The approach for aarch64 is the same. But the jumptable is very different as it consists of different assembly instructions.
The big difference here compared to the other parts is that we need to work with a 64-bit address for the GOT resolution. Our custom PLT or jumptable is placed close to the main code of obj.c and can operate with the relative addresses as before. For the GOT, or referencing the my_puts wrapper, we’ll use different branch instructions: br or blr. These instructions branch to the address held in a register, and aarch64 registers can hold 64-bit values.
We can check how it resolves with the native PLT/GOT in our loader assembly code:
$ objdump --disassemble --section=.text loader
...
1d2c: 97fffb45 bl a40 <puts@plt>
1d30: f94017e0 ldr x0, [sp, #40]
1d34: d63f0000 blr x0
...
The first instruction is a bl jump to the puts@plt stub. The next ldr instruction tells us that some value was loaded into the register x0 from the stack. Each function has its own stack frame to hold the local variables. The last blr instruction makes a jump to the address stored in the x0 register. There is a convention in the register naming: if the stored value is 64 bits wide then the register is called x0-x30; if only 32 bits are used then it’s called w0-w30 (the value will be stored in the lower 32 bits and the upper 32 bits will be zeroed).
We need to do something similar: place the absolute address of our my_puts wrapper in some register and call br on this register. We don’t need to store the link before branching, because the call will return to say_hello from obj.c, which is why a plain br will be enough. Let’s check the assembly of a simple C function:
hello.c:
#include <stdint.h>
void say_hello(void)
{
uint64_t reg = 0x555555550c14;
}
$ gcc -c hello.c
$ objdump --disassemble --section=.text hello.o
hello.o: file format elf64-littleaarch64
Disassembly of section .text:
0000000000000000 <say_hello>:
0: d10043ff sub sp, sp, #0x10
4: d2818280 mov x0, #0xc14 // #3092
8: f2aaaaa0 movk x0, #0x5555, lsl #16
c: f2caaaa0 movk x0, #0x5555, lsl #32
10: f90007e0 str x0, [sp, #8]
14: d503201f nop
18: 910043ff add sp, sp, #0x10
1c: d65f03c0 ret
The number 0x555555550c14 is the address returned by lookup_ext_function. We’ve printed it out to use as an example, but any 48-bit hex value can be used.
In our output we see that the value was split into three parts and written to the x0 register with three instructions: one mov and two movk. The documentation says that there are only 16 bits for the immediate value, but a shift can be applied (in our case a left shift, lsl).
However, we can’t use x0 in our context. By convention the registers x0-x7 are caller-saved and used to pass function parameters between calls to other functions. Let’s use x9 then.
We need to modify our loader. Firstly let’s change the jumptable structure.
loader.c:
...
struct ext_jump {
uint32_t instr[4];
};
...
As we saw above, we need four instructions: mov, movk, movk, br. We don’t need a stack frame as we aren’t preserving any local variables. We just want to load the address into the register and branch to it. But we can’t write human-readable code, e.g. mov x0, #0xc14, into instructions; we need the machine binary or hex representation, e.g. d2818280.
Let’s write a simple assembly code to get it:
hw.s:
.global _start
_start: mov x9, #0xc14
movk x9, #0x5555, lsl #16
movk x9, #0x5555, lsl #32
br x9
$ as -o hw.o hw.s
$ objdump --disassemble --section=.text hw.o
hw.o: file format elf64-littleaarch64
Disassembly of section .text:
0000000000000000 <_start>:
0: d2818289 mov x9, #0xc14 // #3092
4: f2aaaaa9 movk x9, #0x5555, lsl #16
8: f2caaaa9 movk x9, #0x5555, lsl #32
c: d61f0120 br x9
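The encodings in this listing can also be reproduced without the assembler. The sketch below builds them from their fields: a base opcode, a 2-bit shift selector at bits 22:21, the 16-bit immediate at bits 20:5 and the destination register at bits 4:0. Note that mov with a 16-bit immediate is an alias of movz, which is why the helper carries that name. The helper names are hypothetical, and only the low 48 bits of the address are handled, as in the listing.

```c
#include <stdint.h>

/* movz Xd, #imm16, lsl #(16 * hw): writes imm16 and zeroes all other bits */
static uint32_t encode_movz(unsigned rd, uint16_t imm16, unsigned hw)
{
    return 0xD2800000u | ((hw & 0x3u) << 21) | ((uint32_t)imm16 << 5) | (rd & 0x1Fu);
}

/* movk Xd, #imm16, lsl #(16 * hw): writes imm16 and keeps all other bits */
static uint32_t encode_movk(unsigned rd, uint16_t imm16, unsigned hw)
{
    return 0xF2800000u | ((hw & 0x3u) << 21) | ((uint32_t)imm16 << 5) | (rd & 0x1Fu);
}

/* Fill a 4-instruction stub that loads the low 48 bits of `addr` into x9
 * and branches to it. */
static void build_stub(uint32_t instr[4], uint64_t addr)
{
    instr[0] = encode_movz(9, (uint16_t)addr, 0);         /* bits 15:0  */
    instr[1] = encode_movk(9, (uint16_t)(addr >> 16), 1); /* bits 31:16 */
    instr[2] = encode_movk(9, (uint16_t)(addr >> 32), 2); /* bits 47:32 */
    instr[3] = 0xD61F0120u;                               /* br x9 */
}
```

For the address 0x555555550c14 this produces d2818289, f2aaaaa9, f2caaaa9 and d61f0120, the same four words as in the objdump output.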
Almost done! But there’s one more thing to consider. Even if the value 0x555555550c14 is a real my_puts wrapper address, it will be different on each run if ASLR (address space layout randomization) is enabled. We need to patch these instructions to put in the value which will be returned by lookup_ext_function on each run. We’ll split the obtained value into three parts, 16 bits each, and place them in our mov and movk instructions according to the documentation, similar to what we did before in the second part.
if (symbols[symbol_idx].st_shndx == SHN_UNDEF) {
static int curr_jmp_idx = 0;
uint64_t addr = lookup_ext_function(strtab + symbols[symbol_idx].st_name);
uint32_t mov = 0b11010010100000000000000000001001 | ((addr << 48) >> 43);
uint32_t movk1 = 0b11110010101000000000000000001001 | (((addr >> 16) << 48) >> 43);
uint32_t movk2 = 0b11110010110000000000000000001001 | (((addr >> 32) << 48) >> 43);
jumptable[curr_jmp_idx].instr[0] = mov; // mov x9, #0x0c14
jumptable[curr_jmp_idx].instr[1] = movk1; // movk x9, #0x5555, lsl #16
jumptable[curr_jmp_idx].instr[2] = movk2; // movk x9, #0x5555, lsl #32
jumptable[curr_jmp_idx].instr[3] = 0xd61f0120; // br x9
symbol_address = (uint8_t *)(&jumptable[curr_jmp_idx].instr[0]);
curr_jmp_idx++;
} else {
symbol_address = section_runtime_base(&sections[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
}
uint32_t val;
switch (type)
{
case R_AARCH64_CALL26:
/* The mask separates opcode (6 bits) and the immediate value */
uint32_t mask_bl = (0xffffffff << 26);
/* S+A-P, divided by 4 */
val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
/* Concatenate opcode and value to get final instruction */
*((uint32_t *)patch_offset) &= mask_bl;
val &= ~mask_bl;
*((uint32_t *)patch_offset) |= val;
break;
...
In the code we took the address of the first instruction, &jumptable[curr_jmp_idx].instr[0], and wrote it to symbol_address. Because the type is still R_AARCH64_CALL26, it will be encoded into the bl as a jump to a relative address, where the relative address is that of the first mov instruction. The whole jumptable code will then be executed, finishing with the br instruction.
The final run:
$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42
Executing say_hello...
my_puts executed
Hello, world!
The final code is here.
Summary
There are several things we covered in this blog post. We gave a brief introduction on how the binary got executed on Linux and how all components are linked together. We saw a big difference between x86 and aarch64 assembly. We learned how we can hook into the code and change its behavior. But just as it was said in the first blog post of this series, the most important thing is to remember to always think about security first. Processing external inputs should always be done with great care. Bounds and integrity checks have been omitted for the purposes of keeping the examples short, so readers should be aware that the code is not production ready and is designed for educational purposes only.