Common Design Notes
Design notes common to all hooking strategies.
Wrappers
Wrapper
Wrappers are stubs which convert from the calling convention of the original function to your calling convention.
If the calling convention of the hooked function and your function matches, this wrapper is simply just 1 jmp
instruction.
Wrappers are documented in their own page here.
ReverseWrapper
Stub which converts from your code's calling convention to original function's calling convention
This is basically Wrapper with source
and destination
swapped around
Hook Memory Layouts & Thread Safety
Hooks in reloaded-hooks-rs
are structured in a very specific way to ensure thread safety.
They sacrifice a bit of memory usage in favour of performance + thread safety.
Most hooks, regardless of type have a memory layout that looks something like this:
// Size: 2 registers
pub struct Hook
{
/// The address of the stub containing bridging code
/// between your code and custom code. This is the address
/// of the code that will actually be executed at runtime.
stub_address: usize,
/// Address of the 'properties' structure, containing
/// the necessary info to manipulate the data at stub_address
props: NonNull<StubPackedProps>,
}
Notably, there are two heap allocations. One at stub_address
, which contains the executable code,
and one at props
, which contains packed info of the stub at stub_address
.
The hooks use a 'swapping' system. Both stub_address
and props
contains swap space
. When you
enable or disable a hook, the data in the two 'swap spaces' are swapped around.
In other words, when stub_address
' 'swap space' contains the code for HookFunction
(hook enabled),
the 'swap space' at props
' contains the code for Original Code
.
Thread safety is ensured by making writes within the stub itself atomic, as well as making the emplacing of the jump to the stub in the original application code atomic.
Stub Layout
The memory region containing the actual executed code.
The stub has two possible layouts, if the Swap Space
is small enough such that it can be atomically
overwritten, it will look like this:
- 'Swap Space' [HookCode / OriginalCode]
<pad to atomic register size>
Otherwise, if Swap Space
cannot be atomically overwritten, it will look like:
- 'Swap Space' [HookCode / OriginalCode]
- HookCode
- OriginalCode
Some hooks may store, extra data after OriginalCode
.
For example, if calling convention conversion is needed, the HookCode
becomes a
ReverseWrapper, and the stub will also contain a Wrapper.
If calling convention conversion is needed, the layout looks like this:
- 'Swap Space' [ReverseWrapper / OriginalCode]
- ReverseWrapper
- OriginalCode
- Wrapper
Example (When Atomically Overwriteable)
Using ARM64 Assembly Hook as an example.
If the 'OriginalCode' was:
mov x0, x1
add x0, x2
And the 'HookCode' was:
add x1, x1
mov x0, x2
Since the size of the swap space is less than 16 bytes (assuming 4 byte instructions), the memory would look like this when the hook is enabled:
swap: ; Currently Applied (Hook)
add x1, x1
mov x0, x2
b back_to_code ; 12 bytes total
Example (Not Atomically Overwriteable)
Now, let's consider an example where the swap space is larger than the amount of bytes that can be atomically written (over 16 bytes, in this ARM64 case)
If the 'OriginalCode' was:
mov x0, x1
add x0, x2
sub x0, x3
mul x0, x4
add x0, x5
And the 'HookCode' was:
add x1, x1
mov x0, x2
sub x1, x3
mul x1, x4
add x1, x5
The memory would look like this when the hook is enabled:
swap: ; Currently Applied (Hook)
add x1, x1
mov x0, x2
sub x1, x3
mul x1, x4
add x1, x5
b back_to_code ; 24 bytes total
hook: ; HookCode
add x1, x1
mov x0, x2
sub x1, x3
mul x1, x4
add x1, x5
b back_to_code
original: ; OriginalCode
mov x0, x1
add x0, x2
sub x0, x3
mul x0, x4
add x0, x5
b back_to_code
Therefore, the hook
and original
code are stored separately. When the hook is being enabled/disabled the
swap
space will contain a temporary branch to either the hook
or original
before being overwritten.
(To support atomic hook/unhook)
Heap (Props) Layout
Each Assembly Hook contains a pointer to the heap stub (seen above) and a pointer to the heap.
The heap contains all information required to perform operations on the stub.
- StubPackedProps
- Enabled Flag
- IsSwapOnly
- SwapSize
- HookSize
- [Hook Function / Original Code]
The data in the heap contains a short `StubPackedProps`` struct, detailing the data stored over in the stub.
The SwapSize
contains the length of the 'swap' info (and also consequently, offset of HookCode
).
The HookSize
contains the length of the 'hook' instructions (and consequently, offset of OriginalCode
).
If the IsSwapOnly
flag is set, then this data is to be atomically overwritten.
The 'Enable' / 'Disable' Process
When transitioning between Enabled/Disabled state, we place a temporary branch at entry
, this allows us to manipulate the remaining code safely.
Using ARM64 Assembly Hook as an example.
We start the 'disable' process with a temporary branch:
entry: ; Currently Applied (Hook)
b original ; Temp branch to original
mov x0, x2
b back_to_code
hook: ; Backup (Hook)
add x1, x1
mov x0, x2
b back_to_code
original: ; Backup (Original)
mov x0, x1
add x0, x2
b back_to_code
Don't forget to clear instruction cache on non-x86 architectures which need it.
This ensures we can safely overwrite the remaining code...
Then we overwrite entry
code with hook
code, except the branch:
entry: ; Currently Applied (Hook)
b original ; Branch to original
add x0, x2 ; overwritten with 'original' code.
b back_to_code ; overwritten with 'original' code.
hook: ; Backup (Hook)
add x1, x1
mov x0, x2
b back_to_code
original: ; Backup (Original)
mov x0, x1
add x0, x2
b back_to_code
And lastly, overwrite the branch.
To do this, read the original sizeof(nint)
bytes at entry
, replace branch bytes with original bytes
and do an atomic write. This way, the remaining instruction is safely replaced.
entry: ; Currently Applied (Hook)
add x1, x1 ; 'original' code.
add x0, x2 ; 'original' code.
b back_to_code ; 'original' code.
original: ; Backup (Original)
mov x0, x1
add x0, x2
b back_to_code
hook: ; Backup (Hook)
add x1, x1
mov x0, x2
b back_to_code
This way we achieve zero overhead CPU-wise, at expense of some memory.
Limits
Stub info is packed by default to save on memory space. By default, the following limits apply:
Property | 4 Byte Instruction (e.g. ARM64) | Other (e.g. x86) |
---|---|---|
Max Orig Code Length | 128KiB | 32KiB |
Max Hook Code Length | 128KiB | 32KiB |
These limits may increase in the future if additional required functionality warrants extending metadata length.
Thread Safety on x86
Thread safety is 'theoretically' not guaranteed for every possible x86 processor, however is satisfied for all modern CPUs.
The information below is x86 specific but applies to all architectures with a non-fixed instruction size. Architectures with fixed instruction sizes (e.g. ARM) are thread safe in this library by default.
The Theory
If the
jmp
instruction emplaced when switching state overwrites what originally were multiple instructions, it is theoretically possible that the placing thejmp
will make the instruction about to be executed invalid.
For example if the previous instruction sequence was:
0x0: push ebp
0x1: mov ebp, esp ; 2 bytes
And inserting a jmp produces:
0x0: jmp disabled ; 2 bytes
It's possible that the CPU's Instruction Pointer was at 0x1
at the time of the overwrite, making the
mov ebp, esp
instruction invalid.
What Happens in Practice
In practice, modern x86 CPUs (1990 onwards) from Intel, AMD and VIA prefetch instruction in batches of 16 bytes.
And in the recent years, this has been increased to 32 bytes.
We place our stubs generated by the various hooks on 32-byte boundaries for this (and optimisation) reasons.
So, by the time we change the code, the CPU has already prefetched the instructions we are atomically overwriting.
In other words, it is simply not possible to perfectly time a write such that a thread at
Instruction Pointer 0x1
(mov ebp, esp
) [as in example above] would read an invalid instruction.
Because that instruction was prefetched and is being executed from local thread cache.
What is Safe
Here is a thread safety table for x86, taking the above into account:
Safe? | Hook | Notes |
---|---|---|
✅ | Function | Functions start on multiples of 16 on pretty much all compilers, per Intel Optimisation Guide. |
✅ | Branch | Stubs are 16 aligned. |
✅ | Assembly | Stubs are 16 aligned. |
✅ | VTable | VTable entries are usize aligned, and don't cross cache boundaries. |
Hook Length Mismatch Problem
When a hook is already present, and you wish to stack that hook over the existing hook, certain problems might arise.
When your hook is shorter than original.
This is notably an issue when a hook entry composes of more than 1 instruction; i.e. on RISC architectures.
There is a potential register allocation caveat in this scenario.
Pretend you have the following ARM64 function:
ADD x1, #5
ADD x2, #10
ADD x0, x1, x2
ADD x0, x0, x0
RET
x1 = x1 + 5;
x2 = x2 + 10;
int x0 = x1 + x2;
x0 = x0 + x0;
return x0;
And then, a large hook using an absolute jump with register is applied:
# Original instructions here replaced
MOVZ x0, A
MOVK x0, B, LSL #16
MOVK x0, C, LSL #32
MOVK x0, D, LSL #48
B x0
# <= branch returns here
If you then try to apply a smaller hook after applying the large hook, you might run into the following situation:
# The 3 instructions here are an absolute jump using pointer.
adrp x9, [0]
ldr x9, [x9, 0x200]
br x9
# Call to original function returns here, back to then branch to previous hook
MOVK x0, D, LSL #48
B x0
This is problematic, with respect to register allocation. Absolute jumps on some RISC platforms like ARM will always require the use of a scratch register.
But there is a risk the scratch register used is the same register (x0
) as the register used by the
previous hook as the scratch register. In which case, the jump target becomes invalid.
Resolution Strategy
- Prefer absolute jumps without scratch registers (if possible).
- Detect
mov
+branch
combinations for each target architecture.- And extend the function's stolen bytes to cover the entirety.
- This avoids the scratch register duplication issue, as original hook code will branch to its own code before we end up using the same scratch register.
When your hook is longer than original.
Only applies to architectures with variable length instructions. (x86)
Some hooking libraries don't clean up remaining stolen bytes after installing a hook.
Very notably Steam does this for rendering (overlay) and input (controller support).
Consider the original function having the following instructions:
48 8B C4 mov rax, rsp
48 89 58 08 mov [rax + 08], rbx
After Steam hooks, it will leave the function like this
E9 XX XX XX XX jmp 'somewhere'
58 08 <invalid instruction. leftover from state before>
If you're not able to install a relative hook, e.g. need to use an absolute jump
FF 25 XX XX XX XX jmp ['addr']
The invalid instructions will now become part of the 'stolen' bytes, when you call the original; and invalid instructions may be executed.
Resolution Strategy
This library must do the following:
- Prefer shorter hooks (
relative jump
overabsolute jump
) when possible. - Leave nop(s) after placing any branches, to avoid leaving invalid instructions.
- Don't contribute to the problem.
There unfortunately isn't much we can do to detect invalid instructions generated by other hooking libraries reliably, best we can do is try to avoid it by using shorter hooks. Thankfully this is not a common issue given most people use the 'popular' libraries.
Fallback Strategies
Return Address Patching
This feature will not be ported over from legacy Reloaded.Hooks
, until an edge case is found that requires this.
This section explains how Reloaded handles an edge case within an already super rare case.
This topic is a bit more complex, so we will use x86 as example here.
For any of this to be necessary, the following conditions must be true:
- An existing relative jump hook exists.
- Reloaded can't find free memory within relative jump range.
- The existing hook was somehow able to find free memory in this range, but we can't... (<= main reason this is improbable!!)
- Free Space from Function Alignment Strategy fails.
- The instructions at beginning of the hooked function happened to just perfectly align such that our hook jump is longer than the existing one.
The low probability of this happening, at least on Windows and/or Linux is rather insane. It cannot be estimated, but if I were to have a guess, maybe 1 in 1 billion. You'd be more likely to die from a shark attack.
In any case, when this happens, Reloaded performs return address patching.
Suppose a foreign hooking library hooks a function with the following prologue:
55 push ebp
89 e5 mov ebp, esp
00 00 add [eax], al
83 ec 20 sub esp, 32
...
After hooking, this code would look like:
E9 XX XX XX XX jmp 'somewhere'
<= existing hook jumps back here when calling original (this) function
83 ec 20 sub esp, 32
...
When the prologue is set up 'just right', such that the existing instrucions divide perfectly
into 5 bytes, and we need to insert a 6 byte absolute jmp FF 25
, Reloaded must patch the return address.
Reloaded has a built in patcher for this super rare scenario, which detects and attempts to patch return addresses of the following patterns:
Where nop* represents 0 or more nops.
1. Relative immediate jumps.
nop*
jmp 0x123456
nop*
2. Push + Return
nop*
push 0x612403
ret
nop*
3. RIP Relative Addressing (X64)
nop*
JMP [RIP+0]
nop*
This patching mechanism is rather complicated, relies on disassembling code at runtime and thus won't be explained here.
Different hooking libraries use different logic for storing callbacks. In some cases alignment of code (or rather lack thereof) can also make this operation unreliable, since we rely on disassembling the code at runtime to find jumps back to end of hook. The success rate of this operation is NOT 100%
Requirements for External Libraries to Interoperate
While I haven't studied the source code of other hooking libraries before, I've had no issues in the past with the common Detours and minhook libraries that are commonly used
Hooking Over Reloaded Hooks
Libraries which can safely interoperate (stack hooks ontop) of Reloaded Hooks Hooks' must satisfy the following.
-
Must be able to patch (re-adjust) relative jumps.
- In some cases when assembling call to original function, relative jump target may be out of range, compatible hooking software must handle this edge case.
-
Must be able to automatically determine number of bytes to steal from original function.
- This makes it possible to interoperate with the rare times we do a absolute jump when it may not be possible to do a relative jump (i.e.) as we cannot allocate memory in close enough proximity.
Reloaded Hooks hooking over Existing Hooks
See: Code Relocation