Writing an x64 Inline Hook by Hand (Without Reaching for MinHook)

Most people’s first inline hook uses MinHook or Detours and calls it a day. That’s fine — they’re well-tested libraries and there’s no shame in using them. But if you’re writing kernel code, injecting into a process without loading extra DLLs, or just want to understand what’s actually happening under the abstraction, you need to know how to do this by hand.

This is the reference I wish existed when I first did it. No library dependencies, no magic. Just bytes.

What an inline hook is

An inline hook replaces the first few bytes of a target function with a jump to your code. Your code runs, does whatever it needs to, and then either calls the original function (via a trampoline that preserves the displaced bytes) or skips it entirely.

The simplest possible version in x86-32 was a 5-byte E9 xx xx xx xx — a near relative jump — because the entire address space fit in 32 bits and a ±2GB relative offset covered everything. On x64 this breaks immediately: modules load anywhere in the 64-bit address space, and the distance between them is routinely larger than 2GB.

You need a jump encoding that reaches any address.

Two encodings that actually work

5-byte near jump (E9+rel32) — works when target and hook are within ±2GB of each other. This is common when you’re hooking within the same module, or when you control the allocation address of your hook code (allocate it near the target).

nasm
E9 xx xx xx xx     ; JMP rel32 — 5 bytes, ±2GB reach

rel32 = destination - (source + 5) — the offset is relative to the next instruction, not the current one.
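As a minimal sketch of that formula, the helper below builds the five bytes for a jump placed at src that lands on dst, and refuses when the distance does not fit in a signed 32-bit offset. The name encode_jmp_rel32 is made up for this post, and a little-endian host is assumed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode "E9 rel32" for a jump located at src that
 * should land on dst. Returns false when dst is out of the +/-2GB range. */
bool encode_jmp_rel32(uint8_t out[5], uint64_t src, uint64_t dst)
{
    /* The offset is measured from the end of the 5-byte instruction. */
    int64_t delta = (int64_t)(dst - (src + 5));
    if (delta < INT32_MIN || delta > INT32_MAX)
        return false;

    int32_t rel32 = (int32_t)delta;
    out[0] = 0xE9;
    memcpy(out + 1, &rel32, sizeof(rel32)); /* little-endian host assumed */
    return true;
}
```

The range check is the useful part: when it fails, that is your signal to fall back to a longer encoding.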

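14-byte absolute jump (FF 25 00 00 00 00 + an inline 8-byte address) — works everywhere, no constraints on address distance. Strictly speaking there is no 64-bit immediate here: it is a 6-byte indirect jump that reads its destination from the 8 bytes placed right after it.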

nasm
FF 25 00 00 00 00        ; JMP QWORD PTR [RIP+0]   — jump indirect through the next 8 bytes
xx xx xx xx xx xx xx xx  ; absolute 64-bit address

The FF 25 00 00 00 00 says “jump to the address stored at RIP+0”, and RIP at that point points immediately at the following 8 bytes. So the CPU loads the 8-byte absolute address that immediately follows in memory and jumps to it. No relative offset arithmetic, no distance constraints.
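Building those 14 bytes is mechanical. Here is a small sketch; encode_jmp_abs64 and ABS_JMP_SIZE are names invented for illustration, and a little-endian host is assumed.

```c
#include <stdint.h>
#include <string.h>

#define ABS_JMP_SIZE 14

/* Hypothetical helper: write "JMP QWORD PTR [RIP+0]" followed by the
 * 8-byte absolute destination the CPU will load and jump to. */
void encode_jmp_abs64(uint8_t out[ABS_JMP_SIZE], uint64_t dst)
{
    static const uint8_t opcode[6] = { 0xFF, 0x25, 0x00, 0x00, 0x00, 0x00 };

    memcpy(out, opcode, sizeof(opcode));
    memcpy(out + 6, &dst, sizeof(dst)); /* little-endian host assumed */
}
```

Because the destination is stored as data rather than computed from the jump's own address, the same 14 bytes work no matter where you copy them.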

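Which to use: if you can guarantee proximity, 5 bytes is less disruptive because you displace fewer original bytes. Proximity holds when you hook within the same module, or when you allocate your hook stub near the target, for example by probing addresses around it with VirtualAlloc and an explicit base address, or by using VirtualAlloc2 with MEM_ADDRESS_REQUIREMENTS. MEM_TOP_DOWN does not do this; it only asks for the highest available address. If you can’t guarantee proximity, use 14 bytes and stop worrying about it.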

Installing the hook

The process:

Read and save the first N bytes of the target (N is at least 5 or 14 depending on your jump size; see “How many bytes to steal” below)
Make the target page writable
Write the jump bytes
Restore page protection
Flush instruction cache

c
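#include <windows.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// hook_size = number of whole-instruction bytes being displaced (>= 14)
bool hook_function(void *target, void *detour,
                   uint8_t *original_bytes_out, size_t hook_size)
{
    if (hook_size < 14)
        return false;

    // Save original bytes
    memcpy(original_bytes_out, target, hook_size);

    // Make page writable
    DWORD old_protect;
    if (!VirtualProtect(target, hook_size, PAGE_EXECUTE_READWRITE, &old_protect))
        return false;

    // Write 14-byte absolute jump
    uint8_t jmp[14] = {
        0xFF, 0x25, 0x00, 0x00, 0x00, 0x00,  // JMP [RIP+0]
        0, 0, 0, 0, 0, 0, 0, 0               // absolute address
    };
    uint64_t address = (uint64_t)(uintptr_t)detour;
    memcpy(jmp + 6, &address, sizeof(address));
    memcpy(target, jmp, sizeof(jmp));

    // Pad any leftover displaced bytes with NOPs. They are never executed
    // (the trampoline resumes at target + hook_size) but it keeps disassembly clean.
    memset((uint8_t *)target + sizeof(jmp), 0x90, hook_size - sizeof(jmp));

    // Restore protection and flush
    VirtualProtect(target, hook_size, old_protect, &old_protect);
    FlushInstructionCache(GetCurrentProcess(), target, hook_size);

    return true;
}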

In kernel code, VirtualProtect doesn’t exist. The equivalent is modifying the PTE directly (or using an MDL to get a writable mapping of the page), then writing. As for FlushInstructionCache: x86-64 has split L1 instruction and data caches, but the hardware keeps the instruction cache coherent with stores, so on this architecture the call does very little. Make it anyway. It is the documented contract for code you modify at runtime, it is cheap, and it keeps the same code correct on architectures such as ARM64 where the instruction cache is not coherent with data writes.

The trampoline: calling the original

If you want to call the original function after your hook runs, you can’t just jump to target — you’d re-enter the hook and loop forever. You need a trampoline: a buffer containing the displaced original bytes, followed by a jump back to target + hook_size.

text

Trampoline:
  [displaced original byte 0]
  [displaced original byte 1]
  ...
  [displaced original byte N-1]     ; copies of what was at target[0..N-1]
  FF 25 00 00 00 00                 ; JMP [RIP+0]
  xx xx xx xx xx xx xx xx           ; address of target+hook_size

When your detour calls the trampoline, it executes the original instructions, then jumps back into the middle of the original function, past the hook bytes. The original function continues normally.
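The layout above can be sketched as a function that fills a caller-supplied buffer. This is illustrative only: build_trampoline is a made-up name, the buffer is assumed to already live in executable memory, and the displaced bytes are assumed to be whole instructions that can be copied verbatim (no RIP-relative operands or relative branches).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. Writes:
 *   [n displaced bytes][FF 25 00 00 00 00][8-byte address of target + n]
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t build_trampoline(uint8_t *buf, size_t buf_size,
                        const uint8_t *displaced, size_t n,
                        uint64_t target_addr)
{
    static const uint8_t jmp_rip0[6] = { 0xFF, 0x25, 0x00, 0x00, 0x00, 0x00 };
    size_t total = n + sizeof(jmp_rip0) + sizeof(uint64_t);

    if (buf_size < total)
        return 0;

    /* Execution resumes right after the bytes the hook overwrote. */
    uint64_t resume = target_addr + n;

    memcpy(buf, displaced, n);
    memcpy(buf + n, jmp_rip0, sizeof(jmp_rip0));
    memcpy(buf + n + sizeof(jmp_rip0), &resume, sizeof(resume)); /* little-endian host assumed */
    return total;
}
```

Your detour then calls the start of buf through a function pointer with the original function's signature.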

The RIP-relative operand problem

x86-64 introduced RIP-relative addressing for data access. Instructions like:

nasm
mov rax, [rip+0x2000]   ; load from address (next_instruction + 0x2000)

When this instruction sits in the original function at address 0x140001000, it is 7 bytes long, so RIP during its execution is 0x140001007 and the effective address is 0x140001007 + 0x2000 = 0x140003007. But when you copy those bytes to your trampoline at address 0x20000000, RIP now points at a completely different location. The same instruction bytes produce a different effective address, and you’ll read from the wrong place or crash.

Any instruction using RIP-relative addressing that you displace into a trampoline must be fixed up. The general fix:

Decode the displaced instruction (or at least identify if it has a RIP-relative operand — [rip+disp32])
Compute the original effective address: original_ip + instruction_length + disp32
If the trampoline is within ±2GB of that address, recompute disp32 for the new location. If it isn’t (the usual case when the trampoline lives far from the module), no disp32 can reach it, so replace the instruction with an equivalent sequence that uses an absolute address
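The arithmetic in those steps is small enough to sketch. The helper names below are invented for this post. The ModRM test relies on the fact that, in 64-bit mode, mod == 00 with rm == 101 selects [RIP+disp32]. Finding the ModRM byte and the instruction length is assumed to be done elsewhere (by a disassembler or by hand).

```c
#include <stdbool.h>
#include <stdint.h>

/* ModRM with mod == 00 and rm == 101 means [RIP+disp32] in 64-bit mode. */
bool modrm_is_rip_relative(uint8_t modrm)
{
    return (modrm & 0xC7) == 0x05;
}

/* Absolute address the instruction refers to at its original location. */
uint64_t rip_rel_target(uint64_t insn_addr, uint8_t insn_len, int32_t disp32)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp32;
}

/* disp32 the same instruction would need at new_addr to reach abs_target.
 * Returns false when the distance no longer fits in 32 bits. */
bool relocate_disp32(uint64_t abs_target, uint64_t new_addr,
                     uint8_t insn_len, int32_t *disp32_out)
{
    int64_t delta = (int64_t)(abs_target - (new_addr + insn_len));
    if (delta < INT32_MIN || delta > INT32_MAX)
        return false;

    *disp32_out = (int32_t)delta;
    return true;
}
```

For the example above, rip_rel_target(0x140001000, 7, 0x2000) gives 0x140003007, and relocate_disp32 fails for a trampoline at 0x20000000 because the two addresses are more than 2GB apart.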

For a simple load like mov rax, [rip+disp32], the replacement is:

nasm
; Original (broken in trampoline):
; mov rax, [rip+0x2000]

; Replacement (works anywhere):
mov rax, 0x140003007      ; or whatever the absolute target address is
mov rax, [rax]
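If you are generating the trampoline at runtime, that replacement has to be emitted as bytes. A minimal sketch for this one case (destination register RAX only) follows; emit_abs_load_rax is a hypothetical name, and other registers or other instruction forms need different encodings and possibly a scratch register.

```c
#include <stdint.h>
#include <string.h>

#define ABS_LOAD_RAX_SIZE 13

/* Hypothetical helper: emit
 *   48 B8 <imm64>   mov rax, abs_addr
 *   48 8B 00        mov rax, [rax]
 * as a position-independent stand-in for "mov rax, [rip+disp32]". */
void emit_abs_load_rax(uint8_t out[ABS_LOAD_RAX_SIZE], uint64_t abs_addr)
{
    out[0] = 0x48;                                 /* REX.W */
    out[1] = 0xB8;                                 /* mov rax, imm64 */
    memcpy(out + 2, &abs_addr, sizeof(abs_addr));  /* little-endian host assumed */
    out[10] = 0x48;                                /* REX.W */
    out[11] = 0x8B;                                /* mov r64, r/m64 */
    out[12] = 0x00;                                /* ModRM: reg = rax, r/m = [rax] */
}
```

Note the size change: the original instruction was 7 bytes and the replacement is 13, so the trampoline grows, while the jump back still goes to the original target plus the number of bytes stolen from it.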

This is why hooking libraries like MinHook depend on a length-disassembler engine (like HDE or the Zydis library) — you need to know how long each displaced instruction is and whether it has a RIP-relative operand. If you’re writing this from scratch and don’t want a disassembler dependency, you can statically verify that the first N bytes of your specific target function don’t contain any RIP-relative instructions. This is brittle but works for controlled targets that don’t change across versions (or where you’re rebuilding the hook on each version anyway).

How many bytes to steal

The jump stub is 5 or 14 bytes. But you can’t just copy 5 or 14 bytes arbitrarily — you have to copy complete instructions. If a 5-byte jump would split an instruction in the middle, you need to copy more bytes to avoid executing a partial instruction.

The minimum: find the smallest N such that N ≥ jump_size and the first N bytes form a complete set of instructions. For a 14-byte jump on a typical function prologue:

nasm
; Common x64 prologue:
55               ; push rbp        (1 byte)
48 89 E5         ; mov rbp, rsp    (3 bytes)
48 83 EC 20      ; sub rsp, 0x20   (4 bytes)
; Total: 8 bytes — not enough. Next instruction:
48 8B 45 10      ; mov rax, [rbp+10h]  (4 bytes)
; Total: 12 bytes — still not enough. Next:
48 85 C0         ; test rax, rax   (3 bytes)
; Total: 15 bytes — 15 >= 14, stop here.

You’d copy 15 bytes into the trampoline, then jump back to target + 15. This is why a length-disassembler is genuinely useful — doing this by hand for arbitrary functions requires manually decoding instruction lengths, which is tedious and error-prone for the full x86-64 instruction set.
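Once you have instruction lengths from somewhere, the rule itself is a short loop. In this sketch the lengths are simply passed in as an array (in practice they would come from a length disassembler), and stolen_byte_count is a name made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: lens[i] is the length of the i-th instruction from
 * the start of the target function. Returns the smallest instruction-aligned
 * byte count >= jump_size, or 0 if the supplied instructions run out first. */
size_t stolen_byte_count(const uint8_t *lens, size_t count, size_t jump_size)
{
    size_t total = 0;

    for (size_t i = 0; i < count; i++) {
        total += lens[i];
        if (total >= jump_size)
            return total;   /* first instruction boundary past the jump */
    }
    return 0;               /* function too short, or not enough decoded */
}
```

With the prologue above, the lengths are {1, 3, 4, 4, 3}: a 14-byte jump needs 15 stolen bytes, while a 5-byte jump only needs 8.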

In the kernel

Kernel-mode hooking has three complications:

Write protection: Kernel text pages are typically non-writable. On a system without HVCI, you can clear the WP bit in CR0, write, and restore it. On an HVCI system, clearing WP doesn’t help: the hypervisor also enforces read-only code pages in the second-level page tables (EPT), so the write faults into the hypervisor, which refuses it (or BSODs you). This is why HVCI makes this kind of kernel hooking from inside the guest effectively impossible.

c
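// Clear WP (CR0 bit 16); only works without HVCI.
// CR0 is per-core, so keep interrupts off to stay on this core while WP is clear.
_disable();
unsigned __int64 cr0 = __readcr0();
__writecr0(cr0 & ~0x10000ULL);
memcpy(target, hook_bytes, hook_size);
__writecr0(cr0);
_enable();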

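SMEP/SMAP: With SMEP, the kernel can’t execute user-mode pages, and where SMAP is enabled it can’t access user-mode memory without stac/clac. Your trampoline must be in kernel address space. Allocate it from executable non-paged pool, for example ExAllocatePoolWithTag(NonPagedPoolExecute, ...), or ExAllocatePool2 with POOL_FLAG_NON_PAGED_EXECUTE on Windows 10 version 2004 and later, where the older allocators are deprecated.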

SMP coherence: x86 caches are coherent across cores, so the problem isn’t a stale instruction cache; it’s cross-modifying code. A 14-byte patch is not written atomically, and if another core is executing the target function at the moment you’re patching it, that core can fetch a mix of old and new bytes, or resume in the middle of your jump. The proper solution involves stopping all other CPUs (via KeIpiGenericCall or an NMI rendezvous), patching, and resuming. Most hooking frameworks don’t bother with this and get away with it because the window is tiny, but it’s technically wrong.

Detection

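From a defensive perspective, inline hooks are detectable by comparing the live bytes at a function’s address against the expected bytes from the file on disk (after accounting for relocations). HVCI takes a different route: rather than comparing bytes, it stops kernel code pages from being modified in the first place by enforcing page permissions at the hypervisor level. Periodic integrity checking of this kind is what Kernel Patch Protection (PatchGuard) does for kernel code, and what any decent EDR does for high-value targets like NTDLL exports and kernel routines.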
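A byte-comparison check can be sketched in a few lines. This is a simplified illustration, not how any particular product does it: classify_prologue and the hook_kind enum are invented names, and the expected bytes are assumed to have been read from the on-disk image and adjusted for relocations already.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    HOOK_NONE,       /* live bytes match the expected bytes */
    HOOK_JMP_REL32,  /* differs and starts with E9 rel32 */
    HOOK_JMP_ABS64,  /* differs and starts with FF 25 00 00 00 00 */
    HOOK_MODIFIED    /* differs in some other way */
} hook_kind;

/* Hypothetical helper: compare the first n bytes of a function as seen in
 * memory against the bytes expected from the file on disk. */
hook_kind classify_prologue(const uint8_t *live, const uint8_t *expected, size_t n)
{
    static const uint8_t jmp_rip0[6] = { 0xFF, 0x25, 0x00, 0x00, 0x00, 0x00 };

    if (memcmp(live, expected, n) == 0)
        return HOOK_NONE;
    if (n >= 5 && live[0] == 0xE9)
        return HOOK_JMP_REL32;
    if (n >= 14 && memcmp(live, jmp_rip0, sizeof(jmp_rip0)) == 0)
        return HOOK_JMP_ABS64;
    return HOOK_MODIFIED;
}
```

The mismatch alone is the signal; classifying the jump form just makes the report more useful, since the next step is usually to follow the jump and see which module it lands in.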

The evasion approach that survives this check is EPT shadow page hooks — the kernel reads original bytes, the CPU executes modified bytes. That’s covered in the EPT post and requires operating below the kernel rather than inside it.

For anything running in the guest, inline hooks are detectable given sufficient monitoring capability. The question is always whether your target’s monitoring is actually doing it.