Hooking the CRT using the Import Address Table

Continuing the memory pool implementation I needed to replace memory allocation functions such as malloc, free, new, delete etc. But first we need to understand how it works internally.

CRT as a DLL

I am using the CRT as a dynamic library, using the static library would have made things a bit easier (see Google Chrome’s way of doing this) but we are stuck with dynamic so we still need a solution.

Let’s take a look at what happens at the assembly level when we call new :

call operator new (013F2B5C0Ch)
;
13F2B5C0C jmp qword ptr [__imp_operator new (013F729508h)]

Now at 013F2B5C0Ch we see this :

jmp qword ptr [013F729508h]

So as we can see, we go call at a fixed address and this fixed address is just a jmp to somewhere else, but why ?
It turns out we are using the CRT as a DLL so when the program is linked we don’t know where the function will be in memory, that’s why the compiler generates this jumper function. On load the program will load the DLLs in memory, lookup the functions it needs and finally write the correct jmp. Before we jump into the exploit a bit of terminology :
This mechanism is the Import Address Table, and the jumper we found is in the jump thunk table.

Exploiting

Now that we understand what happens and even found a jumper function it will be very simple for us to replace new with out own version of new !
All we got to do is replace the jmp with our own function and we should be good to go (also remember to save the old address as you might need it).

#define PtrFromRva( base, rva ) ( ( ( PBYTE ) base ) + rva )

PIMAGE_DOS_HEADER dosHeader = (PIMAGE_DOS_HEADER)__ImageBase;
PIMAGE_NT_HEADERS ntHeader = (PIMAGE_NT_HEADERS)PtrFromRva(dosHeader, dosHeader->e_lfanew);

PIMAGE_IMPORT_DESCRIPTOR ImportDescriptor = (PIMAGE_IMPORT_DESCRIPTOR) 
	PtrFromRva(dosHeader, ntHeader->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress);

This will give us the import descriptor list which contains everything we import from DLLs.
We will then iterate over every entry in the descriptor list and grab the DLL’s name :

for (unsigned int index = 0; ImportDescriptor[index].Characteristics != 0; ++index)
{
	const char* dllName = (const char*)PtrFromRva( dosHeader, ImportDescriptor[index].Name);
}

Assuming we found the DLL that contains the symbol we want to hook we would look for our function :

PIMAGE_THUNK_DATA thunk = (PIMAGE_THUNK_DATA)PtrFromRva(dosHeader, ImportDescriptor[index].FirstThunk);
PIMAGE_THUNK_DATA origThunk = (PIMAGE_THUNK_DATA)PtrFromRva(dosHeader, ImportDescriptor[index].OriginalFirstThunk);

for (;origThunk->u1.Function != NULL; origThunk++, thunk++)
{
	if (origThunk->u1.Ordinal & IMAGE_ORDINAL_FLAG)	continue;
	PIMAGE_IMPORT_BY_NAME import = (PIMAGE_IMPORT_BY_NAME)PtrFromRva(dosHeader, origThunk->u1.AddressOfData);
	import->name; // this gives us the symbol name
}

So what’s with “thunk” and “origThunk” ? They are both needed, origThunk gives us the import information we need like the name of the symbol and thunk is where the actual jmp is stored. So we are using origThunk to find the symbol and once we found it thunk will contain our jmp.

Now let’s assume we found our symbol, you will need to make thunk writable using VirtualProtect and all you have to do is access

thunk->u1.Function 

It will contain the address of the jump which you can save or not and finally replace with the address of your function.

That’s it ! Now you know how to hook imported symbols, all you need to do is find the DLL’s name and the symbol’s name, I trust you know how to use a dependency walker and will not have any trouble doing so.

Advertisements
Hooking the CRT using the Import Address Table

The ABA problem and large pointer CAS

I recently started to implement a memory pool at work to improve the overall performance of our game and stumbled upon the challenges of making it thread-safe and fast.

First let me give you some context, the memory pool allocates large blocks of memory then this memory is split into smaller blocks that are inserted in a simple linked list. Note that we reuse the memory in the blocks for the nodes. The allocation process is very simple :

void* allocate()
{
   // The node becomes memory
   void* returnPtr = (void*)block->head;
   block->head = block->head->next;
   return returnPtr;
}

Nothing fancy, and the deallocation is the following :

void deallocate(void* memory)
{
   // We reuse the memory to make our node
   Node* node = (Node*)memory;
   node->next = (void*)block->head;
   block->head = node;
}

Of course this is not the actual code, I removed all the checks because we don’t care about it being functional here.

Obvious solutions

While this works fine and is fast, it is not thread safe, so the naive approach would be to add a critical section around it so we would end up with something like :

void* allocate()
{
   block->lock();
   void* returnPtr = (void*)block->head;
   block->head = block->head->next;
   block->unlock();

   return returnPtr;
}

void deallocate(void* memory)
{
   Node* node = (Node*)memory;

   block->lock();
   node->next = (void*)block->head;
   block->head = node;
   block->unlock();
}

Are we satisfied with this ? Not really, locks are very expensive and we are writing our own allocator to beat the system’s allocator, what would be the point if we are barely faster and pay the price of a memory overhead for it.

So we now remember that atomics exist (I chose not to use std.atomics for reason we will dive into later). We use the compare and swap function “CAS” that behaves like so :

bool CAS(void** target, void* value, void* compare)
{
   if(*target == compare)
   {
      *target = value;
      return true;
   }
   return false
}

A call to CAS is atomic so the operation itself is thread-safe.

Now let’s modify our previous code to use this and remove the locks :

void* allocate()
{
   void* returnPtr;
   while(true)
   {
      Node* head = block->head;
      returnPtr = (void*)head;
      void* next = head->next;
      if(CAS(&block->head, next, head)) break;
   }
   return returnPtr;
}

void deallocate(void* memory)
{
   Node* node = (Node*)memory;
   while(true)
   {
      void* head = (void*)block->head;
      node->next = head;
      if(CAS(&block->head, node, head)) break;
   }
}

Well we got it right ? Everything should work fine, our writes are atomic, if our write fails we retry, there is no locking going on, it’s fast and thread-safe right ? Well bad news it’s not thread safe because of the ABA problem, for those of you who know you may skip the following example :

Let’s take two threads : T1 and T2 and the following memory layout head -> A -> B -> C


T1 enters allocate()
head = A
next = B
T1 gets preempted


T2 enters allocate()
head = A
next = B
block->head = B
return A


T2 enters allocate()
head = B
next = C
block->head = C
return B


T2 enters deallocate(A)
A->next = C
block->head = A

So we now have the memory layout head -> A -> C, but now T1 gets some compute time again and so the CAS will succeed because head is indeed A so we are fine we can swap !


T1 resumes
block->head = B
return A

We return A, A is valid the returned pointer is valid, but the internal memory is wrong we have head -> B ?.

B has not been returned to the pool yet, we are considering B to be free and we will be reading the content in *B the next time we want to allocate, this will obviously be catastrophic.

This is the ABA problem.

Solving this

There are two ways to solve this Hazard Pointers or tagged pointers, I find tagged pointers easier to work with so I will explain how to solve this problem with tagged pointers. So what is a tagged pointer ?

We consider the tuple <pointer, tag>, the tag is just a number that we increment every-time we change the value of the pointer. If the tag has an infinite maximum (it will not wrap around) it’s trivial to show that there is no way the ABA problem can occur, it gets tricky when it can wrap around because it is still possible that the tag wraps back to the value it was at when the thread resumes before the CAS. In fact if the tag can wrap we did not solve the ABA problem. It just becomes so unlikely that assume it is solved.

So now back to code, we know we need a way to tag the pointer and also make sure the tag is large enough to say “it will never happen”. We are also limited by the CAS operation that works with nothing larger than a pointer.

The first thing that comes to mind is to align the pointer and use the lower bits of the address as the tag, it will work with our CAS and retrieving the actual pointer is as simple as :

void* ptr = taggedPtr & ~(alignment - 1);

That’s great we have a very simple tagged pointer, now remember we also want the tag to be pretty large, so what do we do, pick a large value to align to ? Not very efficient if we need to align pointer to 1024, it means we cannot allocate anything smaller than 1KB, not acceptable. We are then stuck with small alignments say 8 – 16. Is it large enough to stop worrying about the ABA ? Nope.

Large CAS

We seem to have hit the limits of the tagged pointer method, but I have been keeping the existence of the DCAS operation from you. The DCAS operation “double compare and swap” works with memory elements that are twice the size of a pointer.

Why did I hide this ? Well it’s a very uncommon operation, in fact std.atomics doesn’t implement it (you now understand why we are not using it) and windows XP does not even provide an Interlocked function for it on 32bits systems because pre-pentium CPUs did not implement this instruction. And now we also deal with 64bits systems, we then need a 128bits CAS, lucky us most modern CPUs implement this operation except for some old AMD CPU (10 years old). But hey we are writing a memory pool for our game that pushes i7s to the limit, it’s not like there is any chance 10 year old CPUs stand a chance.

Also the 128bits CAS is required to run Windows 8.1, we can assume it’s safe to use it nowadays.

I don’t think I have a lot to explain now, the tagged pointer implementation should be straight forward now, using the DCAS operation. The only pointer that needs to be tagged is the “head” pointer, it’s the only hazardous pointer in here.

If you need to implement it yourself on Windows take a look at the intrinsics (NOT the regular functions) _InterlockedCompareExchange64 for 32bits applications, and _InterlockedCompareExchange128 for 64bits applications. On linux using GCC you just need __sync_bool_compare_and_swap and the __int128 type on 64bits applications.

If you would like to know more about how this implemented in assembly, the 64bits CAS instruction is CMPXCHG8B and the 128 bits instruction is CMPXCHG16B.

Let me know if you have any remark or questions in the comments !

The ABA problem and large pointer CAS