Visual Leak Detector Stack Overflow And Thread Local Storage

Table of Contents

1 Abstract

On Windows, by default Thread Local Storage (TLS) has 0x40 slots. Later it is expanded by 0x400 more slots. These expansion slots are created on demand when TLS APIs `TlsAlloc/TlsFree/TlsGetValue/TlsSetValue` are invoked in current thread. And the expansion is achieved by calling Windows API `RtlAllocateHeap`. When TLS APIs are invoked the first time with slot number >= 0x40 in a thread which hasn't expand before, TLS expansion will be triggered.

Visual Leak Detector (VLD) is a memory allocation tracing library, assisting troubleshooting memory leak issues by hooking memory allocation APIs, including `RtlAllocateHeap`.

When a memory allocation happens, VLD will store the information in its TLS slot to avoid contention and reduce performance impact. When VLD is assigned with a TLS slot >= 0x40 and a memory allocation happens in a thread that hasn't expand TLS, the access to TLS slot from VLD will trigger another TLS expansion.

However, the expansion will call `RtlAllocateHeap` to allocate memory and the record of this memory allocation will be saved into TLS slot by VLD, which will again trigger another TLS expansion… In this way, The program will enter infinite recursion.

2 Analysis

Today I meet a crash because of infinite recursion with call stacks switching between KernelBase.dll and VLD.dll. It appears to be somehow VLD enters an infinite recursion when calling `VisualLeakDetector::getTls()`. What's more interesting is despite the fact that this crash is repeatable, VLD has been enabled in our product for long.

Why it happens now?

The call stack looks like below:

...
6dc9 0759f314 09ddbfcc 00000042 0a16b870 09defbd0 KERNELBASE!TlsSetValue+0x4f
6dca 0759f360 09ddc1a3 09defbd0 00490000 0759f38c vld!VisualLeakDetector::getTls+0xfc [c:\build\vld\v24c\src\vld.cpp @ 1075]
6dcb 0759f370 09dda74b 0baa5930 00000008 0759f69c vld!VisualLeakDetector::enabled+0x23 [c:\build\vld\v24c\src\vld.cpp @ 982]
6dcc 0759f38c 75c445cf 00490000 00000008 00001000 vld!VisualLeakDetector::_HeapAlloc+0x2b [c:\build\vld\v24c\src\vld_hooks.cpp @ 1617]
6dcd 0759f3a8 09ddbfcc 00000042 0a16b870 09defbd0 KERNELBASE!TlsSetValue+0x4f
6dce 0759f3f4 09ddc1a3 09defbd0 05480fe8 0759f420 vld!VisualLeakDetector::getTls+0xfc [c:\build\vld\v24c\src\vld.cpp @ 1075]
6dcf 0759f404 09dda74b 05480fd0 0000004c 0759f45c vld!VisualLeakDetector::enabled+0x23 [c:\build\vld\v24c\src\vld.cpp @ 982]
6dd0 0759f420 7614ea43 00490000 00000000 00000018 vld!VisualLeakDetector::_HeapAlloc+0x2b [c:\build\vld\v24c\src\vld_hooks.cpp @ 1617]
6dd1 0759f434 7614ea5f 762466bc 00000018 0759f450 ole32!CRetailMalloc_Alloc+0x16 [d:\w7rtm\com\ole32\com\class\memapi.cxx @ 641]
...
6dec 0759fde0 77419882 0546b610 718c8e2a 00000000 kernel32!BaseThreadInitThunk+0xe
6ded 0759fe20 77419855 7612d854 0546b610 00000000 ntdll!__RtlUserThreadStart+0x70
6dee 0759fe38 00000000 7612d854 0546b610 00000000 ntdll!_RtlUserThreadStart+0x1b

When a thread starts and tries to allocate memory, VLD will trace all memory allocations. VLD records allocation information in a TLS slot to avoid contention and reduce performance impact. When VLD tries to initialize the TLS slot, Windows API `TlsSetValue` allocates memory, and VLD traces the allocation by saving it into TLS. This is what we can guess from the crash dump.

But why `TlsSetValue` will allocate memory? Let's take a look at the disassembled code: (uninteresting details are omited, indentations are added for better formating and understanding)

KERNELBASE!TlsSetValue:
// BOOL WINAPI TlsSetValue(_In_ DWORD dwTlsIndex, _In_opt_ LPVOID lpTlsValue)
...
75c44585 56              push    esi
75c44586 648b3518000000  mov     esi,dword ptr fs:[18h]
// ESI points to Thread Environment Block (TEB)
75c4458d 57              push    edi
75c4458e 8b7d08          mov     edi,dword ptr [ebp+8]
75c44591 83ff40          cmp     edi,40h
75c44594 7260            jb      KERNELBASE!TlsSetValue+0x76 (75c445f6)  Branch

// if (dwTlsIndex >= 0x40) {

  KERNELBASE!TlsSetValue+0x16:
  75c44596 83ef40          sub     edi,40h
  75c44599 81ff00040000    cmp     edi,400h
  75c4459f 734e            jae     KERNELBASE!TlsSetValue+0x6f (75c445ef)  Branch

  // if (dwTlsIndex < 0x440) {

    KERNELBASE!TlsSetValue+0x21:
    75c445a1 8b86940f0000    mov     eax,dword ptr [esi+0F94h]
    75c445a7 85c0            test    eax,eax
    75c445a9 753c            jne     KERNELBASE!TlsSetValue+0x67 (75c445e7)  Branch

    // if (pTeb->TlsExpansionSlots == NULL)

      KERNELBASE!TlsSetValue+0x2b:
      75c445ab e88126ffff      call    KERNELBASE!KernelBaseGetGlobalData (75c36c31)
      75c445b0 8b402c          mov     eax,dword ptr [eax+2Ch]
      75c445b3 648b0d18000000  mov     ecx,dword ptr fs:[18h]
      75c445ba 6800100000      push    1000h
      75c445bf 83c808          or      eax,8
      75c445c2 50              push    eax
      75c445c3 8b4130          mov     eax,dword ptr [ecx+30h]
      75c445c6 ff7018          push    dword ptr [eax+18h]
      75c445c9 ff151810c375    call    dword ptr [KERNELBASE!_imp__RtlAllocateHeap (75c31018)]
      75c445cf 85c0            test    eax,eax
      75c445d1 750e            jne     KERNELBASE!TlsSetValue+0x61 (75c445e1)  Branch
      ...
      KERNELBASE!TlsSetValue+0x61:
      75c445e1 8986940f0000    mov     dword ptr [esi+0F94h],eax

      // pTeb->TlsExpansionSlots = RtlAllocateHeap(...)

    KERNELBASE!TlsSetValue+0x67:
    75c445e7 8b4d0c          mov     ecx,dword ptr [ebp+0Ch]
    75c445ea 890cb8          mov     dword ptr [eax+edi*4],ecx
    75c445ed eb11            jmp     KERNELBASE!TlsSetValue+0x80 (75c44600)  Branch

    // Set value to TLS slot
  }  else { // ERROR if (dwTlsIndex >= 0x440) }
} // if (dwTlsIndex >= 0x40)

From the disassembled code we can find that the expansion occurs when the passed-in slot >= 0x40. This confirmed our guess. But since VLD will enter infinite recursion when it sees a TLS slot >= 0x40, why the stack overflow is never observed before?

This question has puzzled me for quite some time: the expansion is already done in `TlsAlloc` for all threads, why memory allocation is still needed? Until I notice that the expansion is only applied to a Thread Environment Block, (TEB, where `fs:[18h]` points to) which means the expansion only affect the current thread. All other threads remain intact.

3 Reproduce

After above analysis, we can try to reproduce the infinite recursion so that we can prove the conclusion is correct.

To increase TLS slot number, simply calling `TlsAlloc` in a loop will do the trick.

Another step required to reproduce is to make sure VLD gets assigned with a TLS slot >= 64. To do this, first increase TLS slot number, then load and enable VLD dynamically using `LoadLibrary` & `GetProcAddress`. After VLD is enabled, allocate memory in a new thread.

Here is the minimum reproduce code sample:

#include <windows.h>
#include <thread>
#include <chrono>
#include <memory>

int main(int argc, wchar_t *argv[]) {
  for (int i = 0; i < 0x40; ++i)
    TlsAlloc();

  HMODULE h_vld = LoadLibraryA("vld.dll");
  typedef void(*vld_enable_t)(void);
  auto vld_enable = (vld_enable_t)::GetProcAddress(h_vld, "VLDGlobalEnable");
  vld_enable();
  std::thread([]() {std::make_shared<int>(); }).join();
  return 0;
}

Use above code we can make a program crash of stack overflow, with call stack almost the same as what I get from our product crash dump. (Except for the first several lines which intializes CLR) Our theory about VLD and TLS expansion is correct.

4 Conclusion

How to explain why the crash is never met before?

VLD is linked to several DLLs of our product, and is statically initialized in these DLLs. The DLLs are loaded dynamically at runtime when they are used, or never if they are not touched at runtime. Maybe some recent change removed VLD from one of the DLLs which is always loaded before TLS allocation count reaches 0x40, so sometimes when VLD is loaded, there are chances that more than 0x40 TLS slots allocated already. That's why VLD gets assigned with TLS slot >= 0x40.

Also we used lots of COM (both STA/MTA) in our product, threads and memory allocation are not rare.

When these 2 conditions are met, infinite recursion happens.

5 References

Thanks to Ken Johnson, his post of Thread Local Storage, part 2: Explicit TLS 1 shed light when I am wondering why this issue still happens when TLS expansion is checked inside each API.

I submit an bug for VLD of this issue. 2

6 Acknowledgement

Thanks to my smart & beautiful scientist girlfriend Z for her patience and time to review this post and give insightful comments.

Footnotes:

1

Thread Local Storage, part 2: Explicit TLS by Ken Johnson, http://www.nynaeve.net/?p=181

HomeTop