Windows is overcommitting stacks

Discussion:

Bonita Montero

2020-04-14 07:08:49 UTC

I've written a little test-program that shows that Windows
has overcommitting, but only for stacks:

#include <Windows.h>
#include <Psapi.h>
#include <cstdio>
#include <iostream>
#include <vector>
#include <thread>
#include <malloc.h>

using namespace std;

#pragma warning(disable: 6255) // alloca might stack overflow
#pragma warning(disable: 6031) // return value ignored

int main()
{
    // prints the private commit charge of the process in MB, followed by str
    auto printPrivateUsage = [&]( char const *str )
    {
        PROCESS_MEMORY_COUNTERS_EX pmcx;
        if( !GetProcessMemoryInfo( GetCurrentProcess(),
                (PROCESS_MEMORY_COUNTERS *)&pmcx, sizeof pmcx ) )
            return;
        cout << (pmcx.PrivateUsage / 1024 / 1024) << str << endl;
    };
    printPrivateUsage( " MB before 512MB allocated" );
    cout << "press return to allocate 512MB" << endl;
    getchar();
    // committed but never touched: shows up in the private usage immediately
    VirtualAlloc( nullptr, 512 * 1024 * 1024, MEM_RESERVE | MEM_COMMIT,
        PAGE_READWRITE );
    printPrivateUsage( " MB after memory allocated (pages not touched !) and before thread created" );
    cout << "press return to create threads" << endl;
    getchar();
    unsigned const N_THREADS = 1000;
    HANDLE hSemTouch   = CreateSemaphore( NULL, 0, N_THREADS, NULL ),
           hSemAllocd  = CreateSemaphore( NULL, 0, N_THREADS, NULL ),
           hSemGoodbye = CreateSemaphore( NULL, 0, N_THREADS, NULL );
    auto thr = [hSemTouch, hSemAllocd, hSemGoodbye]()
    {
        char a[1];
        WaitForSingleObject( hSemTouch, INFINITE );
        __try
        {
            // read downwards through the stack until the reservation is exhausted
            for( char volatile *p = a; ; *p-- );
        }
        __except( EXCEPTION_EXECUTE_HANDLER )
        {
        }
        ReleaseSemaphore( hSemAllocd, 1, NULL );
        WaitForSingleObject( hSemGoodbye, INFINITE );
    };
    vector<thread> vt;
    for( unsigned t = N_THREADS; t; --t )
        vt.emplace_back( thr );
    printPrivateUsage( " MB after thread created" );
    cout << N_THREADS << " threads created; press return to allocate in thread-stacks" << endl;
    getchar();
    ReleaseSemaphore( hSemTouch, N_THREADS, NULL );
    for( unsigned t = N_THREADS; t; --t )
        WaitForSingleObject( hSemAllocd, INFINITE );
    printPrivateUsage( " MB after threads have allocated" );
    cout << "stacks read; press return to terminate threads" << endl;
    getchar();
    ReleaseSemaphore( hSemGoodbye, N_THREADS, NULL );
    for( thread &t : vt )
        t.join();
}

Kaz Kylheku

2020-04-14 17:30:11 UTC

Post by Bonita Montero
I've written a little test-program that shows that Windows

That makes sense, since threads often don't use anywhere near
the full available stack size; it would be a waste of RAM
to commit all the reserved VM.

Bonita Montero

2020-04-15 05:40:53 UTC

Post by Kaz Kylheku

Post by Bonita Montero
I've written a little test-program that shows that Windows

That makes sense, since threads often don't use anywhere near
the full available stack size; it would be a waste of RAM
to commit all the reserved VM.

Overcommitting is a dirty technique. Windows could subtract the size
of the stacks from the available paging-file size to satisfy the memory
needs by swapping if there is not enough memory.

Kaz Kylheku

2020-04-15 18:56:11 UTC

Post by Bonita Montero

Post by Kaz Kylheku

Post by Bonita Montero
I've written a little test-program that shows that Windows

That makes sense, since threads often don't use anywhere near
the full available stack size; it would be a waste of RAM
to commit all the reserved VM.

Overcommitting is a dirty technique. Windows could subtract the size
of the stacks from the available paging-file size to satisfy the memory
needs by swapping if there is not enough memory.

That's a fair statement, but note that since each thread has a stack,
if the system has thousands of threads, that's a lot of virtual memory to
commit. Many gigabytes may need to be added to the swap file.

The MSDN documentation seems to be contradicting your finding, or
rather augmenting it with detailed information. It confirms that
overcommitting /is/ going on, but in fact there is a mixture of
commit and overcommit:

https://docs.microsoft.com/en-us/windows/win32/procthread/thread-stack-size

"Each new thread or fiber receives its own stack space consisting of
both reserved and initially committed memory. The reserved memory size
represents the total stack allocation in virtual memory. As such, the
reserved size is limited to the virtual address range. The initially
committed pages do not utilize physical memory until they are
referenced; however, they do remove pages from the system total commit
limit, which is the size of the page file plus the size of the physical
memory."

[ ... ]

"The default size for the reserved and initially committed stack memory
is specified in the executable file header. Thread or fiber creation
fails if there is not enough memory to reserve or commit the number of
bytes requested. The default stack reservation size used by the linker
is 1 MB. To specify a different default stack reservation size for all
threads and fibers, use the STACKSIZE statement in the module definition
(.def) file. The operating system rounds up the specified size to the
nearest multiple of the system's allocation granularity (typically 64
KB). To retrieve the allocation granularity of the current system, use
the GetSystemInfo function."

"To change the initially committed stack space, use the dwStackSize
parameter of the CreateThread, CreateRemoteThread, or CreateFiber
function."

OK, so

1. By default, 1M is reserved, but not committed.

2. The bytes specified in CreateThread are rounded up to a page
and committed, else the thread won't start.

3. The wording in the documentation is too borked to explain
what is the default commit size if CreateThread specifies
zero.

4. Since the default reserved stack size is 1M, if
we have 3000 threads in the system and they all have this
default size, we would need 3G of virtual memory to commit
them all; it's not really such a hot idea. Applications should use
the thread-creation parameters to commit what they need,
and possibly configure the defaults in the program header.
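
The arithmetic behind points 2 and 4 can be sketched roughly as follows. This is only an illustration: the 64 KB granularity and the 1 MB default reserve are the figures from the quoted documentation, and the names roundUpToGranularity and fullCommitBytes are made up for this example.

```cpp
#include <cstdint>

// Assumed values, taken from the quoted documentation.
constexpr std::uint64_t kAllocationGranularity = 64 * 1024;   // "typically 64 KB"
constexpr std::uint64_t kDefaultReserve        = 1024 * 1024; // linker default, 1 MB

// Rounds a requested size up to the next multiple of the granularity.
constexpr std::uint64_t roundUpToGranularity(
    std::uint64_t bytes,
    std::uint64_t granularity = kAllocationGranularity)
{
    return (bytes + granularity - 1) / granularity * granularity;
}

// Commit charge if every thread's whole stack reservation were committed
// up front instead of only its initial pages (hypothetical policy).
constexpr std::uint64_t fullCommitBytes(
    std::uint64_t threads,
    std::uint64_t reservePerThread = kDefaultReserve)
{
    return threads * roundUpToGranularity(reservePerThread);
}
```

With these assumptions, fullCommitBytes(3000) comes to 3000 MB, which is the "3G" figure from point 4.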

--
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1

Bonita Montero

2020-04-16 18:26:52 UTC

Post by Kaz Kylheku
That's a fair statement, but note that since each thread has a stack,
if the system has thousands of threads, that's a lot of virtual memory
to commit. Many gigabytes may need to be added to the swap file.

A swapfile can be assigned to a slow mechanical hard drive, which costs
almost nothing.

Post by Kaz Kylheku
2. The bytes specified in CreateThread are rounded up to a page
and committed, else the thread won't start.

Yes, but the rest is overcommitted.

Post by Kaz Kylheku
3. The wording in the documentation is too borked to explain
what is the default commit size if CreateThread specifies
zero.

Test it by getting the stack limits and trying to write beyond where
you guess the guard-page is.
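
A variation on this test that avoids faulting on purpose is to ask VirtualQuery how the stack reservation is currently split up. Note that this is a different technique from writing below the guessed guard-page, and only a sketch: StackRegion and queryCurrentStack are made-up names, GetCurrentThreadStackLimits needs Windows 8 or later, and on non-Windows builds the function compiles to a stub that returns an empty list.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#ifdef _WIN32
#include <Windows.h>
#endif

// One contiguous run of stack pages sharing the same state and protection.
struct StackRegion
{
    std::uintptr_t base;      // lowest address of the run
    std::size_t    size;      // length in bytes
    bool           committed; // MEM_COMMIT (otherwise only reserved)
    bool           guard;     // committed with PAGE_GUARD set
};

// Walks the calling thread's stack reservation from its lowest address up
// to the stack top and records every region VirtualQuery reports.
std::vector<StackRegion> queryCurrentStack()
{
    std::vector<StackRegion> regions;
#ifdef _WIN32
    ULONG_PTR low = 0, high = 0;
    GetCurrentThreadStackLimits( &low, &high );
    for( ULONG_PTR addr = low; addr < high; )
    {
        MEMORY_BASIC_INFORMATION mbi;
        if( VirtualQuery( reinterpret_cast<LPCVOID>(addr), &mbi, sizeof mbi ) == 0 )
            break; // query failed, stop walking
        bool committed = mbi.State == MEM_COMMIT;
        regions.push_back( {
            reinterpret_cast<std::uintptr_t>(mbi.BaseAddress),
            mbi.RegionSize,
            committed,
            committed && (mbi.Protect & PAGE_GUARD) != 0 } );
        addr = reinterpret_cast<ULONG_PTR>(mbi.BaseAddress) + mbi.RegionSize;
    }
#endif
    return regions;
}
```

Called first thing in a freshly started thread, the committed non-guard regions should presumably add up to the initial commit, with the guard region directly below them.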

Post by Kaz Kylheku
4. Since the default reserved stack size is 1M, if
we have 3000 threads in the system and they all have this
default size, we would need 3G of virtual memory to commit
them all; it's not really such a hot idea.

My idea was to subtract this memory from the paging-file.

Post by Kaz Kylheku
them all; it's not really such a hot idea. Applications should
use the thread-creation parameters to commit what they need,
and possibly configure the defaults in the program header.

In many languages you can't specify the stack commit.
A std::thread in C++, for example, gets the defaults.
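
For comparison, a minimal sketch of what the Win32 API allows and std::thread does not: passing an explicit initial commit size to CreateThread. The names worker and runWithStackCommit are made up for this example, and on non-Windows builds the function is only a stub that returns false.

```cpp
#include <cstddef>

#ifdef _WIN32
#include <Windows.h>

// Thread entry point in the shape CreateThread expects.
static DWORD WINAPI worker( LPVOID param )
{
    *static_cast<int *>(param) = 42; // stand-in for real work
    return 0;
}
#endif

// Starts one thread with an explicit initial stack commit and waits for it.
// Returns true if the thread ran; always false on non-Windows builds.
bool runWithStackCommit( std::size_t commitBytes )
{
#ifdef _WIN32
    int result = 0;
    // With dwCreationFlags == 0, dwStackSize is the initial commit size;
    // the reserve size still comes from the executable header.
    HANDLE h = CreateThread( nullptr, commitBytes, worker, &result, 0, nullptr );
    if( h == nullptr )
        return false; // e.g. not enough commit available
    WaitForSingleObject( h, INFINITE );
    CloseHandle( h );
    return result == 42;
#else
    (void)commitBytes;
    return false;
#endif
}
```

Passing STACK_SIZE_PARAM_IS_A_RESERVATION in dwCreationFlags would make the same parameter set the reserve size instead of the commit size.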

Bonita Montero

2020-04-16 19:22:22 UTC

Bonita Montero

2020-04-16 19:23:22 UTC

No, three pages; the guard-page is hit at -0x4000 below the
stack-top i.e. the last initialized page is at -0x3000.

Bonita Montero

2020-04-23 16:48:02 UTC

Post by Bonita Montero
No, three pages; the guard-page is hit at -0x4000 below the
stack-top i.e. the last initialized page is at -0x3000.

There's another interesting issue: when you call alloca() or have
a larger array on the stack, the generated code calls the function
__chkstk(). It touches the pages downwards to the new stack pointer
to shift the guard-page downwards. But in 64-bit mode there's an
element in the thread-control-block, addressed at gs:[010h], which
tells __chkstk where the current stack bottom is. So __chkstk
compares the new stack pointer with this value and touches only those
pages below that "high water mark". This pointer is updated by the
kernel when it shifts the guard-page downwards.
What's interesting here is that this pointer is at the beginning
of the first page when the thread starts. But when I scan the stack
from the bottommost page upwards to the guard-page, the guard-page
is actually hit at the fifth page. So the gs: pointer doesn't seem
to be "physically" correct.