* Ulrich Drepper <drepper@redhat.com> wrote:

not sure exactly what numbers you mean, but there are lots of numbers in
the first mail, attached below. For example:

| As example, in one case creating new threads goes from about 35,000
| cycles up to about 25,000,000 cycles -- which is under 100 threads per
| second. Larger stacks reduce the severity of slowdown but also make

being able to create only 100 threads per second brings us back to 33
MHz 386 DX Linux performance.

	Ingo

---------------------->

mmap() is slow on MAP_32BIT allocation failure, sometimes causing
NPTL's pthread_create() to run about three orders of magnitude slower.
As example, in one case creating new threads goes from about 35,000
cycles up to about 25,000,000 cycles -- which is under 100 threads per
second. Larger stacks reduce the severity of slowdown but also make
slowdown happen after allocating a few thousand threads. Costs vary
with platform, stack size, etc., but thread allocation rates drop
suddenly on all of a half-dozen platforms I tried.

The cause is NPTL allocates stacks with code of the form (e.g., glibc
2.7 nptl/allocatestack.c):

  sto = mmap(0, ..., MAP_PRIVATE|MAP_32BIT, ...);
  if (sto == MAP_FAILED)
    sto = mmap(0, ..., MAP_PRIVATE, ...);

That is, try to allocate in the low 4GB, and when low addresses are
exhausted, allocate from any location. Thus, once low addresses run
out, every stack allocation does a failing mmap() followed by a
successful mmap(). The failing mmap() is slow because it does a linear
search of all low-space vma's.

Low-address stacks are preferred because some machines context switch
much faster when the stack address has only 32 significant bits. Slow
allocation was discussed in 2003 but without resolution. See, e.g.,

  http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0321.html,
  http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0517.html,
  http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0538.html, and
  http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0520.html.
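The two-step allocation described above can be sketched as a standalone
helper. This is only an illustration of the pattern, not the glibc code:
the names alloc_thread_stack, free_thread_stack and low_failures are
hypothetical, the protection and MAP_ANONYMOUS/MAP_STACK flags are
assumptions filled in for the elided arguments, and MAP_32BIT is defined
to 0 on platforms that lack it so the sketch still compiles there.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/* MAP_32BIT exists only on x86-64 Linux. Defining it (and MAP_STACK) to 0
 * elsewhere is an assumption made for portability of this sketch; the
 * first mmap() then simply succeeds at any address. */
#ifndef MAP_32BIT
#define MAP_32BIT 0
#endif
#ifndef MAP_STACK
#define MAP_STACK 0
#endif

/* Hypothetical counter: how many low-address attempts have failed. */
static unsigned long low_failures;

/* Hypothetical helper mirroring the try-low-then-anywhere pattern. */
static void *alloc_thread_stack(size_t size)
{
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK;

    /* First attempt: restrict the mapping to low addresses. */
    void *sto = mmap(NULL, size, prot, flags | MAP_32BIT, -1, 0);
    if (sto == MAP_FAILED) {
        /* Once low addresses are exhausted, every call pays for the
         * failed attempt above before this second mmap() succeeds. */
        low_failures++;
        sto = mmap(NULL, size, prot, flags, -1, 0);
    }
    return sto == MAP_FAILED ? NULL : sto;
}

/* Release a stack obtained from alloc_thread_stack(). */
static int free_thread_stack(void *sto, size_t size)
{
    return munmap(sto, size);
}
```

Calling alloc_thread_stack() in a loop without freeing, and watching
when low_failures starts to climb, is one way to see the point at which
every allocation begins to pay for a failed mmap().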
With increasing use of threads, slow allocation is becoming a problem.

Some old machines were faster switching 32b stacks, but new machines
seem to switch as fast or faster using 64b stacks. I measured
thread-to-thread context switches on two AMD processors and five Intel
processors. Tests used the same code with 32b or 64b stack pointers;
tests covered varying numbers of threads switched and varying methods
of allocating stacks. Two systems gave indistinguishable performance
with 32b or 64b stacks, four gave 5%-10% better performance using 64b
stacks, and of the systems I tested, only the P4 microarchitecture
x86-64 system gave better performance for 32b stacks, in that case
vastly better. Most systems had thread-to-thread switch costs around
800-1200 cycles. The P4 microarchitecture system had 32b context
switch costs around 3,000 cycles and 64b context switches around 4,800
cycles.

It appears the kernel's 64-bit switch path handles all 32-bit cases.
So on machines with a fast 64-bit path, context switch speed would
presumably be improved yet further by eliminating the special 32-bit
path. It appears this would also collapse the task state's fs and
fsindex fields, and the gs and gsindex fields. These could further
reduce memory, cache, and branch predictor pressure.

Various things would address the slow pthread_create(). Choices
include:

 - Be more platform-aware about when to use MAP_32BIT.

 - Abandon use of MAP_32BIT entirely, with worse performance on some
   machines.

 - Change the mmap() algorithm to be faster on allocation failure
   (avoid a linear search of vmas).

Options to improve context switch times include:

 - Do nothing.

 - Be more platform-aware about when to use different 32b and 64b
   paths.

 - Get rid of the 32b path, which also appears it would make contexts
   smaller.

[Not] Attached is a program to measure context switch costs.
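As a rough illustration of the "be more platform-aware about when to
use MAP_32BIT" choice, the sketch below picks the mmap() flags from
measured switch costs. It is not taken from glibc or the kernel:
prefer_low_stack and stack_mmap_flags are hypothetical names, the
"32b must be strictly cheaper" rule is an assumed policy, and
MAP_ANONYMOUS is an assumption about the elided flags. MAP_32BIT is
defined to 0 where the platform lacks it so the sketch compiles.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/mman.h>

/* MAP_32BIT exists only on x86-64 Linux; 0 elsewhere is an assumption. */
#ifndef MAP_32BIT
#define MAP_32BIT 0
#endif

/* Hypothetical policy: ask for low stacks only when the measured cost of
 * a switch with 32b stack pointers is lower than with 64b ones. */
static int prefer_low_stack(unsigned cycles_32b, unsigned cycles_64b)
{
    return cycles_32b < cycles_64b;
}

/* Hypothetical helper: flags for the first stack mmap() attempt. */
static int stack_mmap_flags(int prefer_low)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    if (prefer_low)
        flags |= MAP_32BIT;   /* low 4GB only, may fail when exhausted */
    return flags;
}
```

With the numbers reported above, prefer_low_stack(3000, 4800) selects
low stacks for the P4 system, while a system where 64b switches are
5%-10% cheaper would skip MAP_32BIT and never issue the failing mmap().
A fallback to an unrestricted mmap() would still be needed on machines
that do request low stacks.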
