A recent discussion on the Linux Kernel mailing list noted that threaded 64-bit applications suffer a drastic slowdown in pthread_create performance once their combined thread-stack usage exceeds 4GB. Ingo Molnar offered an explanation of the problem, "unfortunately MAP_32BIT use in 64-bit apps for stacks was apparently created without foresight about what would happen in the MM when thread stacks exhaust 4GB. The problem is that MAP_32BIT is used both as a performance hack for 64-bit apps and as an ABI compat mechanism for 32-bit apps. So we cannot just start disregarding MAP_32BIT in the kernel - we'd break 32-bit compat apps and/or compat 32-bit libraries." The original report noted that once combined stack usage goes above 4GB, thread creation can take as long as 10 milliseconds, a slowdown described as "quite unacceptable".
Ingo created a patch introducing a new MAP_STACK flag for glibc to use instead of MAP_32BIT, avoiding the 32-bit performance limitation for threaded 64-bit applications. He noted, "glibc can switch to this new flag straight away - it will be ignored by the kernel." The new flag was quickly merged upstream, and corresponding changes were planned for glibc.
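For readers unfamiliar with the mechanism, the intended usage looks roughly like this (an illustrative sketch, not the glibc patch itself): a 64-bit program allocating its own thread stacks passes MAP_STACK instead of MAP_32BIT, and since kernels that predate the flag simply ignore unknown hint bits, the call remains safe everywhere.

    #include <pthread.h>
    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MAP_STACK
    #define MAP_STACK 0x20000   /* x86 value; kernels treat it as a no-op hint */
    #endif

    #define STACK_SIZE (8 * 1024 * 1024)

    static void *worker(void *arg) { return arg; }

    int main(void)
    {
        /* Ask for a stack mapping without forcing it below 4GB (no MAP_32BIT). */
        void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (stack == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstack(&attr, stack, STACK_SIZE);

        pthread_t tid;
        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        return 0;
    }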
Matthew Dillon continues to make significant progress on his HAMMER clustering filesystem for DragonFly BSD. He labeled the latest release 56c, noting that it "represents an additional significant improvement in performance, [also including] bug fixes and most of the final media changes." A significant improvement in write performance was obtained by making the filesystem block size automatically increase from 16K to 64K when a file grows beyond 1 MB. One remaining media change is required to optimize mtime and atime storage, at which point HAMMER will go into testing and bug-fixing mode. Matt noted, "HAMMER's performance is extremely good now, and its system cpu overhead has dropped to roughly the same that we get from UFS", adding, "HAMMER is now able to sustain full disk bandwidth for bulk reads and writes. HAMMER continues to have far superior random-write performance, whether the system caches are blown out or not." Discussing future plans for the filesystem, Matt noted, "I could go on and on, there's so much that can be done with this filesystem :-)" Regarding one of these plans, he offered:
"I am not going to promise it, but there is a slight chance I will be able to get mirroring working by the release. I figured out how to do it, finally. Basically the solution is to add another field to the B-Tree's internal elements... the 'most recent' transaction id, and to propogate it up all the way to the root of the tree. The mirroring code can then optimally scan the B-Tree and pick out all records that have changed relative to some transaction id, allowing it to quickly 'pick up' where it left off and construct a record-level mirror over a fully asynchronous link, without any queueing. You can't get much better then that, frankly. "
"After another round of performance tuning HAMMER all my benchmarks show HAMMER within 10% of UFS's performance, and it beats the shit out of UFS in certain tests such as file creation and random write performance," noted DragonFly BSD creator Matthew Dillon, providing an update on his new clustering filesystem. He continued, "read performance is good but drops more then UFS under heavy write loads (but write performance is much better at the same time)." He then referred to the blogbench benchmark noting, "now when UFS gets past blog #300 and blows out the system caches, UFS's write performance goes completely to hell but it is able to maintain good read performance." Matthew then compared this to HAMMER:
"HAMMER is the opposite. It can maintain fairly good write performance long after the system caches have been blown out, but read performance drops to about the same as its write performance (remember, this is blogbench doing reads from random files). Here HAMMER's read performance drops significantly but it is able to maintain write performance. UFS's write performance basically comes to a dead halt. However, HAMMER's performance numbers become 'unstable' once the system caches are blown out."
Matthew Dillon sent out a series of updates about his developing HAMMER filesystem, noting that he is currently focusing on the reblocking and pruning code, tracking down a number of bugs resulting in B-Tree corruption. He also noted that HAMMER was previously composed of three components: B-Tree nodes, records, and data. In his latest cleanups, he has entirely removed the record structure, "this will seriously improve the performance of directory and inode access." This change did require an on-media format change, "I know I have said this before, but there's a very good chance that no more on-media changes will be made after this point. The official freeze of the on-media format will not occur until the 2.0 release, however."
Matt added, "HAMMER is stable enough now that I am able to run it on my LAN backup box. I'm using it to test that the snapshots work as expected as well as to test the long term effects of reblocking and pruning." He then cautioned:
"Please note that HAMMER is not ready for production use yet, there is still the filesystem-full handling to implement and much more serious testing of the reblocking and pruning code is required, not to mention the crash recovery code. I expect to find a few more bugs, but I'm really happy with the results so far."
"HAMMER is going to be a little unstable as I commit the crash recovery code," began DragonFly BSD creator Matthew Dillon, adding, "I'm about half way through it." He went on to list what's left for crash recovery to work with HAMMER, his new clustering filesystem, "I have to flush the undo buffers out before the meta-data buffers; then I have to flush the volume header so mount can see the updated undo info; then I have to flush out the meta-data buffers that the UNDO info refers to; and, finally, the mount code must scan the UNDO buffers and perform any required UNDOs." He continued:
"The idea being that if a crash occurs at any point in the above sequence, HAMMER will be able to run the UNDOs to undo any partially written meta-data. HAMMER would be able to do this at mount-time and it would probably take less then a second, so basically this gives us our instant crash-recovery feature."
Matt went on to add that, thanks to the clean separation of the front-end VFS operations from the back-end I/O, it would now be possible to fix several stalls in the code, significantly improving HAMMER's performance.
"I wasn't planning on releasing v0.12 yet, and it was supposed to have some initial support for multiple devices. But, I have made a number of performance fixes and small bug fixes, and I wanted to get them out there before the (destabilizing) work on multiple-devices took over," explained Chris Mason regarding the 0.12 release of his new btrfs filesytem. Btrfs was first announced in June of 2007, as an alpha-quality filesystem offering checksumming of all files and metadata, extent based file storage, efficient packing of small files, dynamic inode allocation, writable snapshots, object level mirroring and striping, and fast offline filesystem checks, among other features. The project's website explains, "Linux has a wealth of filesystems to choose from, but we are facing a number of challenges with scaling to the large storage subsystems that are becoming common in today's data centers. Filesystems need to scale in their ability to address and manage large storage, and also in their ability to detect, repair and tolerate errors in the data stored on disk." Regarding the latest release, Chris offered:
"So, here's v0.12. It comes with a shiny new disk format (sorry), but the gain is dramatically better random writes to existing files. In testing here, the random write phase of tiobench went from 1MB/s to 30MB/s. The fix was to change the way back references for file extents were hashed."
"I'd like to send a small update on my progress on the Performance Tracker project," noted Erik Cederstrand on the FreeBSD -current mailing list. He continued, "I now have a small setup of a server and a slave chugging along, currently collecting data. I'm following CURRENT and collecting results from super-smack and unixbench." The project performs regular benchmarks of the FreeBSD -current source tree using Unixbench and Super Smack, allowing you to chart the results over time. Erik highlighted an example of a visible change in performance when the generic kernel moved from the 4BSD scheduler to the ULE scheduler on October 19th, 2007.
Kris Kennaway responded favorably, then noted, "one suggestion I have is that as more metrics are added it becomes important for an 'at a glance' overview of changes so we can monitor for performance improvements and regressions among many workloads." He went on to suggest, "at some point the ability to annotate the data will become important (e.g. 'We understand the cause of this, it was r1.123 of foo.c, which was corrected in r1.124. The developer responsible has been shot.')" Erik agreed with both recommendations, and noted that he would continue to work in that direction.
"This patch speeds up e2fsck on Ext3 significantly using a technique called Metaclustering," stated Abhishek Rai. In an earlier thread he quantified this claim, "this patch will help reduce full fsck time for ext3. I've seen 50-65% reduction in fsck time when using this patch on a near-full file system. With some fsck optimizations, this figure becomes 80%." Most criticism so far has been in regards to formatting issues with the patch preventing it from being easily tested, resolved in the latest postings. It was also cautioned that the patch affects a significant amount of ext3 code, and thus will require very heavy testing. Abhishek described how the patch offers its significant gains for e2fsck:
"Metaclustering refers to storing indirect blocks in clusters on a per-group basis instead of spreading them out along with the data blocks. This makes e2fsck faster since it can now read and verify all indirect blocks without much seeks. However, done naively it can affect IO performance, so we have built in some optimizations to prevent that from happening. Finally, the benefit in fsck performance is noticeable only when indirect block reads are the bottleneck which is not always the case, but quite frequently is, in the case of moderate to large disks with lot of data on them. However, when indirect block reads are not the bottleneck, e2fsck is generally quite fast anyway to warrant any performance improvements."
"I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based)," explained Chris Mason in a recent posting to the Linux Filesystem Development mailing list. He linked to some benchmark results and summarized, "the first round of benchmarking shows that larger block sizes do consume more CPU, especially in metadata intensive workloads, but overall read speeds are much better." Chris then noted, "Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads." David Chinner replied, "the basic conclusion is that different filesystems will benefit in different ways with large block sizes...." explaining:
"Btrfs linearises writes due to it's COW behaviour and this is trades off read speed. i.e. we take more seeks to read data so we can keep the write speed high. By using large blocks, you're reducing the number of seeks needed to find anything, and hence the read speed will increase. Write speed will be pretty much unchanged because btrfs does linear writes no matter the block size.
"XFS doesn't linearise writes and optimises it's layout for a large number of disks and a low number of seeks on reads - the opposite of btrfs. Hence large block sizes reduce the number of writes XFS needs to write a given set of data+metadata and hence write speed increases much more than the read speed (until you get to large tree traversals)."
"I've never looked at the Reiser code though the comments I get from friends who use it are on the order of 'extremely reliable but not the fastest filesystem in the world'," Matt Dillon explained when asked to compare his new clustering HAMMER filesystem with ReiserFS, both of which utilize BTrees to organize objects and records. He continued, "I don't expect HAMMER to be slow. A B-Tree typically uses a fairly small radix in the 8-64 range (HAMMER uses 8 for now). A standard indirect block methodology typically uses a much larger radix, such as 512, but is only able to organize information in a very restricted, linear way." He continued to describe numerous plans he has for optimizing performance, "my expectation is that this will lead to a fairly fast filesystem. We will know in about a month :-)"
Among the optimizations planned, Matt explained, "the main thing you want to do is to issue large I/Os which cover multiple B-Tree nodes and then arrange the physical layout of the B-Tree such that a linear I/O will cover the most likely path(s), thus reducing the actual number of physical I/O's needed." He noted, "HAMMER will also be able to issue 100% asynchronous I/Os for all B-Tree operations, because it doesn't need an intact B-Tree for recovery of the filesystem." He went on to describe another potential optimization allowed by the filesystem's design, "HAMMER is designed to allow cluster-by-cluster reoptimization of the storage layout. Anything that isn't optimally laid out at the time it was created can be re-laid-out at some later time, e.g. with a continuously running background process or a nightly cron job or something of that ilk. This will allow HAMMER to choose to use an expedient layout instead of an optimal one in its critical path and then 'fix' the layout later on to make re-accesses optimal."
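A back-of-the-envelope illustration of why the layout work matters (the numbers are ours, not Matt's): with a radix of 8, locating one record among a million means descending roughly seven B-Tree levels (8^7 ≈ 2 million), versus only two or three levels for a radix-512 indirect-block scheme; laying consecutive levels out linearly so that one large I/O covers the likely path is what lets the B-Tree close that gap without giving up its flexibility.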
"As you can see it in the graph, v2.6.23 schedules much more consistently too. [ v2.6.22 has a small (but potentially statistically insignificant) edge at 4-6 clients, and CFS has a slightly better peak (which is statistically insignificant)."
Ingo noted that he was unable to find information as to how the other benchmark was generated, "there are no .configs or other testing details at or around that URL that i could use to reproduce their result precisely, so at least a minimal bugreport would be nice." He then offered some tips on how sysbench works and some suggested tunings, "sysbench is a pretty 'batched' workload: it benefits most from batchy scheduling: the client doing as much work as it can, then server doing as much work as it can - and so on. The longer the client can work the more cache-efficient the workload is. Any round-trip to the server due to pesky preemption only blows up the cache footprint of the workload and gives lower throughput."
"It looks to be about 2.1% increase in time to do the make/mount/unmount operations with the marker patches in place and no blktrace operations," Alan Brunelle summarized some benchmarks testing the overhead of the kernel markers patches. He continued, "with the blktrace operations in place we see about a 3.8% decrease in time to do the same ops." Block layer maintainer Jens Axboe responded favorably, "thanks for running these numbers. I don't think you have to bother with it more. My main concern was a performance regression, increasing the overhead of running blktrace." He added, "I'd say the above is Good Enough for me," acking the kernel marker patches.
Jens went on to muse, "I do wonder about that performance _increase_ with blktrace enabled. I remember that we have seen and discussed something like this before, it's still a puzzle to me..." Mathieu Desnoyers agreed, "interesting question indeed," going on to suggest possible future tests to understand the unexpected performance increase.
blktrace is a block layer IO tracing tool that provides detailed information about request queue operations; it was originally developed by Jens Axboe and merged into the mainline kernel in 2.6.17-rc1.
"I've just released the 2.6.23-rc9-ext4-1. It collapses some patches in preparation for pushing them to Linus, and adds some of the cleanup patches that had been incorporated into Andrew's broken-out-2007-10-01-04-09 series," announced Theodore Ts'o. He also noted of the current ext4 git tree, "it also has some new development patches in the unstable (not yet ready to push to mainline) portion of the patch series." In an earlier thread Theodore posted a series of patches specifically intended for inclusion in the upcoming 2.6.24 kernel. Included in the patch series was a patch for improving fsck performance, "in performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count." The patch included an explanation of how the feature works, enabled through a mkfs option:
"With this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation."
"By popular demand, here is release -v22 of the CFS scheduler. It is a full backport of the latest & greatest sched-devel.git code to v2.6.23-rc8, v188.8.131.52, v184.108.40.206 and v220.127.116.11," announced Ingo Molnar. He added, "this is the first time the development version of the scheduler has been fed back into the stable backport series, so there's many changes since v20.5". Ingo went on to explain, "even if CFS v20.5 worked well for you, please try this release too, with a good focus on interactivity testing - because, unless some major showstopper is found, this codebase is intended for a v2.6.24 upstream merge." He then summarized some of the changes:
"The changes in v22 consist of lots of mostly small enhancements, speedups, interactivity improvements, debug enhancements and tidy-ups - many of which can be user-visible. (These enhancements have been contributed by many people - see the changelog below and the git tree for detailed credits.)
"The biggest individual new feature is per UID group scheduling, written by Srivatsa Vaddagiri, which can be enabled via the CONFIG_FAIR_USER_SCHED=y .config option. With this feature enabled, each user gets a fair share of the CPU time, regardless of how many tasks each user is running."
"Lots of scheduler updates in the past few days, done by many people," noted Ingo Molnar, going on to describe the more significant updates. "Most importantly, the SMP latency problems reported and debugged by Mike Galbraith should be fixed for good now." Ingo noted that the current code base was looking stable and was likely to be merged into the upcoming 2.6.24 kernel, "so please give it a good workout and let us know if there's anything bad going on. (If this works out fine then i'll propagate these changes back into the CFS backport, for wider testing.)" He went on to describe the other main changes in the development branch of the process scheduler:
"I've also included the latest and greatest group-fairness scheduling patch from Srivatsa Vaddagiri, which can now be used without containers as well (in a simplified, each-uid-gets-its-fair-share mode). This feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.
"Peter Zijlstra has been busy enhancing the math of the scheduler: we've got the new 'vslice' forked-task code that should enable snappier shell commands during load while still keeping kbuild workloads in check."