Linux news

Linux: 2.6.34-rc4, "Hunting A Really Annoying VM Regression"

Submitted by Jeremy
on April 12, 2010 - 10:16pm
Linux news

"It's been two weeks rather than the usual one, because we've been hunting a really annoying VM regression that not a lot of people seem to have seen, but I didn't want to release an -rc4 with it," began Linux creator Linus Torvalds, announcing the 2.6.34-rc4 Linux kernel. He explained, "we had the choice of either reverting all the anon-vma scalability improvements, or finding out exactly what caused the regression and fixing it. And we got pretty close to the point where I was going to just revert it all." Linus continued:

"Absolutely _huge_ kudos to Borislav Petkov who reported the problem and was able to not just reliably reproduce it, but also test new patches to try to narrow things down at a moments notice. The thing took ten days of emails flying back and forth, and Borislav was there all the time, day and night, through several patches that tried to fix it (several real bugs, but not the one he hit) and lots of patches to just add instrumentation to get us nearer to the cause of the problem. And finally, today, confirmation that we actually nailed the problem. So if anybody has been seeing a oops (or sometimes a GP fault) in page_referenced(), that should be gone now."

As for the rest of the changes, Linus noted, "the bulk of the changes come from drivers - a new network driver (cxgb4), but also updates to the radeon and nouveau drivers. And then there is the random updates everywhere." Read on for the full changelog.

Linux: Properly Creating And Testing Patches

Submitted by Jeremy
on April 9, 2010 - 1:18pm
Linux news

"If you're wondering why I'm taking a long time to respond to your patches,", began Theodore Ts'o on the linux-ext4 mailing list, in a thread that offered much insight into how and why to properly submit and test patches. "Patches that are accepted into mainline should do one and only one thing," Ted continued, "so if someone suggests that you make changes to your submitted patch, ideally what you should do is to resubmit the patch with the fixes --- and not submit a patch which is a delta to the previous one." He also noted that patch submitters often greatly outnumber maintainers dictating a higher standard of quality, "consider that for some maintainers, there may be 10 or 20 or 30 or more patch submitters in their subsystem. With that kind of submitter-to-maintainer ratio, the patch submitter simply has to do much more of the work, since otherwise the subsystem maintainer simply can't keep up."

Ted went on to acknowledge, "I happen to believe that we need to encourage newcomers to the kernel developer community, and so I spend more time mentoring people who are new to the process." He noted that his time was finite however, and that patches are accepted more quickly when they are easy to review and integrate. Regarding the filesystem for which the patches had been submitted, he added, "Ext4 is actually quite stable at this point. Very large numbers of people are using it, and most users are quite happy." For this reason, he pointed out that it is even more critical that the patches merged be of high quality. That said, he continued, "there is no such thing as code which is not buggy. For any non-trivial program, it's almost certain there are bugs. [...] Ext4 is not exempt from these fundamental laws of software engineering. 'Code is always buggy until the last user of the program dies'." He tied this back to the importance of testing patches before submitting, "keep in mind that the maxim that code which is not buggy also applies to your patches."

Linux: Memory Compaction

Submitted by Jeremy
on April 2, 2010 - 5:11pm
Linux news

Mel Gorman posted the seventh version of his Memory Compaction patches asking, "are there any further obstacles to merging?" The patches, first posted in May of 2007, provide a mechanism for moving GFP_MOVABLE pages into a smaller number of pageblocks, reducing externally fragmented memory. Mel explains that 'compaction' is another method of defragmenting memory, "for example, lumpy reclaim is a form of defragmentation as was slub 'defragmentation' (really a form of targeted reclaim). Hence, this is called 'compaction' to distinguish it from other forms of defragmentation."

The core compaction patch explains that memory is compacted in a zone by relocating movable pages towards the end of the zone:

"A single compaction run involves a migration scanner and a free scanner. Both scanners operate on pageblock-sized areas in the zone. The migration scanner starts at the bottom of the zone and searches for all movable pages within each area, isolating them onto a private list called migratelist. The free scanner starts at the top of the zone and searches for suitable areas and consumes the free pages within making them available for the migration scanner. The pages isolated for migration are then migrated to the newly isolated free pages."

Linux: Removing The Big Kernel Lock

Submitted by Jeremy
on April 1, 2010 - 7:52am
Linux news

Arnd Bergmann noted that he's working on removing the BKL from the Linux kernel, "I've spent some time continuing the work of the people on Cc and many others to remove the big kernel lock from Linux and I now have [a] bkl-removal branch in my git tree". He went on to explain that his branch is working, and lets him run the Linux kernel, "on [a] quad-core machine with the only users of the BKL being mostly obscure device driver modules." Arnd noted that this effort has a long history, "the oldest patch in this series is roughly eight years old and is Willy's patch to remove the BKL from fs/locks.c, and I took a series of patches from Jan that removes it from most of the VFS."

Arnd noted that his patch adds a global mutex to the TTY layer, which he called the 'Big TTY Mutex' and described as, "the basic idea here is to make recursive locking and the release-on-sleep explicit, so every mutex_lock, wait_event, workqueue_flush and schedule in the TTY layer now explicitly releases the BTM before blocking." Alan Cox suggested that this portion of the patch was best dropped for now, "it would be nice to get the other bits in first removing BKL from most of the kernel and building kernels which are non BKL except for the tty layer. That (after Ingo's box from hell has run it a bit) would reasonably test the assertion that the tty layer has no BKL requirements that are driven by [code] external to tty layer code." Andrew Morton suggested that the patches be pushed upstream to their appropriate maintainers for an additional sanity check, "Seems that there might be a few tricksy bits in here. Please do push at least the non-obvious parts out to the relevant people."

Linux: First Release Of nftables

Submitted by Jeremy
on April 1, 2009 - 2:05pm
Linux news

Netfilter maintainer Patrick McHardy recently announced a first alpha-release of nftables, slated to eventually replace iptables as the standard Linux packet filtering engine. Nftables aims to simplify the kernel ABI, reduce code duplication, improve error reporting, and provide more efficient execution, storage and updates of filtering rules. Patrick began with a high level overview of the three pieces that comprise the firewall, "the kernel provides a netlink configuration interface, as well as runtime ruleset evaluation using a small classification language interpreter. libnl contains the low-level functions for communicating with the kernel, the nftables frontend is what the user interacts with." An insightful overview can be found on lwn.net.

Patrick explained that data is represented internally in a generic fashion, "meaning it's possible to use any matching feature (ranges, masks, set lookups etc.) with any kind of data." He went on to add, "the kernel doesn't have a distinction between matches and targets anymore, operations can be arbitrarily chained, fixing a common complaint that multiple rules are required to f.i. log and drop a packet. Terminal operations will stop evaluation of a rule, even if further operations are specified." Speaking about the the userspace frontend, he noted, "the classification language is based on a real grammar that is parsed by a bison-generated parser (currently, it might have to be replaced) and converted to a syntax tree." Patrick continued, "the frontend supports both dealing with only a single rule at a time for incremental operations, as well as parsing entire files. In the later case verification is performed on all rules and changes are only made after full validation. Currently not implemented, but planned, is transactional semantic where changes are rolled back when the kernel reports an error."

2.6.27-rc8, "This One Should Be The Last One"

Submitted by Jeremy
on September 30, 2008 - 8:55am
Linux news

"So yet another week, another -rc," began Linux creator, Linus Torvalds, announcing the 2.6.27-rc8 Linux kernel. He continued, "this one should be the last one: we're certainly not running out of regressions, but at the same time, at some point I just have to pick some point, and on the whole the regressions don't look _too_ scary. And -rc8 obviously does fix more of them." Linus went on to note that most of the changes since -rc7 are small, "and there aren't even a whole lot of them."

Jiri Kosina cautioned that there is still an unknown bug affecting the e1000e driver currently in the 2.6.27 kernel, "rendering the cards unusable for most of the i-am-not-a-hacker users (and remember, even Dave Airlie bricked his laptop completely to death, when trying to restore eeprom contents)" When asked how to duplicate the bug, Jiri noted that the inability to reliably reproduce the bug added to the difficulty in debugging the problem, "apparently it is some kind of race, as it usually takes multiple cycles to trigger".

2.6.27-rc6, "Things Are Calming Down"

Submitted by Jeremy
on September 12, 2008 - 3:42pm
Linux news

"The patches most people hopefully care about tend to be small details," noted Linus Torvalds, announcing the 2.6.27-rc6 kernel. He continued, "and so more regressions should hopefully be closed now, some by just reverting the commits that caused breakage. I don't think anything special merits explicit comment, but you can get a flavor for things by scanning the appended shortlog." Earlier in the announcement email, Linus did note some specifics about which drivers caused the bulk of the patch:

"Same old deal - except it's been almost two weeks since -rc5. That said, the diff is actually about the same size, so I guess that means things are calming down. Most of the diff (bulk-wise) is updates to the new gspca (standard USB webcam) driver, although some of it is also removal of the dead miropcm20* driver."

Tux3 Acting Like A Filesystem

Submitted by Jeremy
on September 4, 2008 - 8:44am
Linux news

Daniel Phillips noted that his new Tux3 versioning filesystem is now operating like a filesystem, "the last burst of checkins has brought Tux3 to the point where it undeniably acts like a filesystem: one can write files, go away, come back later and read those files by name. We can see some of the hoped for attractiveness starting to emerge: Tux3 clearly does scale from the very small to the very big at the same time. We have our Exabyte file with 4K blocksize and we can also create 64 Petabyte files using 256 byte blocks." He went on to discuss some of the remaining features yet to be implemented, including atomic commits, versioning, coalesce on delete, a version of the filesystem written in the kernel, extents, locking, and extended attributes.

Reviewing the above list, Daniel decided he would work next on the coalesce on delete functionality, noting, "without this we can still delete files but we cannot recover file index blocks, only empty them, not so good." He added that at this time he was only going to focus on file truncation, "as soon as file truncation is added to the test mix we will see much more interesting behavior from the bitmap allocator, and we will discover some great ways to generate horrible fragmentation issues. Yummy." Daniel continued to point out that Tux3 is an open source project, and as such is always looking for others to participate, "whoever wants to carve their initials on what is starting to look like a for-real Linux filesystem, now is a great time to take a flyer. The code base is still tiny, builds fast, has lots of interactive feedback and is easy to work on. And you get to put your email address near the beginning of the list, which will naturally write its way into the history of open source. Probably."

2.6.27-rc5, Fixing Regressions

Submitted by Jeremy
on September 1, 2008 - 7:48am
Linux news

Linus Torvalds announced the 2.6.27-rc5 Linux Kernel, noting that his "weekly releases" tend to happen every eight days, adding, "the bulk of it is all config updates, and with arm and powerpc leading the pack." Linus continued:

"While the config updates amount to about three quarters of the diff, and if you don't use a rename-aware diff the blackfin include file movement pretty much accounts for the rest, hidden behind all those trivial (but bulky) changes are a lot of small changes that hopefully fix a number of regressions.

"The most exciting (well, for me personally - my life is apparently too boring for words) was how we had some stack overflows that totally corrupted some basic thread data structures. That's exciting because we haven't had those in a long time. The cause turned out to be a somewhat overly optimistic increase in the maximum NR_CPUS value, but it also caused some introspection about our stack usage in general. Including things like a patch to gcc to fix insane stack usage for vararg functions on x86-64. But that one would only hit anybody who was a bit too adventurous and selected the big 4096 CPU configuration. The rest of the regressions fixed are a bit more pedestrian."

2.6.27-rc4, "Random Stuff All Over"

Submitted by Jeremy
on August 25, 2008 - 2:10pm
Linux news

"Another week, another -rc," began Linus Torvalds, announcing the 2.6.27-rc4 Linux kernel, continuing, "this time the diffstat is almost totally dominated by the addition of the musb driver that drives the MUSB and TUSB controllers integrated into omap2430 and davinci. That, together with the removal of the auerswald USB driver (replaced by libusb version) is more than half of the bulk of the patch, and obviously most users won't ever notice." Linus added:

"Apart from those bulky USB updates, there's some arch updates (blackfin and ia64), network and input driver updates, and an XFS and UBIFS update. The rest is mostly random stuff all over, probably best described by the appended shortlog. A number of regressions should be off the table, but more remain..."

AXFS, Advanced Execute In Place Filesystem

Submitted by Jeremy
on August 21, 2008 - 8:10pm
Linux news

"I'd like to get a first round of review on my AXFS filesystem," began Jared Hulbert, describing his new Advanced XIP File System for Linux. XIP stands for eXecute-In-Place. The new filesystem received quite a bit of positive feedback. Jared offered the following description:

"This is a simple read only compressed filesystem like Squashfs and cramfs. AXFS is special because it also allows for execute-in-place of your applications. It is a major improvement over the cramfs XIP patches that have been floating around for ages. The biggest improvement is in the way AXFS allows for each page to be XIP or not. First, a user collects information about which pages are accessed on a compressed image for each mmap()ed region from /proc/axfs/volume0. That 'profile' is used as an input to the image builder. The resulting image has only the relevant pages uncompressed and XIP. The result is smaller memory sizes and faster launches."

64-bit Application Thread Creation Performance

Submitted by Jeremy
on August 18, 2008 - 12:51pm
Linux news

A recent discussion on the Linux Kernel mailing list noted that threaded 64-bit applications suffer a drastic slowdown in pthread_create performance when stack utilization goes above 4GB. Ingo Molnar offered an explanation of the problem, "unfortunately MAP_32BIT use in 64-bit apps for stacks was apparently created without foresight about what would happen in the MM when thread stacks exhaust 4GB. The problem is that MAP_32BIT is used both as a performance hack for 64-bit apps and as an ABI compat mechanism for 32-bit apps. So we cannot just start disregarding MAP_32BIT in the kernel - we'd break 32-bit compat apps and/or compat 32-bit libraries." The original report noted that once the shared stack goes above 4GB in size, thread creation can take as long as 10 milliseconds, a slowdown described as "quite unacceptable".

Ingo created a patch introducing a new MAP_STACK flag for glibc to be used instead of MAP_32BIT and avoid imposing the 32-bit performance limitation on threaded 64-bit applications. He noted, "glibc can switch to this new flag straight away - it will be ignored by the kernel." The new flag was quickly merged upstream, and changes were planned for glibc.

Tux3 Hierarchical Structure

Submitted by Jeremy
on August 14, 2008 - 7:04pm
Linux news

"It is about time to take a step back and describe what I have been implementing," began Daniel Phillips, referring to his new Tux3 filesystem. He provided a simple ASCII diagram that detailed the filesystem's hierarchical structure, describing each of the elements. About one he noted, "the volume table is a new addition not central to the goals of Tux3, but a nice feature to have given that it comes nearly for free. One Tux3 volume can have an arbitrary number of separate filesystems tucked inside it, indexed by a simple integer parameter at mount time. People say they like this idea and it imposes no significant complexity, so it goes in." Daniel continued:

"Each volume has a metablock pointing at the forward log chain for the volume, a version table that describes the hierarchical relationship between versions (snapshots), an atime table to take care of that horrid legacy Unix feature, and an inode table containing files and attributes of files. [...] Versioning takes place in three places, versioned pointers in the atime btree, versioned extents in a file data btree and versioned attributes in the inode table. [...] Notice the absence of a journal, the functionality of which is provided by forward log elements that I described in the Hammer thread (and will eventually write a separate post about)."

2.6.27-rc3, "Things Really _Have_ Calmed Down"

Submitted by Jeremy
on August 13, 2008 - 11:49am
Linux news

"Things really _have_ calmed down, and hopefully we've also resolved a lot of the regressions in -rc3," began Linus Torvalds, announcing the 2.6.27-rc3 Linux kernel. He noted that much of the patch size was from the inclusion of the new ath9k wireless driver, with much of the rest of the patch size due to the renaming of many arch include files in the ARM, AVR32 and m68lnommu architectures. Linus continued:

"All the small changes are where the regression fixes are, and other random improvements. And they're all over. The ShortLog (appended) probably gives a taste of it."

LVM Snapshot Merging

Submitted by Jeremy
on August 9, 2008 - 11:02am
Linux news

Mikulas Patocka announced new patches introducing snapshot merging for the Linux kernel's logical volume manager. He explained, "snapshot merging allows you to merge snapshot content back into the original device. The most useful use for this feature is the possibility to rollback [the] state of the whole computer after [a] failed package upgrade, [or an] administrator's error". The patches are for the 2.6.26 kernel, with device mapper 1.02.27 and LVM2.2.02.39.

Mikulas noted that there are three types of merges supported, --nameorigin, --namesnapshot, and --onactivate. The default merge method is --nameorigin, which can merge a snapshot into the origin volume, which can be mounted at any time after the merge starts. The --namesnapshot method merges into a snapshot, which can then be mounted. And the --onactive method schedules a merge to happen the next time the volume is activated, such as during a reboot. Mikulas noted, "this implementation of snapshot merging is meant to be stable, report any possible bugs to me."