Alan Cox

Linux: Removing The Big Kernel Lock

Submitted by Jeremy
on April 1, 2010 - 7:52am
Linux news

Arnd Bergmann noted that he's working on removing the BKL from the Linux kernel, "I've spent some time continuing the work of the people on Cc and many others to remove the big kernel lock from Linux and I now have [a] bkl-removal branch in my git tree". He went on to explain that his branch is working, and lets him run the Linux kernel, "on [a] quad-core machine with the only users of the BKL being mostly obscure device driver modules." Arnd noted that this effort has a long history, "the oldest patch in this series is roughly eight years old and is Willy's patch to remove the BKL from fs/locks.c, and I took a series of patches from Jan that removes it from most of the VFS."

Arnd noted that his patch adds a global mutex to the TTY layer, which he called the 'Big TTY Mutex' (BTM), explaining, "the basic idea here is to make recursive locking and the release-on-sleep explicit, so every mutex_lock, wait_event, workqueue_flush and schedule in the TTY layer now explicitly releases the BTM before blocking." Alan Cox suggested that this portion of the patch was best dropped for now, "it would be nice to get the other bits in first removing BKL from most of the kernel and building kernels which are non BKL except for the tty layer. That (after Ingo's box from hell has run it a bit) would reasonably test the assertion that the tty layer has no BKL requirements that are driven by [code] external to tty layer code." Andrew Morton suggested that the patches be pushed upstream to their appropriate maintainers for an additional sanity check, "Seems that there might be a few tricksy bits in here. Please do push at least the non-obvious parts out to the relevant people."
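To make "recursive locking and the release-on-sleep" concrete, here is a minimal user-space sketch in Python rather than kernel C. The class and method names (BigLock, release_for_sleep, reacquire_after_sleep) are hypothetical and are not taken from Arnd's patch; the sketch only models a lock that its owner may take repeatedly and must drop explicitly before blocking.

```python
import threading


class BigLock:
    """Toy model of a recursive lock that is dropped explicitly before sleeping.

    This is an illustration only, not kernel code; all names are hypothetical.
    """

    def __init__(self):
        self._mutex = threading.Lock()
        self._owner = None   # thread id of the current holder, or None
        self._depth = 0      # how many times the holder has taken the lock

    def lock(self):
        me = threading.get_ident()
        if self._owner == me:
            # Recursive acquisition: just bump the nesting depth.
            self._depth += 1
            return
        self._mutex.acquire()
        self._owner = me
        self._depth = 1

    def unlock(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("unlock by a thread that does not hold the lock")
        self._depth -= 1
        if self._depth == 0:
            self._owner = None
            self._mutex.release()

    def release_for_sleep(self):
        """Drop the lock completely before blocking; return the saved depth."""
        if self._owner != threading.get_ident():
            return 0  # caller did not hold the lock, nothing to drop
        depth = self._depth
        self._depth = 0
        self._owner = None
        self._mutex.release()
        return depth

    def reacquire_after_sleep(self, depth):
        """Retake the lock after waking, restoring the saved nesting depth."""
        if depth == 0:
            return
        self._mutex.acquire()
        self._owner = threading.get_ident()
        self._depth = depth
```

A caller would wrap each blocking operation as `saved = lock.release_for_sleep()`, then block, then `lock.reacquire_after_sleep(saved)`. With the old BKL the scheduler did this implicitly; the quoted description says the TTY patch spells the step out at each blocking call.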

Quote: History Is A One Way Street

Submitted by Jeremy
on August 13, 2008 - 2:29pm

"History is a one way street, and you might as well have the fs known the way it is so that people remember 'reiser oh wasn't he the guy who..' - unless you are trying to market the fs I guess."

Quote: This Wants Doing With A Hash Not A Prayer

Submitted by Jeremy
on August 6, 2008 - 1:19pm

"This wants doing with a hash not a prayer that '32 slots is enough'."

Proposing Read-Only ZFS

Submitted by Jeremy
on July 22, 2008 - 6:42pm
Linux news

A recent thread on the lkml discussed a blog entry stating that minimal ZFS support for GRUB was available under the GPL, "we could now use that code to implement support for ZFS in the Linux kernel." Alan Cox explained, "no we can't. The GPL ZFS bits don't include the various methods that would violate the patent so there is no grant. I've several times asked Sun to simply give permission and they don't even answer. I can only read the Sun motivation one way - they want to look open but know that ZFS is about the only thing that might save Solaris as a product in the data centre so are not truly prepared to let Linux use it." H. Peter Anvin added, "from what I can see, it is an absolutely-minimal read only implementation."

Christoph Hellwig offered, "adding a read-only for the start zfs driver for Linux would be useful for various purposes. And adding read-only filesystems to Linux is really easy." Referring to the individual who started the discussion, he added, "if Fred really cares about it I'd be very happy to mentor him implementing it. It should be a very good learning exercise for him." When asked if this offer applied to anyone else, Christoph replied, "yes, this offer is of course up to everyone interested. But it's not purely an integration effort in the traditional sense, the grub filesystem interface is quite different from the Linux one, and the code structure and style is quite different. But if you're willing to learn it should be very interesting."

Removing the Big Kernel Lock

Submitted by Jeremy
on May 15, 2008 - 8:52am
Linux news

"As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptible BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptible code again. This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL. Ingo explained:

"This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch. It takes top people like Alan Cox to map the semantics and to remove BKL code, and even for Alan (who is doing this for the TTY code) it is a long and difficult task."

Ingo went on to describe how the BKL works, how it differs from other locking mechanisms, and why this complicates removing it permanently from the kernel. He noted that the various dependencies of the lock are lost in the haze of 15 years of code changes: "all this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong." He then suggested "changing the rules of the game" by creating a "kill-the-BKL" branch which "turns the BKL into an ordinary albeit somewhat big mutex, with a quirky lock/unlock interface called 'lock_kernel()' and 'unlock_kernel()'."

Quote: They Have All The Source Code But We Don't Have Theirs

Submitted by Jeremy
on March 26, 2008 - 7:41am

"Any problems beyond that point are ones you need to take up with Nvidia as they have all the source code but we don't have theirs."

Quote: If You Wrote Code, I'd Be Worried

Submitted by Jeremy
on January 17, 2008 - 6:24am

"I'm so glad you have nothing better to do than troll, if you actually wrote code I'd be worried it might get into something people used."

Quote: Repeatedly Posting Crud

Submitted by Jeremy
on January 2, 2008 - 11:41pm

"Repeatedly posting crud does not make it right."

Quote: Time For An -ac Tree Again

Submitted by Jeremy
on December 12, 2007 - 6:35am

"Must be time for an -ac tree again."

Quote: Sensible Defaults

Submitted by Jeremy
on November 20, 2007 - 9:14am

"Thats a very arrogant viewpoint. I don't have to be a TV engineer to use my television. Distributions should be providing sensible defaults out of the box. The kernel already provides them the mechanisms."

Quote: Very Very Questionable

Submitted by Jeremy
on November 12, 2007 - 10:33am

"I have difficulty constructing many scenarios where its useful but it appears valid providing you can tightly control file renaming, which is very very questionable."

Quote: Poor Security Can Be Worse Than No Security

Submitted by Jeremy
on October 25, 2007 - 9:13am

"There is a ton of evidence both in computing and outside of it which shows that poor security can be very much worse than no security at all. In particular stuff which makes users think they are secure but is worthless is very dangerous indeed."

Quote: I Don't Care About AppArmor

Submitted by Jeremy
on October 22, 2007 - 9:40pm

"Frankly I don't care about apparmor, I don't see it as a serious project. Smack is kind of neat but looks like a nicer way to specify selinux rules."

Virtually Debugging

Submitted by Jeremy
on October 15, 2007 - 10:42am
Linux news

"Incidentally i was thinking about using KVM for automated testing. Important pieces of hardware should get an in-KVM simulator/emulator, that way developers who do not own that hardware can do functionality testing too," Ingo Molnar suggested during a thread discussing a SCSI driver bug fix. Linus Torvalds was originally unimpressed by the idea:

"Using emulators to test device drivers is almost certain to be pointless. The problem with device drivers tends to be timing issues, odd hardware interactions, and lots of strange (and sometimes undocumented) behaviour and dependencies (eg things like 'you have to wait 50us after setting the reset bit until the hardware has actually reset'). These are all things that you'd generally not catch in emulation - because the emulation by necessity is only going to be a very weak picture of the real thing."

Alan Cox countered, "for some things. I do it a bit because you can use it to fake failures that are tricky to do in the real world. It won't tell you the driver works but its surprisingly good for testing for races (forcing IRQ delivery at specific points), buggy hardware you don't posses, and things like media failures and timeouts your real hardware refuses to do." Linus acquiesced conditionally, "I do agree that you likely find bugs, even if quite often it's exactly because the behaviour is something that will never happen on real hardware," then acknowledged previous debugging efforts by Alan, "but failure testing is very useful - I forget who it was who debugged some driver by taking a CD and just scratching it mercilessly to induce read errors ;)" Ingo added, "something like that wont enable 100% coverage (or even reasonable coverage for most hardware), so it's no replacement for actual hard testing, but it could push out the domain of minimally tested code quite a bit and increase the quality of the kernel."
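The kind of failure testing Alan describes can be sketched as follows. This is a toy Python model, not KVM or kernel code: FakeDisk, MediaError, DeviceTimeout and read_with_retry are hypothetical names invented for illustration. The emulated device is told in advance which sectors are unreadable and how many times a sector should time out, so the retry path of a driver-like routine can be exercised on demand.

```python
class MediaError(Exception):
    """Raised by the fake device for a permanently unreadable sector."""


class DeviceTimeout(Exception):
    """Raised by the fake device when a request is made to time out."""


class FakeDisk:
    """Emulated disk that injects failures on request (hypothetical model)."""

    def __init__(self, sectors, bad_sectors=(), timeouts=None):
        self.sectors = list(sectors)
        self.bad_sectors = set(bad_sectors)
        # Maps sector number -> how many more reads of it should time out.
        self.timeouts = dict(timeouts or {})

    def read(self, n):
        if n in self.bad_sectors:
            raise MediaError(n)
        if self.timeouts.get(n, 0) > 0:
            self.timeouts[n] -= 1
            raise DeviceTimeout(n)
        return self.sectors[n]


def read_with_retry(disk, n, retries=3):
    """Driver-like read: retry on timeout, give up after `retries` attempts.

    Media errors are not retried and propagate to the caller.
    """
    for _ in range(retries):
        try:
            return disk.read(n)
        except DeviceTimeout:
            continue
    raise DeviceTimeout(n)


# Sector 1 times out twice and then succeeds; sector 2 is permanently bad.
disk = FakeDisk([b"a", b"b", b"c"], bad_sectors=[2], timeouts={1: 2})
data = read_with_retry(disk, 1)
```

As Alan says in the thread, a test like this does not show that a driver works on real hardware. It only makes rare error paths reproducible.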

Supporting More Partitions

Submitted by Jeremy
on October 8, 2007 - 7:01am
Linux news

"15 partitions (at least for sd_mod devices) are too few," Jan Engelhardt suggested along with a patch to try and make the mounting of an unlimited number of partitions possible. H. Peter Anvin proposed as an alternative, "now when we have 20-bit minors, can't we simply recycle some of the higher bits for additional partitions, across the board? 63 partitions seem to have been sufficient; at least I haven't heard anyone complain about that for 15 years."

Alan Cox explained, "this was proposed ages ago. Al Viro vetoed sparse minors and it has been stuck this way ever since. If you have > 15 partitions use device mapper for it. I'd prefer it fixed but it's arguable that device mapper is the right way to punt all our partitioning to userspace."
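The arithmetic behind the 15-partition limit can be sketched in Python. The 12-bit major / 20-bit minor split matches the kernel's internal device-number encoding, and the SCSI disk driver reserves 16 minors per disk (the whole disk plus 15 partitions). The extended layout below is purely an assumption made for illustration: it shows one way higher minor bits could be "recycled" to reach 63 partitions while leaving existing numbers unchanged, and it is not the layout from any posted patch. All function names are hypothetical.

```python
MINORBITS = 20
MINORMASK = (1 << MINORBITS) - 1


def mkdev(major, minor):
    """Pack a major/minor pair the way the kernel's MKDEV() macro does."""
    return (major << MINORBITS) | minor


def major(dev):
    return dev >> MINORBITS


def minor(dev):
    return dev & MINORMASK


def legacy_sd_minor(disk, partition):
    """Classic sd layout: 16 minors per disk, partition 0 is the whole disk."""
    if not 0 <= partition <= 15:
        raise ValueError("legacy layout allows at most 15 partitions per disk")
    return disk * 16 + partition


def extended_minor(disk, partition):
    """Hypothetical layout: keep bits 0-3 (partition) and 4-7 (disk) as before,
    and store two extra partition bits in bits 8-9, giving partitions 0..63."""
    if not 0 <= partition <= 63 or not 0 <= disk <= 15:
        raise ValueError("out of range for this illustrative layout")
    return (disk << 4) | (partition & 0xF) | ((partition >> 4) << 8)


def decode_extended(m):
    """Inverse of extended_minor(): return (disk, partition)."""
    partition = (m & 0xF) | (((m >> 8) & 0x3) << 4)
    disk = (m >> 4) & 0xF
    return disk, partition


# Major 8 is the SCSI disk major; minor 17 is disk 1, partition 1 (sdb1).
sdb1 = mkdev(8, legacy_sd_minor(1, 1))
```

In this sketch, partitions 1 to 15 keep their legacy minor numbers, so existing device nodes would stay valid. The price is that one disk's minors are no longer contiguous, which is the "sparse minors" arrangement Alan says was vetoed.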