With his recent release of 2.5.59-mm5, Andrew Morton [interview] explained the anticipatory I/O scheduler that he has recently merged from Nick Piggin. The resulting thread is an informative read, explaining how and why this noticeably increases performance.
Much of the benefit gained from anticipatory I/O scheduling is from the fact that read operations are usually synchronous, meaning that the first read has to happen and report back before the next read is scheduled. Without this recent patch, a command requiring multiple synchronous reads gets each of them scheduled at the back of the queue one at a time, resulting in noticeably poor performance during streamed writes. With this patch, the reads are moved toward the front of the queue, and a few millisecond pause is actually added "anticipating" the next read request. Andrew's full explanation follows, going into much more detail.
From: Andrew Morton To: linux-kernel, linux-mm Subject: 2.5.59-mm5 Date: Thu, 23 Jan 2003 19:50:44 -0800 http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/ . -mm3 and -mm4 were not announced - they were sync-up patches as we worked on the I/O scheduler. . -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler. Here's the scoop: The problem being addressed here is (mainly) kernel behaviour when there is a stream of writeout happening, and someone submits a read. In 2.4.x, the disk queues contain up to 30 megabytes of writes (say, one seconds's worth). When a read is submitted the 2.4 I/O scheduler will try to insert that at the right place between the writes. Usually, there is no right place and the read is appended to the queue. That is: it will be serviced in one second. But the problem with reads is that they are dependent - neither the application nor the kernel can submit read #N until read #N-1 has completed. So something as simple as cat /usr/src/linux/kernel/*.c > /dev/null requires several hundred dependent reads. And in the presence of a streaming write, each and every one of those reads gets stuck at the end of the queue, and takes a second to propagate to the head. The `cat' takes hundreds of seconds. The celebrated read-latency2 patch recognises the fact that appending a read to a tail of writes is dumb, and puts the read near the head of the queue of writes. It provides an improvement of up to 30x. The deadline I/O scheduler in 2.5 does the same thing: if reads are queued up, promote them past writes, even if those writes have been waiting longer. So far so good, but these fixes are still dumb. Because we're solving the dependent read problem by creating a seek storm. Every time someone submits a read, we stop writing, seek over and service the read, and then *immediately* seek back and start servicing writes again. But in the common case, the application which submitted a read is about to go and submit another one, closeby on-disk to the first. So whoops, we have to seek back to service that one as well. So what anticipatory scheduling does is very simple: if an application has performed a read, do *nothing at all* for a few milliseconds. Just return to userspace (or to the filesystem) in the expectation that the application or filesystem will quickly submit another read which is closeby. If the application _does_ submit the read then fine - we service that quickly. If it does not submit a read then we lose. Time out and go back to doing writes. The end result is a large reduction in seeking - decreased read latency, increased read bandwidth and increased write bandwidth. The code as-is has rough spots and still needs quite some work. But it appears to be stable. The test which I have concentrated on is "how long does my laptop take to compile util-linux when there is a continuous write happening". On ext2, mounted noatime: 2.4.20: 538 seconds 2.5.59: 400 seconds 2.5.59-mm5: 70 seconds No streaming write: 48 seconds A couple of VFS changes were needed as well. More details on anticipatory scheduling may be found at http://www.cs.rice.edu/~ssiyer/r/antsched/ Changes since 2.5.59-mm2: +preempt-locking.patch Speed up the smp preempt locking. +ext2-allocation-failure-fix.patch ext2 ENOSPC crash fix +ext2_new_block-fixes.patch ext2 cleanups +hangcheck-timer.patch A form of software watchdog +slab-irq-fix.patch Fix a BUG() in slab when memory exhaustion happens at a bad time. +sendfile-security-hooks.patch Reinstate lost security hooks around sendfile() +buffer-io-accounting.patch Fix IO-wait acounting +aic79xx-linux-2.5.59-20030122.patch aic7xxx driver update +topology-remove-underbars.patch cleanup +mandlock-oops-fix.patch file locking fix +reiserfs_file_write.patch reworked reiserfs write code. -exit_mmap-fix2.patch Dropped +generic_file_readonly_mmap-fix.patch Fix MAP_PRIVATE mmaps for filesystems which don't support ->writepage() +seq_file-page-defn.patch Compile fix +exit_mmap-fix-ppc64.patch +exit_mmap-ia64-fix.patch Fix the exit_mmap() problem in arch code. +show_task-fix.patch Fix oops in show_task() +scsi-iothread.patch software suspend fix +numaq-ioapic-fix2.patch NUMAQ stuff +misc.patch Random fixes +writeback-sync-cleanup.patch remove some junk from fs-writeback.c +dont-wait-on-inode.patch Fix large delays in the writeback path +unlink-latency-fix.patch Fix large delays in unlink() +anticipatory_io_scheduling-2_5_59-mm3.patch Anticipatory scheduling implementation All 65 patches: kgdb.patch devfs-fix.patch deadline-np-42.patch (undescribed patch) deadline-np-43.patch (undescribed patch) setuid-exec-no-lock_kernel.patch remove lock_kernel() from exec of setuid apps buffer-debug.patch buffer.c debugging warn-null-wakeup.patch reiserfs-readpages.patch reiserfs v3 readpages support fadvise.patch implement posix_fadvise64() ext3-scheduling-storm.patch ext3: fix scheduling storm and lockups auto-unplug.patch self-unplugging request queues less-unplugging.patch Remove most of the blk_run_queues() calls lockless-current_kernel_time.patch Lockless current_kernel_timer() scheduler-tunables.patch scheduler tunables htlb-2.patch hugetlb: fix MAP_FIXED handling kirq.patch kirq-up-fix.patch Subject: Re: 2.5.59-mm1 ext3-truncate-ordered-pages.patch ext3: explicitly free truncated pages prune-icache-stats.patch add stats for page reclaim via inode freeing vma-file-merge.patch mmap-whitespace.patch read_cache_pages-cleanup.patch cleanup in read_cache_pages() remove-GFP_HIGHIO.patch remove __GFP_HIGHIO quota-lockfix.patch quota locking fix quota-offsem.patch quota semaphore fix oprofile-p4.patch oprofile_cpu-as-string.patch oprofile cpu-as-string preempt-locking.patch Subject: spinlock efficiency problem [was 2.5.57 IO slowdown with CONFIG_PREEMPT enabled) wli-11_pgd_ctor.patch (undescribed patch) wli-11_pgd_ctor-update.patch pgd_ctor update stack-overflow-fix.patch stack overflow checking fix ext2-allocation-failure-fix.patch Subject: [PATCH] ext2 allocation failures ext2_new_block-fixes.patch ext2_new_block cleanups and fixes hangcheck-timer.patch hangcheck-timer slab-irq-fix.patch slab IRQ fix Richard_Henderson_for_President.patch Subject: [PATCH] Richard Henderson for President! parenthesise-pgd_index.patch Subject: i386 pgd_index() doesn't parenthesize its arg sendfile-security-hooks.patch Subject: [RFC][PATCH] Restore LSM hook calls to sendfile macro-double-eval-fix.patch Subject: Re: i386 pgd_index() doesn't parenthesize its arg mmzone-parens.patch asm-i386/mmzone.h macro paren/eval fixes blkdev-fixes.patch blkdev.h fixes remove-will_become_orphaned_pgrp.patch remove will_become_orphaned_pgrp() buffer-io-accounting.patch correct wait accounting in wait_on_buffer() aic79xx-linux-2.5.59-20030122.patch aic7xxx update MAX_IO_APICS-ifdef.patch MAX_IO_APICS #ifdef'd wrongly dac960-error-retry.patch Subject: [PATCH] linux2.5.56 patch to DAC960 driver for error retry topology-remove-underbars.patch Remove __ from topology macros mandlock-oops-fix.patch ftruncate/truncate oopses with mandatory locking put_user-warning-fix.patch Subject: Re: Linux 2.5.59 reiserfs_file_write.patch Subject: reiserfs file_write patch vmlinux-fix.patch vmlinux fix smalldevfs.patch smalldevfs sound-firmware-load-fix.patch soundcore.c referenced non-existent errno variable generic_file_readonly_mmap-fix.patch Fix generic_file_readonly_mmap() seq_file-page-defn.patch Include in fs/seq_file.c, as it uses PAGE_SIZE exit_mmap-fix-ppc64.patch exit_mmap-ia64-fix.patch Fix ia64's 64bit->32bit app switching show_task-fix.patch Subject: [PATCH] 2.5.59: show_task() oops scsi-iothread.patch scsi_eh_* needs to run even during suspend numaq-ioapic-fix2.patch NUMAQ io_apic programming fix misc.patch misc fixes writeback-sync-cleanup.patch dont-wait-on-inode.patch unlink-latency-fix.patch anticipatory_io_scheduling-2_5_59-mm3.patch Subject: [PATCH] 2.5.59-mm3 antic io sched
From: Alex Bligh Subject: Re: 2.5.59-mm5 Date: Fri, 24 Jan 2003 11:03:28 -0000 --On 23 January 2003 19:50 -0800 Andrew Morton wrote: > So what anticipatory scheduling does is very simple: if an application > has performed a read, do *nothing at all* for a few milliseconds. Just > return to userspace (or to the filesystem) in the expectation that the > application or filesystem will quickly submit another read which is > closeby. I'm sure this is a really dumb question, as I've never played with this subsystem, in which case I apologize in advance. Why not follow (by default) the old system where you put the reads effectively at the back of the queue. Then rather than doing nothing for a few milliseconds, you carry on with doing the writes. However, promote the reads to the front of the queue when you have a "good lump" of them. If you get further reads while you are processing a lump of them, put them behind the lump. Switch back to the putting reads at the end when we have done "a few lumps worth" of reads, or exhausted the reads at the start of the queue (or perhaps are short of memory). IE (with a "lump" = 20) and "a few" = 3. W0 W1 W2 ... W50 W51 [Read arrives, we process some writes] W5 ... W50 W51 R0 [More reads arrive, more writes processed] W10 ... W50 W51 R0 R1 R2 .. R7 [Haven't got a big enough lump, but a write arrives] W12 W13... W50 W51 W52 R0 R1 R2 .. R7 [More reads arrive, more writes processed] W14 W15 ... W50 W51 W52 R0 R1 R2 .. R7 R8 R9.. R19 [Another read arrives, after 4 more writes have been processed, and we move the lump to the front] R0 R1 R2 .. R7 R8 R9.. R19 R20 W18 W19 ... W50 W51 W52 [Some reads are processed, and some more arrive, which we insert into our lump at the front] R0 R1 R2 .. R7 R8 R9.. R19 R20 R21 R22 W18 W19 ... W50 W51 W52 Then either if the reads are processed at the front of the queue faster than they arrive, and the "lump" disappears, or if we've processed 3 x 20 = 60 reads, we revert to sticking reads back at the end. All this does is lump between 20 and 60 reads together. The advantage being that you don't "do nothing" for a few milliseconds, and can attract larger lumps, than by waiting without incurring additional latency. Now of course you have the ordering problem (in that I've assumed you can insert things into the queue at will), but you have that anyway. -- Alex Bligh
From: Andrew Morton Subject: Re: 2.5.59-mm5 Date: Fri, 24 Jan 2003 03:16:32 -0800 Alex Bligh wrote: > > > > --On 23 January 2003 19:50 -0800 Andrew Morton wrote: > > > So what anticipatory scheduling does is very simple: if an application > > has performed a read, do *nothing at all* for a few milliseconds. Just > > return to userspace (or to the filesystem) in the expectation that the > > application or filesystem will quickly submit another read which is > > closeby. > > I'm sure this is a really dumb question, as I've never played > with this subsystem, in which case I apologize in advance. > > Why not follow (by default) the old system where you put the reads > effectively at the back of the queue. Then rather than doing nothing > for a few milliseconds, you carry on with doing the writes. However, > promote the reads to the front of the queue when you have a "good > lump" of them. That is the problem. Reads do not come in "lumps". They are dependent. Consider the case of reading a file: 1: Read the directory. This is a single read, and we cannot do anything until it has completed. 2: The directory told us where the inode is. Go read the inode. This is a single read, and we cannot do anything until it has completed. 3: Go read the first 12 blocks of the file and the first indirect. This is a single read, and we cannot do anything until it has completed. The above process can take up to three trips through the request queue. In this very common scenario, the only way we'll ever get "lumps" of reads is if some other processes come in and happen to want to read nearby sectors. In the best case, the size of the lump is proportional to the number of processes which are concurrently trying to read something. This just doesn't happen enough to be significant or interesting. But writes are completely different. There is no dependency between them and at any point in time we know where on-disk a lot of writes will be placed. We don't know that for reads, which is why we need to twiddle thumbs until the application or filesystem makes up its mind.
From: Alex Tomas Subject: Re: 2.5.59-mm5 Date: 24 Jan 2003 14:23:58 +0300 >>>>> Andrew Morton (AM) writes: AM> But writes are completely different. There is no dependency AM> between them and at any point in time we know where on-disk a lot AM> of writes will be placed. We don't know that for reads, which is AM> why we need to twiddle thumbs until the application or filesystem AM> makes up its mind. it's significant that application doesn't want to wait read completion long and doesn't wait for write completion in most cases.
From: Andrew Morton Subject: Re: 2.5.59-mm5 Date: Fri, 24 Jan 2003 03:50:17 -0800 Alex Tomas wrote: > > >>>>> Andrew Morton (AM) writes: > > AM> But writes are completely different. There is no dependency > AM> between them and at any point in time we know where on-disk a lot > AM> of writes will be placed. We don't know that for reads, which is > AM> why we need to twiddle thumbs until the application or filesystem > AM> makes up its mind. > > > it's significant that application doesn't want to wait read completion > long and doesn't wait for write completion in most cases. That's correct. Reads are usually synchronous and writes are rarely synchronous. The most common place where the kernel forces a user process to wait on completion of a write is actually in unlink (truncate, really). Because truncate must wait for in-progress I/O to complete before allowing the filesystem to free (and potentially reuse) the affected blocks. If there's a lot of writeout happening then truncate can take _ages_. Hence this patch: Truncates can take a very long time. Especially if there is a lot of writeout happening, because truncate must wait on in-progress I/O. And sys_unlink() is performing that truncate while holding the parent directory's i_sem. This basically shuts down new accesses to the entire directory until the synchronous I/O completes. In the testing I've been doing, that directory is /tmp, and this hurts. So change sys_unlink() to perform the actual truncate outside i_sem. When there is a continuous streaming write to the same disk, this patch reduces the time for `make -j4 bzImage' from 370 seconds to 220. namei.c | 12 ++++++++++++ 1 files changed, 12 insertions(+) diff -puN fs/namei.c~unlink-latency-fix fs/namei.c --- 25/fs/namei.c~unlink-latency-fix 2003-01-24 02:41:04.000000000 -0800 +++ 25-akpm/fs/namei.c 2003-01-24 02:47:36.000000000 -0800 @@ -1659,12 +1659,19 @@ int vfs_unlink(struct inode *dir, struct return error; } +/* + * Make sure that the actual truncation of the file will occur outside its + * diretory's i_sem. truncate can take a long time if there is a lot of + * writeout happening, and we don't want to prevent access to the directory + * while waiting on the I/O. + */ asmlinkage long sys_unlink(const char * pathname) { int error = 0; char * name; struct dentry *dentry; struct nameidata nd; + struct inode *inode = NULL; name = getname(pathname); if(IS_ERR(name)) @@ -1683,6 +1690,9 @@ asmlinkage long sys_unlink(const char * /* Why not before? Because we want correct error value */ if (nd.last.name[nd.last.len]) goto slashes; + inode = dentry->d_inode; + if (inode) + inode = igrab(inode); error = vfs_unlink(nd.dentry->d_inode, dentry); exit2: dput(dentry); @@ -1693,6 +1703,8 @@ exit1: exit: putname(name); + if (inode) + iput(inode); /* truncate the inode here */ return error; slashes:
All Hail the Kernel Gods!
I Love Linux! But really though, 2.6 is looking to be a kick ass release. Most of the areas where 2.4 was weak are being fixed. Wonderful stuff!
different idea
seems to me the only way to avoid the seek storm and idling while waiting for the next read is to read ahead using something along the lines of a processor's branch prediction: from what i know of filesystems the _realistic_ choices for what is supposed the be read next (i.e. after a read()) are usually rather limited.
now that kind of prediction might waste more time than idling, and it'd be difficult to finetune how much to read ahead, but _i_ certainly know too little about the kernel code to decide how errr... uninformed my idea was :)
Some problems
Yeah you have a couple of problems here.
First is memory consumption. Reads fill memory. This means you can't do massive readahead.
Second is predicting reads. For non sequential files this really is _very_ difficult, and trying to do accurate prediction is an exercise in futility. Even applications themselves often have little idea about what they will read next. Processors at least have run the path before and have the knowledge the code hasn't changed.
memory consumption & prediction
yup, while i'd personally think of memory as a less valuable resource than time, i completely agree (as i mentioned) that prediction might waste more time than idling. though i hate having ideas dashed, i'm glad to have that suspicion confirmed, thanks (also to riel for pointing out times).
i'm currently digging my way into the kernel code, and while the code is easy enough to read, understanding which considerations went into the design is another matter altogether.
therefore i assumed that reading ahead like that might result in better performance in the same way as memory pooling (like apr does as opposed to massive amounts of subsequent malloc()s and free()s), or blocked reading (at the expense of some memory) might speed things up.
obviously at the kernel level, such things are influenced by more hardware-specific matters, of which, alas, i know too little.
Memory resource...
Well memory actually becomes a concern when thinking about servers especially. Say an FTP server. This is one problem that happened in 2.4 sometime.
You have say 1000 clients reading at 15K/s, 15MB/s. Thats not hard, but say most are reading different files, so to get decent streaming performance out of a disk in this situation, you'd like probably 1MB readahead for each file.
(At 200K readahead for each stream, a 20MB/s disk with 10ms avg seek time will spend half its time seeking - 10MB/s, not enough, 1MB gives you 17MB/s).
This means at any time you could have 1GB of readahead outstanding. In practice maybe 750MB. Once you get a few more clients and run into some memory pressure your readahead starts getting discarded and you start running into all sorts of trouble.
Anticipatory IO scheduling can get as good numbers with say 32KB readahead per file (~20MB readahead outstanding).
Thats how the theory goes, anyway.
who reads ahead here?
I'm just curious... in the FTP server example you mentioned above, you say that a readahead of a particular size per file would be maintained on the server... who does the read ahead though? I'm assuming the FTP server would, because the kernel would have no idea what the application (ftpd) is attempting to achieve. And if the FTP server is doing the read-ahead, can you configure how much readahead goes on?
The presumption is that the k
The presumption is that the kernel does the readahead, assuming the file is sent sequentially. Indeed, isn't this the type of application (FTP/HTTP server) for which sendfile() was invented?
Yep kernel
Yes the kernel has pretty good mechanisms for detecting sequential IO. This is about the only common read load you can detect and plan for ahead of time.
Re: Yep Kernel
Yes the kernel has pretty good mechanisms for detecting sequential IO. This is about the only common read load you can detect and plan for ahead of time.
Actually, the kernel currently has no mechanism for detecting sequential I/O.
mm/readahead.c
...
idling is no problem
Lets take your run-of-the-mill 7200 RPM disk. This disk turns 120 times a second, meaning you have on average 4.2 milliseconds rotational latency and a seek time of 7 to 10 ms average, less for nearby tracks. This means that reading far-away data should take at least 15ms, while reading nearby data should take at most 8ms.
CPUs, on the other hand, are extremely fast. For most small reads it should be possible to return the follow-up read within 1 millisecond. This means that anticipatory scheduling with idle time will in the very worst case make the disk a few percent slower, measurably slower, but not noticably slower. In the best case (which is also the normal case) it will make the workload faster.
idling might still be a problem?
True, this will be an improvement to the current situation. But the readahead idea still has me wondering if we can improve even more:
In the case that the next read will be directly adjacent, the CPU issues the next read after 1 millisecond. The disk still has to finish the rotation it just started, though, so we loose 4.2 milliseconds, more on notebooks.
If we instead read another n pages and add them as the least recently used, neither the disc, the CPU or the memory will pay much. But the user gains 3.2 milliseconds on the next read, ~75%, quite impressive.
Ok, now tell me why I am braindead and this trick doesn't work. ;-)
you are right and wrong
You are right in analysis, but actually 1, we _do_ do some readahead, and 2, drives have a track buffer so when it gets a read request it will read the entire track that sector is on.
What I'm curious about is...
How about the occasional case of non-dependent reads? If we're streaming a file out of an FTP server (eg. w/ sendfile or similar), or requesting a bunch of reads via an aio interface, we don't have the dependent-read problem. Given that, does it make sense to treat these reads in the traditional fashion, and only perform the idling trick for reads that directly resulted in the task blocking?
Also, would there be any value to letting the task hint when a given read is the last read in a chain of dependent reads?
My guess, from the comment above, is that the most you'd save is that 'few percent'. Thoughts?
We do
we do have the dependant read problem. Readahead might mask it but we still have to send reads in a specific order. There can be non dependant reads however. We have a mechanism for detecting them and stopping anticipation if we get any. It just isn't in mm yet!
Nick
are they ignoring Knuth?
I know DE Knuth is slightly out of date, but he showed that an adaptive double buffer algorithm can do O(log n) reads even for non-predictive data. Under some circumstances, anticipation looks like it could improve performance, and I've done some simulations which suggest it does.
However, I think there are some race-conditions which could cause anticipatory lockups under heavy read conditions (eg - multithreaded databse queries).
What is the algorithm?
What is Knuth's algorithm? And I don't know what you are referring to that is O(ln N), but its not the algorithm cost that we're worried about, its being able to properly predict reads and service them without using a lot of memory.
Also what reace conditions do you think there are?
this sounds great but...
This looks like an interesting and pragmatic way to solve the problem, it's counter-intuitive to me but it makes sense and the benchmarks are in agreement. My only question is, does this (once all the rough spots and problems are worked out) have any chance of making it into 2.5 mainline?
Yes
You could say it is a new feature, but it is also a performance bugfix to the code which is already in mainline. It is having some obscure problems at the moment, but once it is very well tested, debugged and benchmarked, I think Linus would be mad not to. I think for a lot of disk bound workloads, it will be the single biggest performance in 2.5
Some disperse thoughts
I suppose the process wich sent the request must continue running during that "do *nothing at all* for a few milliseconds" (otherwise, how could it send another request?); i.e., the kernel has to tell the process that the data has been read and is available for use (even if it is not).
Then, what happens if that process tries to use such data inmediately? Perhaps there is some kind of page-fault that blocks the process until the pending read has really ended?
Now, I do not understand the "few milliseconds" pause. Why cannot the kernel just enqueue the request and let the process continue?. This could be somethink like asynchronous reads for synchronous processes, where synchronizations happens at kernel level and only when data is strictly needed.
Just some disperse thoughts from a newbe, please excuse me if I am too naive.
You got the order of things w
You got the order of things wrong. First the kernel reads the requested block and delivers it to the process. Then the disk does nothing (the kernel doesn't send any more requests) for a few milliseconds, while the application processes the data, and figures out that it needs another block. If it does, the head is already in place, and the data is read very fast. If it doesn't need another block, I/O has been delayed a minimal amount of time, much less than the time saved in the first case.
So, on average, time is saved, and the disk needs fewer seeks, thus reducing the wear on the drive mechanics at the same time.
What about reading changed data?
I'm not even experienced enough to call myself a kernel newbie, but one thing jumps out at me:
When moving the reads to the beginning of the queue, instead of appending them to the end, what happens when one of the writes in the queue will change the data that's being read?
Not an issue.
I'm pretty sure the read won't get scheduled in that manner. All the read I/O requests are for accesses that missed the block cache. All the write requests presumably are for blocks being purged from the block cache. As I understand it, blocks queued for writeout aren't removed from the block cache until after they're written out. Therefore, a read request that comes in for a block that happens to be queued for writeout would still be a "hit", and thus would not be inserted on the I/O queue.
Many thanks
Now everything makes sense for me.
This might deteriorate performance for databases
In a database that does mostly reads, I think the suggested scheme will deteriorate performance. In such a database, you will have several reads in the request queue at any one time, and the application will usually know the amount of data it needs to read. There is no correlation between the location of the next read by a thread and the previous read, so you could just as well start servicing a completely different read as opposed to waiting for the next read from that process.
What I would like to see is a way to detect this scenario. Two features of a system where the proposed scheme would not work are:
In general I think the scheme would work, but it has to detect when the scheme does not work.
Just because you put reads on
Just because you put reads on the "fast track" to the front of the queue and you hold off a stream of writes does not mean you disable the elevator for reads. It just means that writes are temporarily throttled when a random read request comes in.
Reads starve writes
Normally writes being much slower than reads cause starvation of the reads. With the anticipatory scheduler my first set of contest results (sent to lkml) show that for the first time, reads are progressing so unabated that writes are now being starved. Having said that, the anticipatory i/o definitely does virtually fix the problem of writes starving reads, so I assume the next step needs to be taken now to prevent the new problem.