Linux: Benchmarking Scheduling Latency

Submitted by Jeremy
on October 1, 2003 - 6:27am

Rick Lindsley recently uploaded some interesting comparisons of scheduler latency between 2.6.0-test5 [story], -test5-mm3 [story], -test6 [story], and -test6-mm1 [story] with four different benchmarks. The benchmarks used were Volanomark, SPECjbb, SPECdets, and the Kernbench kernel compilation scripts. His summary looks at how performance has changed, how scheduler behavior has changed, and how these changes affect the different benchmarking tools used. He explains:

"High latency would usually indicate congested runqueues. High runslices generally indicates workloads that were cpu-bound. Different benchmarks have different "normal" behavior, however. Although results were gathered, most benchmarks were run in an abbreviated manner to see trends and characteristics rather than run full out, fully tuned, to get valid test results."

Rick summarizes, "Conclusion: test6 is generally as good as test5 unless you're running Volanomark -- then it's definitely worse." This is especially interesting as the Con Kolivas [interview] scheduler interactivity patches [forum] were merged into -test6. Read on for Rick's full explanation of the results.


From: Rick Lindsley [email blocked]
To:  linux-kernel
Subject: Scheduling latency summary
Date: Mon, 29 Sep 2003 17:45:06 -0700

I applied the schedstats patch to some recent releases and, with the help
of Steve Pratt, ran some benchmarks.  There's a lot of focus lately on
improving interactivity, and to me that seems directly related to how fast
a process can move from the run queue to the processor.  For this summary,
I'll call a "run slice" the period of time a task gets to run before it
voluntarily OR involuntarily leaves the processor.  "Latency" will be
the time between entering a runqueue and actually landing on a processor.
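
To make the two definitions above concrete, here is a minimal sketch in
Python. It is not part of the schedstats patch: the Stint record, its field
names, and the sample trace are assumptions made up for illustration, and
real schedstats output looks different.

```python
from dataclasses import dataclass


@dataclass
class Stint:
    """One trip of a task through the scheduler (all times in ms)."""
    enqueued: float  # task entered a runqueue
    started: float   # task actually landed on a processor
    stopped: float   # task left the processor, voluntarily or not


def latency(s: Stint) -> float:
    # Time spent waiting on the runqueue before getting a processor.
    return s.started - s.enqueued


def run_slice(s: Stint) -> float:
    # Time the task ran before leaving the processor.
    return s.stopped - s.started


def averages(stints):
    """Return (average latency, average run slice) over a list of stints."""
    n = len(stints)
    return (sum(latency(s) for s in stints) / n,
            sum(run_slice(s) for s in stints) / n)


# Hypothetical trace, not measured data.
trace = [Stint(0.0, 2.0, 12.0), Stint(12.0, 16.0, 20.0)]
avg_latency, avg_slice = averages(trace)
```

High average latency in such a trace would point at congested runqueues,
and high average run slices at cpu-bound work.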

Using the schedstats patch, I took comparative measurements on -test5,
-test5-mm3, -test6, and -test6-mm1.  It's not only interesting to note
whether the benchmark improved, but how the scheduler behavior changed
(and differs between the different benchmarks).

High latency would usually indicate congested runqueues. High runslices
generally indicates workloads that were cpu-bound.  Different benchmarks
have different "normal" behavior, however.  Although results were
gathered, most benchmarks were run in an abbreviated manner to see trends
and characteristics rather than run full out, fully tuned, to get valid
test results.

Graphs can be viewed at http://eaglet.rain.com/rick/linux/schedstats/graphs/

Volanomark:
    test6-mm1 has, in general, about 15% higher latencies and about
    25% higher runslices than in test5.  Volanomark is known to
    be pathological with regards to repeatedly and quickly calling
    sched_yield at times with some implementations of Java.  The version
    I tested exhibits this spectacularly.  What's interesting to note is
    that it appears we're both waiting a bit longer to do the spinning
    as well as taking a bit longer to do it in test6-mm1.  Unlike most
    benchmarks, both run slices and latencies tend to live in the ns
    range, probably due to the rapid spinning.  These test results
    declined in test6 by over 5%.

SPECjbb:
    As we move from small warehouses to larger warehouse
    runs we see us moving from low-latency/high-runslices to
    high-latency/low-runslices. Both test6 and test6-mm1 are showing
    about a 40% reduction in latency over test5, with only a slight
    reduction in runslice times (generally less than 5%).  Not surprisingly,
    test6 showed slightly better results when under heavier load.

SPECdets:
    It's hard to see a pattern because the run utilized is generally short
    (under 5 minutes).  More frequent samples of the scheduler statistics
    might help.  In general, both test6 and test6-mm1 are comparable to
    test5 in terms of runslices and latencies.  Runslices are very small,
    generally less than 3ms, indicating these tasks do not run very long
    before leaving the processor. Test results showed slight degradation
    at the low end but slight improvement at the high end.

Kernbench:
    We're all over the board, but basically no change. Both latencies and
    run slices tend to hover between 10 and 20 ms, suggesting moderate
    congestion but not major.  This can change depending on what -j you
    run make at.

Conclusion: test6 is generally as good as test5 unless you're running
Volanomark -- then it's definitely worse.

Rick


Interactivity comments

Con Kolivas
on
October 1, 2003 - 7:09am

Obviously I feel obliged to comment. It is nice to see that overall scheduling latency is virtually unchanged with all the scheduler changes. However this is not surprising really as the structure of the scheduler has not been significantly modified. What this doesn't really tell us, though, is whether the latency of interactive tasks has decreased and that of cpu bound tasks has increased, which is what most of the changes are directed towards doing.

To create benchmarks to measure this you first have to define what an interactive task is, and then measure its scheduling latency - insert long discussion about what an interactive task is here -. The point is, the scheduler tweaks have tried to determine what an interactive task is, and decrease the scheduling latency of these interactive tasks. So, if we used the same metrics used to find interactive tasks and then measured their scheduling latency it would be a self-fulfilling benchmark showing that it worked (not much point in doing that). That leaves us with starting audio apps to see if they skip, and grabbing a window and dragging it across the screen to see if it moves smoothly.

Finally, the volanomark drop in performance is unimportant as it uses the sched_yield() command via the java run time environment which is going to be of detriment to performance where well coded apps would use futexes.

test6

Hiryu
on
October 1, 2003 - 10:01am

I got test6 working with my video card drivers last night so I was finally able to run a 2.6 kernel on my desktop.

In some ways it feels snappier, but not completely. I notice that dragging _any_ window around the desktop is not as smooth as it could/should be.
As in, it's much less jerky in 2.4.

Con, anything I should keep an eye out for? I have dual CPUs, is that of any interest?

Multifactorial

Con Kolivas
on
October 1, 2003 - 5:24pm

The scheduling latency per runqueue is allowed to be more relaxed when there are more cpus because adding cpus tends to decrease latencies by having another cpu available. Even so, not everything is the cpu scheduler. Often major infrastructure changes elsewhere mean it's impossible to point to one thing only. The biggest giveaway that it isn't the cpu scheduler is if moving a window is still less smooth when the machine is otherwise completely idle, since X will be the only thing being scheduled cpu wise.

mouse sample rate?

Anonymous
on
October 2, 2003 - 3:12pm

Have you tried test5?

test6 decreased the default mouse sample rate, and that was very obvious to me (made *everything* done with the mouse feel jerky and strange).

Setting samplerate in XF86Config didn't seem to have any effect, so I changed the default in the kernel to 200Hz instead and now my mouse is smooth. I think it can be changed with a simple kernel/module parameter but I didn't bother to find out how...

hardware - software mouse

Con Kolivas
on
October 2, 2003 - 4:57pm

Ah yes, very good point. The mouse changed from hardware to software and back again (or was it the other way around) during 2.6 development and this substantially changes the feel of mouse movement. This is also why some people find the mouse moves incredibly fast with some kernels. Check the recent lkml archives.

Kernel 2.6 mouse in Debian

Anonymous
on
December 18, 2003 - 1:27pm

To fix the fast mouse problem in Debian, run:

sudo dpkg-reconfigure xserver-xfree86

When it asks, "Please choose your mouse port", select "/dev/input/mice". Now it will work.

Previously, dpkg would configure two mice: one based on the answers you give it and the other always "/dev/input/mice". Kernel 2.6 pipes mouse data from, for example, PS/2 mice, to /dev/input/mice. Under the old configuration, this means the xserver is getting sent the mouse data twice, hence the fast mouse. If you select /dev/input/mice, dpkg won't make a second mouse.

why is audio playback considered an interactive task?

Anonymous
on
October 1, 2003 - 10:40am

Why do you consider audio playback an interactive task, and not a real-time task?

To me, it sounds much more real-time than interactive.
I don't interact with it, I _hear_ it, and because of the timing, rhythmic, and temporal characteristics of sound, when it skips or slows it sounds awful. Basically what I'm trying to point out (badly) is that music and video players should announce themselves as soft real-time tasks/processes, and use a real-time run queue.

Interactivity is like browsing the web, clicking the mouse button, moving the pointer, resizing/moving a window, changing desktop, scrolling a page... The delay when editing text, low latency when using an IDE, a text editor, a chat program, or anything that involves user input and machine output...

But, as you said, classifying what is and isn't interactive could become a holy war.

But I do feel that music and video should be classified as soft real-time thingies.

So the good way to make xmms not skip would be for xmms to tell the OS (via some syscall?) that it is a real-time task, so that the OS (kernel) would put it in a real-time run queue instead of the regular generic run queue...

Miguel Sousa Filipe
What's your opinion on this, Con?
Congrats for the good work on the scheduler tuning.

Interactive task

Con Kolivas
on
October 1, 2003 - 1:52pm

Why complain about semantics? Define interactive however you want. The scheduler is tuned to avoid long latencies in tasks where latency matters - whether you call them interactive... who cares?

Basically audio will have much lower latencies now, so they are very soft RR. Since audio apps wake up at most once every 50ms (I measured it), and when they do they simply dump some more data to the audio card, latencies of less than 10ms are more than adequate.

And thanks for the congrats :)

10msec in audio apps is not adequate!

Anonymous
on
October 4, 2003 - 6:18am

Hi Kevin,
please consider running my latencytest tool when analyzing scheduling latencies of 2.6.
Keep in mind that software synth programs need lower than 10msec latencies. Around 3msec is considered "at par with professional hardware synths".
Kernel 2.4 with the proper low latency patches is able to achieve 3msec latencies when using the latencytest benchmark.
It really simulates one of the worst cases that can happen when using a softsynth: you play lots of voices, so the CPU usage can go up to 80% and higher, while you could be running some hard disk recording software that puts a high load on the disk subsystem, e.g. reading and writing very large files from/to disk.
To the kernel folks: please make sure that the new 2.6 kernels perform well with latencytest using 3-4msec audio buffers. latencytest may be a crappy benchmark but it is really a real world simulation since those situations can happen when playing high cpu demanding softsynths.
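
For readers wondering where a figure like 3msec comes from, it is just buffer arithmetic: one buffer of audio lasts frames / sample rate, and the scheduler has to get the app back on the cpu before that buffer drains. A rough sketch in Python follows; the 128-frame buffer and the 44.1 kHz sample rate are assumptions picked for illustration, not settings taken from latencytest.

```python
def buffer_duration_ms(frames: int, sample_rate_hz: int) -> float:
    """How long one buffer of audio lasts, in milliseconds."""
    return 1000.0 * frames / sample_rate_hz


def fits_budget(frames: int, sample_rate_hz: int, worst_latency_ms: float) -> bool:
    """True if the worst-case scheduling latency is shorter than one buffer."""
    return worst_latency_ms < buffer_duration_ms(frames, sample_rate_hz)


# Assumed example values: 128 frames at 44.1 kHz is roughly 2.9 ms of audio.
small_buffer_ms = buffer_duration_ms(128, 44100)

# A player that refills every 50 ms can live with a 10 ms worst case;
# a softsynth on the small buffer above cannot.
player_ok = fits_budget(2205, 44100, 10.0)
synth_ok = fits_budget(128, 44100, 10.0)
```

Under these assumptions a 10msec worst case is fine for a player that wakes every 50ms but overruns the small softsynth buffer more than three times over.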

this was generated on P133 with a 2.2.10 + lowlatency patch:
http://www.gardena.net/benno/linux/audio/2.2.10-p133-3x128/3x128.html

We audio folks need the same performance from the 2.6 kernel, otherwise Linux will not be able to compete in the real time audio field.

get the latencytest benchmark from here:
http://www.gardena.net/benno/linux/audio/

let me know

cheers,
Benno

10msec in audio apps is not adequate!

Anonymous
on
October 4, 2003 - 6:23am

I forgot to say this: of course all audio apps that need very low latencies (e.g. 3-5msec) run with the SCHED_FIFO policy. I'm not sure whether the scheduler algorithms of kernel 2.6 ignore all this "interactivity stuff" in that case; I assume they fall back to regular first-in, first-out scheduling regardless of whether the task is marked as interactive.

Benno

RT scheduling different

Con Kolivas
on
October 4, 2003 - 7:32am

The real time scheduling in 2.6 is completely unaffected by any of the interactivity work. If you really need real time performance then you must use real time scheduling. The intrinsic latency of 2.6 using rt scheduling is better than 2.4 with preempt and low latency patches applied in virtually all settings. Check the lkml archives to see akpm's post about 2.5 latencies being lower than ever before.
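
As a rough illustration of what "use real time scheduling" means from userspace, the sketch below asks for the SCHED_FIFO policy through Python's os module, which wraps the sched_setscheduler() system call. The helper name try_sched_fifo and the priority value 10 are assumptions for the example; the request normally needs root or an equivalent privilege, so the sketch reports failure instead of crashing.

```python
import os


def try_sched_fifo(priority: int = 10) -> bool:
    """Ask for SCHED_FIFO for the calling process; return True on success.

    The priority of 10 is an arbitrary example value, not a recommendation.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # platform without the POSIX scheduling API
    try:
        # pid 0 means "the calling process".
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False  # typically EPERM when run without privileges
    return True


got_realtime = try_sched_fifo()
```

An audio app would make this request once at startup and fall back to normal scheduling if it is refused.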

The 10ms figure is for normal scheduling in the worst case scenario when multiple interactive tasks decide to use bursts of cpu at exactly the same time when they are at their highest boost level. Most of the time it will be less than 10ms.

Oh yeah and I've been called by the wrong name often since Con appears not to be used in America but never Kevin :-P. I think Constantine which is shortened to Con in Australia is shortened to Gus in America (I prefer Con).

Subjective benchmarking

Anonymous
on
October 1, 2003 - 12:37pm

Well, whatever the benchmarks may say, test6 is working beautifully for me.

My poor 500 MHz laptop could hardly get xmms playing without skipping on 2.4 - when xmms was the only app running... Test5 was slightly better - but with test6 xmms doesn't skip a beat even with four compilations and updatedb running, while at the same time resizing mozilla like crazy... And under this load, browsing, writing etc. doesn't *feel* slower at all, even when top says 100% cpu use. It's almost magic.

So, huge thanks to Con (et al.)!

Applications partially to blame?

Anonymous
on
October 1, 2003 - 4:10pm

"Finally, the volanomark drop in performance is unimportant as it uses the sched_yield() command via the java run time environment which is going to be of detriment to performance where well coded apps would use futexes."
I've actually asked this in another post, but how much of all the "bad" behaviour is caused at the application level?
An example is X: doesn't this have its own scheduler? That seems like cruft to me (I think I remember reading that, maybe I was drunk :) ). I guess what I'm asking is: does the kernel really matter compared to the poor idioms/coding used at the application level?

Of course

Con Kolivas
on
October 1, 2003 - 5:16pm

Well that is absolutely true, and the most important part of interpreting benchmarks. This is why a generic solution in the kernel for poorly coded applications is a bad idea. If the developers of poorly coded userspace applications are made aware of their bad choices they can improve them. If the kernel just works around them (and this costs the kernel in other ways) we lose all the time. Good tools (eg futexes) to help good userspace coding are more important.

how is a good app coded

Anonymous
on
October 2, 2003 - 10:41am

Currently I am working on some apps that don't use X (they use fbdev/dri). I am using threads (NPTL) and epoll for waiting for data. What is the best design for an app that needs the best response from the scheduler? I want the user to feel comfortable; am I on a good track?

Some time ago I read something you wrote here (kernel_trap) about having one thread just doing I/O and another working with the data; is that still valid?

thanks for answering Con.

-solca

Application coding

Con Kolivas
on
October 2, 2003 - 5:51pm

Alas I can't claim to be a good application coder (or even any sort of application coder). Fortunately you don't need to do anything specific for the scheduler to treat you as interactive (that was my job). All I can tell you is what I've seen done poorly from the scheduler's perspective. Using sched_yield() I've already mentioned - it's not a good way to go to sleep or wait unless you really don't care when your process gets cpu time again. The other mistake is being busy during waiting - especially waiting on some other task and polling it repeatedly with short time outs - this can lead to priority inversion if you are higher priority than the task you are waiting on. The worst cases we found during testing had select() timeouts of 15ms.
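
To illustrate the difference between the two waiting styles, here is a small Python sketch. The function names and the use of threading.Event are illustrative choices, and the 15ms timeout simply mirrors the select() figure above. The polling waiter wakes up again and again just to find nothing has happened, while the blocking waiter sleeps until it is signalled.

```python
import threading


def wait_by_polling(event: threading.Event, timeout_s: float = 0.015) -> int:
    """Re-check every timeout_s seconds; return the number of futile wakeups."""
    futile = 0
    while not event.wait(timeout_s):
        futile += 1  # woke up, found nothing to do, going back to sleep
    return futile


def wait_by_blocking(event: threading.Event) -> int:
    """Sleep until signalled; there are no futile wakeups to count."""
    event.wait()
    return 0


# Usage: another thread signals the event about 100 ms later.
ready = threading.Event()
threading.Timer(0.1, ready.set).start()
wasted_wakeups = wait_by_polling(ready)
```

Every futile wakeup is a trip through the scheduler that achieved nothing, which is the cost the blocking version avoids.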

I'm pretty sure I didn't suggest a specific way of separating threads according to what they do, but it does make sense as the scheduler is more likely to treat each thread appropriately if it only does one type of thing.

One last thing is not to use fbcon which can cause very high latencies because it works with interrupts disabled. Read akpm's summary here: Things not to do.

Replacing sched_yield() ....

Anonymous
on
October 10, 2003 - 3:32am

I've recently written a small daemon using the NPTL under the latest RH9.0 kernel with the following architecture ...

TCP Input thread -> (one of n data q handlers) -> TCP Output thread

Where n is supposed to correspond to the number of CPUs in the system.
The issue I have is that, to achieve low latency, after the input thread has placed some data onto a queue it needs to call sched_yield() to let the q handlers get a decent look in. Otherwise the input handler keeps reading data and filling the queues up to the q size limit, including across system calls to the TCP sockets layer, and only hands off to the q handlers when the queues are full and it's forced to wait on the q-full condition variable.

The incoming data is typically bursty, from many different streams, but also interactive, so reasonable length queues are needed to allow the networking side of things to run at full steam - any advice?

one idea

Anonymous
on
October 10, 2003 - 3:58pm

Use mutexes. They should be implemented using futexes, so they should be quite efficient. You'll get much more control over scheduling too.
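
A minimal sketch of that idea, in Python rather than C for brevity: a bounded queue built on a mutex and condition variables wakes a sleeping handler as soon as an item is put, so the input side never needs to call sched_yield(). The function name run_pipeline, the queue depth, and the doubling step that stands in for real processing are all assumptions for illustration; this is not the poster's daemon.

```python
import queue
import threading


def run_pipeline(items, n_handlers=2, depth=4):
    """Feed items through n_handlers worker threads via a bounded queue."""
    work = queue.Queue(maxsize=depth)  # bounded, like the q size limit
    done = queue.Queue()
    stop = object()                    # sentinel telling a handler to exit

    def handler():
        while True:
            item = work.get()          # sleeps until the input side puts
            if item is stop:
                break
            done.put(item * 2)         # stand-in for real processing

    threads = [threading.Thread(target=handler) for _ in range(n_handlers)]
    for t in threads:
        t.start()
    for item in items:                 # plays the role of the TCP input thread
        work.put(item)                 # blocks only when the queue is full
    for _ in threads:
        work.put(stop)
    for t in threads:
        t.join()
    return sorted(done.get() for _ in range(len(items)))


results = run_pipeline([1, 2, 3, 4, 5])
```

Each put notifies a waiting handler through the condition variable, which is the kind of explicit hand-off that yielding only approximates.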

Re: Interactivity comments

Rick Lindsley
on
October 3, 2003 - 3:26pm

While the structure of the scheduler has not been significantly modified, I think anyone who has modified the scheduler at all has found (sometimes the hard way) that the most trivial of modifications can create the most monstrous of changes -- and often not in the area you intended at all. So even small changes should be examined closely.

The purpose of introducing latency statistics was twofold: one was a new way to characterize popular benchmarks. "Are you trying to decrease latency? Then make sure you try your changes against benchmarks A and B, because they have high latencies too."

A second purpose was to try to put a quantity on what has otherwise been a very subjective measurement. "When I drag this window it seems slower", or "When I play this wav file, it skips a little more" is a helpful observation but is too subjective to effectively demonstrate progress or regression. In addition, if it adversely affects some other workload, it's hard to say why. My theory is that interactive operations such as those *probably* suffer from too long a time passing between some event (a mouse click, or I/O completion) and their actually hitting the processor. Latency should measure that directly, and also allow us to see if new patches actually improved that. The patches I used to gather these statistics can also be applied per-process, and that capability may be very useful in directly measuring whether a patch aids xmms or X, and in turn choosing or developing a benchmark which mimics that.

All that said, what Con said about Volanomark is correct. Volanomark, in combination with some Java libraries, is known to utilize a pathological algorithm which involves grabbing and yielding the processor very quickly. Most people who work on the scheduler acknowledge this is unlike any other application, so while it is of note to benchmark runners, it probably is uninteresting to 99.9% of the other Linux users. The fact that his changes benefit one class of users while not hurting another class is important.

And ULE?

durdin
on
October 1, 2003 - 2:04pm

What about the ULE scheduler? I used 2.6.0-test5 with the ULE patch for 12 days and noticed far more aggressive behaviour than with Con's patch set. At the same time ULE seemed to be generally more "kind" with "-10_niced_X-Server (tm)" while Con's scheduler spoiled the whole interactivity with that - it's the only "but". Now I use test5 + Con's patch set and XServer's priority set to 0, as is recommended. My feelings about that? Well, I prefer Con's solution and find it the right way to achieve a smoothly working desktop. Good work Con!

The "nice" debate

Con Kolivas
on
October 1, 2003 - 2:16pm

Ah yes where would this discussion be without some comment about nice? Well as you're all aware, -nice values mean "do not be nice to tasks lower priority than me". The fact that the scheduler couldn't improve the performance previously without renicing X was a limitation, and the distributions doing this by default were really using ugly hacks. I don't understand why distributions should continue using ugly workarounds just because that's how it was done in the past. Whether you think audio that is ten nice values higher should still not skip or not is also semantics. If you never do it, and -nice is reserved for rare things like superuser trying to get access to a box that is under attack, it's not a problem.

The ULE scheduler patch that made it to Linux was just a primitive form of the original port of what went from Linux to FreeBSD, was modified there, and came back again (a port of a port). It had nothing like the testing and tuning that my patches did. I can't comment on FreeBSD interactivity fairly as I haven't tested it to the extent I've tested the Linux interactivity.

Well, I'm still curious about

durdin
on
October 1, 2003 - 3:06pm

Well, I'm still curious about benchmarking the ULE. And I don't care if it's just a primitive port of a port, because it uses a different approach to accurately choosing interactive processes - a simple and elegant solution may sometimes appear primitive to us. What I'm afraid of is a vision of a scheduler that perfectly suits most common tasks but messes up in the case of one stupid (or malicious) user setting all his (two thousand) processes to randomly selected 'nice' values.

Useful data

Con Kolivas
on
October 1, 2003 - 5:11pm

Oh yes I have no objection to and feel we should be benchmarking and measuring as many things as possible provided the data is interpreted appropriately. I was just explaining the status of the ULE patch.

I still don't see the need

Anonymous
on
October 3, 2003 - 10:03am

Why can't you people just use priorities for xmms, etc instead of adding more junk to the scheduler?

Tsk Tsk

Anonymous
on
October 3, 2003 - 1:48pm

Now now Nick, just because your patch was rejected doesn't mean you should go around trying to discredit Ingo and Con's work.

Hey!

Nick
on
October 3, 2003 - 4:47pm

That wasn't me. I think it would be crazy to have to renice xmms, considering it might use max 2% CPU on modern boxes.

You might be confusing that with my line that X should be reniced: if you want a process (X) to continually get 75%+ CPU and keep good scheduling latency while other processes are trying to run, nice is the obvious answer.

And... umm, most people seem pretty happy with Con's work, so that's great. I prefer my scheduler, and a couple of others might as well, so we use it. I was actually the one who was pushing to have Con's patches included in -linus!

Window manager scheduling

mp
on
October 5, 2003 - 10:36am

Con and others,

I'm new to the 2.6 kernel and overall, I must say that it is visibly faster and I'm quite happy with it. I have run into what must be a common problem that I have only seen described once on the LKML and I was wondering if anybody could comment on it.

It seems that keybindings in the Sawfish window manager are wildly erratic. Sometimes, I'll strike a key binding and the action will be performed immediately. Other times, it may be as much as ten seconds before the action is performed. I'm on 2.6.0-test6 right now. Is this a scheduler issue? Is it something that would be best to take up with the Sawfish folks?

Keyboard issue?

Con Kolivas
on
October 5, 2003 - 4:48pm

I doubt very much that it's a cpu scheduler issue even if your machine is absolutely drowning under some sort of load at the time. There were some recent comments on lkml about keyboard repeat and other keyboard issues so maybe they're related.

Surprise: no surprises!

KiTaSuMbA
on
October 6, 2003 - 9:52pm

I've been using 2.6.0-test5-mm4 since day1 or 2 of its release (given that I actually have things to do, I just can't afford running behind every release). Before that I've been very faithful to Con's tree since .4.18. Although this cannot possibly be viewed as a "benchmark" or even an objective opinion, here are my impressions:
With all the talk about the Greatness of 2.6 I was expecting dramatic changes. I found out that this would be true for moving away from vanilla 2.4 kernels but not quite a surprise for someone used to pretty patched-up ones like the -ck series. However, I did notice a difference in when exactly the interactivity management goes critical and risks disappointing you: while the -ck ones seem to have the most trouble during long and heavy disk I/O, -test5-mm4 proves to be more "touchy" when the cpu gets near 100%. The latter is more noticeable during streaming multimedia (mplayer or xmms) while "strict" interactivity (popping up menus and mouse highlighting over them) feels simply UNBREAKABLE on the 2.6 kernel.
And now for some genuine Elitism (TM):
I agree that self-criticism is absolutely healthy to get linux even better and that "self-grooming" is not only useless but even pitiful, but I can't help relating what I experienced the other day on a friend's box. Give windows XP a PITA job (like uncompressing a cd image) and you can kiss your machine goodbye until it finishes and go get some coffee! And, no, I don't consider an otherwise idle 2.5 P4 box a "low resources" one. So yes, with regards to desktop interactivity we are DEFINITELY there. Hoping that certain nuisances (like cdrecord not seeing my SCSI burner despite sg being loaded and the dev present - ouch!) get resolved soon enough, 2.6 is "straight to the point".

Cheers!
