So right now the results don't seem too bad to me - the higher
overhead comes from the two threads running on two different cores and
incurring the overhead of cross-core communication. In a truly
spread-out workload that synchronizes occasionally you'd get the same
kind of overhead, so in fact this behavior is more informative of the
real overhead, i guess. In 2.6.21 the two threads would stick to the same
core and produce artificially low latency - which would only be true in
a real spread-out workload if all tasks ran on the same core (which is
hardly what you want with OpenMP).
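
To make the same-core vs. cross-core difference concrete, here is a
minimal sketch of a two-thread ping-pong over a pair of pipes. It is not
the test discussed in this thread, only an illustration: the names
pingpong_ns() and struct echo_fds are made up, the pipe-based handshake
is an assumption standing in for whatever synchronization the real
workload uses, and the numbers it produces depend entirely on the
machine and kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper type: the pipe ends used by the echo thread. */
struct echo_fds {
    int rfd; /* bytes arriving from the main thread */
    int wfd; /* bytes going back to the main thread */
};

/* Echo every byte straight back until the main thread closes its end. */
static void *echo_thread(void *p)
{
    struct echo_fds *fds = p;
    char c;

    while (read(fds->rfd, &c, 1) == 1)
        if (write(fds->wfd, &c, 1) != 1)
            break;
    return NULL;
}

/*
 * Hypothetical benchmark: pin the caller to cpu_main and an echo thread
 * to cpu_echo, bounce one byte back and forth `rounds` times, and return
 * the average round-trip time in nanoseconds (or -1.0 on any failure).
 */
static double pingpong_ns(int cpu_main, int cpu_echo, int rounds)
{
    int to_echo[2], from_echo[2];
    cpu_set_t old, set;
    pthread_attr_t attr;
    pthread_t tid;
    double result = -1.0;
    int have_old;

    assert(rounds > 0);
    if (pipe(to_echo) != 0)
        return -1.0;
    if (pipe(from_echo) != 0) {
        close(to_echo[0]);
        close(to_echo[1]);
        return -1.0;
    }
    struct echo_fds fds = { to_echo[0], from_echo[1] };

    /* Remember the caller's affinity so it can be restored afterwards. */
    have_old = pthread_getaffinity_np(pthread_self(), sizeof(old), &old) == 0;

    /* The echo thread starts out already pinned to cpu_echo. */
    pthread_attr_init(&attr);
    CPU_ZERO(&set);
    CPU_SET(cpu_echo, &set);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    CPU_ZERO(&set);
    CPU_SET(cpu_main, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0 &&
        pthread_create(&tid, &attr, echo_thread, &fds) == 0) {
        struct timespec t0, t1;
        char c = 'x';
        int ok = 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < rounds && ok; i++)
            ok = write(to_echo[1], &c, 1) == 1 &&
                 read(from_echo[0], &c, 1) == 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        close(to_echo[1]); /* EOF tells the echo thread to exit */
        to_echo[1] = -1;
        pthread_join(tid, NULL);
        if (ok)
            result = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                      (t1.tv_nsec - t0.tv_nsec)) / rounds;
    }

    if (have_old)
        pthread_setaffinity_np(pthread_self(), sizeof(old), &old);
    pthread_attr_destroy(&attr);
    close(to_echo[0]);
    if (to_echo[1] >= 0)
        close(to_echo[1]);
    close(from_echo[0]);
    close(from_echo[1]);
    return result;
}
```

Build with something like "gcc -O2 -pthread". Calling
pingpong_ns(0, 0, 100000) and then pingpong_ns(0, 1, 100000) gives a
rough same-core vs. cross-core comparison, assuming CPUs 0 and 1 are
both in the allowed mask and sit on different cores; check the topology
first, since on some machines they are SMT siblings of one core.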

In any case, if i misinterpreted your numbers or if you just disagree,
or if you have a workload/test that shows worse performance than it
could/should, let me know.

Ingo