Did some experimenting with gcc's -mtune option.
In particular, I wanted to see how well code scheduled
for pentium4 ran on non-p4 systems. I was quite surprised
that my mobile athlon-xp ate it up pretty well,
despite the P4 tuning doing no alignment at all. (Because the P4's
trace cache removes the decoding stage from the pipeline
for frequently executed code, such as that found in the
inner loops of most benchmark-type programs, loop
alignment isn't as important as it was on earlier Intel CPUs.)
I did a further experiment using
-falign-loops=16 -falign-jumps=16 -falign-functions=16
with -mtune=pentium4, expecting the Athlon to get noticeably
better. Surprisingly, it didn't make that much difference.
Finally, I did a test with -mtune=athlon-xp, and got
slightly better results, but again, mostly lost in the noise.
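
For anyone wanting to repeat the comparison, here's a rough sketch of the
three builds described above. The source file name (bench.c), the output
names and the -O2 level are placeholders of mine, not what the original
runs used, and the script only prints the gcc command lines rather than
running them.

```python
# Sketch: build the gcc command line for each tuning variant in the text.
# "bench.c", the "bench-<name>" outputs and -O2 are assumptions made for
# illustration; only the -mtune and -falign-* flags come from the text.

VARIANTS = {
    "p4": ["-mtune=pentium4"],
    "p4-aligned": [
        "-mtune=pentium4",
        "-falign-loops=16",
        "-falign-jumps=16",
        "-falign-functions=16",
    ],
    "athlon-xp": ["-mtune=athlon-xp"],
}


def gcc_command(name, source="bench.c", opt="-O2"):
    """Return the argument list for building one variant."""
    return ["gcc", opt, *VARIANTS[name], "-o", f"bench-{name}", source]


if __name__ == "__main__":
    for name in VARIANTS:
        print(" ".join(gcc_command(name)))
```

Each resulting binary would then be timed on every machine of interest,
several times over, so the run-to-run variation is visible.
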
I ran another set of experiments on a 1.2GHz VIA C3 Nehemiah.
It really doesn't seem to care what the code it's running was
tuned for, as the results varied so slightly as to be
lost completely in the noise, even more so than they were
on the Athlon. So, given -mtune=pentium4 is a win for
P4s and costs next to nothing on the other CPUs I tried, and
there are a lot of P4s out there, it's easy to see
why this was chosen as the default optimisation target
for Fedora RPMs. The kernel is a special case here.
Fedora's 686 kernel is compiled with no special -mtune,
which makes gcc default to optimising instruction
scheduling for Pentium Pro. I might add -mtune=pentium4
and see if I can coerce some better benchmark results
out of a 'tuned' kernel. The userspace results are
interesting enough to make that worth spending some time on.
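
Since most of the differences above were "lost in the noise", a tuned
kernel would need some test for that before any result is believed.
Here's a crude sketch of one: it treats a difference as noise when the
gap between the mean timings is no bigger than the run-to-run spread.
The function name and the one-standard-deviation threshold are my own
choices for illustration, not a method used for the numbers above.

```python
from statistics import mean, stdev


def lost_in_noise(baseline, tuned):
    """Crude check: is the gap between the two mean timings no larger
    than the bigger of the two sample standard deviations?

    Both arguments are lists of run times (same unit) and each needs at
    least two entries, since stdev() is undefined for a single sample.
    """
    spread = max(stdev(baseline), stdev(tuned))
    return abs(mean(baseline) - mean(tuned)) <= spread
```

Feed it the timings from repeated runs of the stock and the tuned build;
a True result means more runs (or a quieter machine) are needed before
claiming a win either way.
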