Re: 2,6.26-rc4-git2 - long pause during boot

Previous thread: [PATCH] drm: fix crash due to /proc registration race by Arjan van de Ven on Saturday, May 31, 2008 - 8:39 pm. (5 messages)

Next thread: [POWERPC] disabling CONFIG_BUG breaks kernel compile by Keri Harris on Saturday, May 31, 2008 - 9:23 pm. (1 message)
From: Chris Clayton
Date: Saturday, May 31, 2008 - 9:15 pm

Hi,

I've just started testing 2.6.26 on my desktop machine and have found that I get 
a long pause (20 seconds or so) as the system boots. By pause I mean that the 
boot messages that normally scroll by stop doing so for, as I say, about 20 
seconds.

I've done a bisect and arrived at:

b0ed43360fdca227048d88a08290365cb681c1a8 is first bad commit
commit b0ed43360fdca227048d88a08290365cb681c1a8
Author: Hannes Reinecke <hare@suse.de>
Date:   Tue Mar 18 14:32:28 2008 +0100

    [SCSI] add scsi_host and scsi_target to scsi_bus

    This patch implements scsi_host and scsi_target device types
    and adds both to the scsi_bus.

    Signed-off-by: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
46e4fd28cc537ba700eb4f5652abf043fed4d76d M      drivers

The config file for that kernel is attached as are the kernel log for booting 
that kernel and the bisect log.

I've also noticed that it takes much longer for icons to appear on my desktop 
when I plug a usb storage device in. (The icons are shown/hidden by a simple 
sctript that is run via udev when the device is attached/detached.)

Let me know of any way I can help solve this.

Please cc me to any reply as I'm not subscribed.

Chris
-- 
Beauty is in the eye of the beerholder.
From: Andrew Morton
Date: Thursday, June 5, 2008 - 10:17 pm

cc's added.

Rafael, we might need to track this as a post-2.6.26 regression.

--

From: Chris Clayton
Date: Thursday, June 5, 2008 - 11:07 pm

Thanks for the reply, Andrew. I was beginning to think that I'd committed an 
LKML sin :-)

Yes, it is about the same. On 2.6.25.4 (and earlier), the icons appear within a 
second or two, but with 2.6.26-rc2 or later, it takes about 20 seconds. I've 
now built and installed -rc5 and am seeing the same behaviour.

I've also built -rc5 on my laptop. On that my kernel uses the older IDE drivers 
rather than libata. The boot proceeds with no pause, but I do see the same 



-- 
Beauty is in the eye of the beerholder.
--

From: Andrew Morton
Date: Thursday, June 5, 2008 - 11:21 pm

Well you did - you reported a bug.

<watches wodim wibble for five minutes then report "CD/DVD-Recorder

OK that makes sense.  ata and usb use the scsi layer.  IDE does not.

--

From: Andrew Morton
Date: Thursday, June 5, 2008 - 11:25 pm

It goes much better when the CD is inserted upside up.
--

From: James Bottomley
Date: Friday, June 6, 2008 - 8:22 am

So I can close this regression if I send you a pack of CD ring labels
with "other way up" on them for you to affix to the underside of your
CDs ... ?

James


--

From: Andrew Morton
Date: Friday, June 6, 2008 - 11:03 am

Would prefer double-sided CDs.

Is this 20-second-delay thing known-about/expected/etc?
--

From: James Bottomley
Date: Friday, June 6, 2008 - 11:40 am

I just discovered my budget only runs to a permanent marker for you to

If it's not the scsi_bus_uevent problem, then no, it isn't.

The odd thing is that it occurs in the middle of USB initialisation from

Note the jump from 19 to 44 seconds.



I'm a bit baffled as to how that could be SCSI related, but I suppose it
could be udev related somehow.  Perhaps turning on driver core debugging
might help (that's CONFIG_DEBUG DRIVER ... it depends on
CONFIG_DEBUG_KERNEL for activation).  Hopefully the verbose kernel logs
from here might tell us what's going on in the lacunae.

James


--

From: Chris Clayton
Date: Friday, June 6, 2008 - 10:16 pm

OK, here's a boot log with driver core debugging turned on. Hope it helps!

(Oh, and thanks for adding the word lacunae to my vocabulary. I'd never com=
e=20
across it before.)




=2D-=20
Beauty is in the eye of the beerholder.
From: Chris Clayton
Date: Saturday, June 7, 2008 - 11:19 pm

--Boundary-00=_NG3SIU63+Zdv2JV
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline


Because of this oddity, I've also built 2.6.26-rc5-git2 with verbose USB 
debugging on. A boot log from that kernel is attached. You will see lots of 
ETIME errors from usb. A log from a similarly verbose 2.6.25.5 kernel does not 

Triggered by James' bafflement and for my own peace of mind, I've run another
bisect session. This time I started with v2.6.26 as good and v2.6.26-rc2 as bad.
It highlighted the same patch as the "culprit".



--
Beauty is in the eye of the beerholder.

From: James Bottomley
Date: Sunday, June 8, 2008 - 11:37 am

I'm starting to think this might be a distribution problem.  I tried
booting my laptop on your config (only system I have with ata_piix)
without an initrd (bit painful for fedora 9) and I didn't see any slow
downs (in fact it was a lot faster than previously; makes note to file
bug with redhat about their initramfs).

The one thing the patch you identified does is to add the SCSI target to
the BUS and type model for the generic device.  What this does is to add
extra uevents for the target.  If your hotplug handlers aren't set up to
expect this, they could be delaying the system progress.


Actually, that's not really necessary.  Bisection is taking you back
through kernel history, so unless the bug has two separate causes, it's
pretty likely to home in on the same section of code.

James


--

From: Greg KH
Date: Monday, June 9, 2008 - 4:07 pm

What distro is this being seen on?  I know that openSUSE 11.0 betas did
have a bug in their init scripts that was causing a long delay, much
like this one.  But it has been fixed in the later releases
(hopefully...)

thanks,

greg k-h
--

From: Chris Clayton
Date: Tuesday, June 10, 2008 - 11:49 am

As I repied to James a few minutes ago, my desktop system was originally (the 
now lapsed) Peanut Linux, which was a Slackware derivative. But I've been 
running it for 3 or 4 years, upgrading and adding packages along the way. My 
udev setup is based on that in a fairly recent Slackware, although I can't now 
recall precisely which version.

Thanks




-- 
Beauty is in the eye of the beerholder.
--

From: Chris Clayton
Date: Tuesday, June 10, 2008 - 3:06 pm

[Apologies to copy recipients - I accidentally replied to only James earlier 
this evening. I've enclosed the text of my reply in square brackets at the 
relevant places below, followed, where appropriate, by my findings from 
tonight's investigations]

 [OK, I'll start investigating this line. My desktop system was originally (the 
 now lapsed) Peanut Linux, which was a Slackware derivative. But I've been 
 running it for 3 or 4 years, upgrading and adding packages along the way. My 
 udev setup is based on that in a fairly recent Slackware, although I can't now 
 recall precisely which version.]

For the time being I'll try and isolate what it is that causes the delay in 
icons appearing on my desktop when I plug in a usb-storage device. The script 
that shows the icon is invoked by adding the script name to udev's run list. 
Here is a typical rule from /etc/udev/rules.d/20-local.rules:

ACTION=="add", KERNEL=="sd[a-z][0-9]", SYSFS{idProduct}=="6545", 
SYSFS{idVendor}=="0930", SYMLINK+="usbstickc%n", MODE="0666", 
RUN+="/usr/bin/udevshowicon.sh"

I've added a couple of calls to logger (from util-linux-2.12r) to the script. 
What this has shown is that, when I plug in a usb-storage device, there seems 
to be a delay between the usb/scsi/usb-storage drivers setting up the device 
and my icon-showing script being run. Here is the snippet from the kernel log 
file:

Jun 10 20:27:45 upstairs kernel: [ 3200.592107] hub 3-0:1.0: state 7 ports 5 chg 
0000 evt 0004
Jun 10 20:27:45 upstairs kernel: [ 3200.592107] ehci_hcd 0000:02:0a.2: GetStatus 
port 2 status 001803 POWER sig=j CSC CONNECT
Jun 10 20:27:45 upstairs kernel: [ 3200.592107] hub 3-0:1.0: port 2, status 
0501, change 0001, 480 Mb/s
Jun 10 20:27:45 upstairs kernel: [ 3200.695130] hub 3-0:1.0: debounce: port 2: 
total 100ms stable 100ms status 0x501
Jun 10 20:27:45 upstairs kernel: [ 3200.746270] ehci_hcd 0000:02:0a.2: port 2 
high speed
Jun 10 20:27:45 upstairs kernel: [ 3200.746279] ehci_hcd 0000:02:0a.2: GetStatus ...
From: Alan Stern
Date: Tuesday, June 10, 2008 - 3:19 pm

...

You can eliminate this 5-second delay by adding

	options usb-storage delay_use=0

to /etc/modprobe.conf.  Of course, this will make the other delays just 
stand out that much more sharply...

Alan Stern

--

From: Chris Clayton
Date: Wednesday, June 11, 2008 - 11:58 am

Hi everyone,



I've done that poking around and found something that to me looks interesting.

Before plugging in a usb stick, I changed udev's log_priority to "debug" (using 
udevcontrol). Attached is the snippet from /var/log/messages that shows all the 
diagnostics emitted by udev when the stick was plugged in. (This is with 
2.6.26-rc5-git5). You will see that there are several instances of a message 
about udev unsuccesfully waiting for a file ioerr_cnt to appear deep down in 
the guts of /sys. Eventually, it does appear, but only after 21 seconds have 
elapsed. On 2.6.25.6, the file appears almost as soon as udev starts looking 
for it. The log extract showing this is also attached as is an extract from the 
kernel log for 2.6.26-rc5-git5.

By the way, the main udev-rules file that is installed is the one for slackware 
that comes with the udev package from kernel.org.  So the file these tests were 
run with are that from udev-122, which I upgraded to earlier this evening (more 
in hope than expectation :-)




-- 
Beauty is in the eye of the beerholder.
From: Greg KH
Date: Wednesday, June 11, 2008 - 12:15 pm

This doesn't look good.  It looks like a udev rule in your system is
looking for a file that will never show up.  That's a short way to a
long delay time :)

Try commenting out the rule that does this and see if things are fixed.

thanks,

greg k-h
--

From: James Bottomley
Date: Wednesday, June 11, 2008 - 12:24 pm

Actually, there's something seriously wrong here:  ioerr_cnt is a
property of the device, not of the host (as in it will never appear
under .../host4 but under .../host4/targetX:X:X/4:X:X:X/ioerr_cnt).
Perhaps an investigation of why udevd-event is looking under host4/ is
in order.

James


--

From: Stefan Richter
Date: Wednesday, June 11, 2008 - 12:29 pm

So the change which commit b0ed43360fdca227048d88a08290365cb681c1a8 
"[SCSI] add scsi_host and scsi_target to scsi_bus" introduced is leading 
the udev scripts onto a wrong trail?

(This commit was the one which Chris found by bisection, see 
http://lkml.org/lkml/2008/6/1/2)
-- 
Stefan Richter
-=====-==--- -==- -=-==
http://arcgraph.de/sr/
--

From: James Bottomley
Date: Wednesday, June 11, 2008 - 12:36 pm

Yes, I think so ... what it does is make us get bus events for the
target as well as the lun, whereas without it we only get them for the
lun.  I think something is misparsing the new target event and this is
where the trouble is coming from.

James


--

From: Chris Clayton
Date: Thursday, June 12, 2008 - 11:56 pm

Hi James,


I assume that since we now the source of this problem, you can create it at 
will. I only mention because if you need me to test a patch, it will have to be 
before Tuesday, June 17 - I go on holiday (vacation) on that day.

Thanks




-- 
Beauty is in the eye of the beerholder.
--

From: James Bottomley
Date: Friday, June 13, 2008 - 7:12 am

Yes, we can; thanks!

The problem though is that it's in the udev scripts, so there's nothing
we can fix in the kernel, we now need to persuade the various distros to
fix it (if they're actually broken), so it might take a while.

James


--

From: Greg KH
Date: Wednesday, June 11, 2008 - 12:33 pm

I blame a badly written udev rule :)

thanks,

greg k-h
--

From: Chris Clayton
Date: Wednesday, June 11, 2008 - 12:51 pm

Yes, that's it. It's this rule, which appears as the second rule in the 
slackware rules file that's distributed with udev (and it's been there since 
udev-97 in August 2006):

ACTION=="add", SUBSYSTEM=="scsi", WAIT_FOR_SYSFS="ioerr_cnt"

Commenting that rule out eliminates the pause in the boot and in the appearance 
of icons on my desktop.

Thanks,



-- 
Beauty is in the eye of the beerholder.
--

From: Stefan Richter
Date: Wednesday, June 11, 2008 - 2:04 pm

It's called a workaround, not a fix.
-- 
Stefan Richter
-=====-==--- -==- -=-==
http://arcgraph.de/sr/
--

From: Stefan Richter
Date: Wednesday, June 11, 2008 - 2:46 pm

FWIW, Gentoo's udev 119 has this:
$ grep -r ioerr_cnt /etc/udev/
/etc/udev/rules.d/05-udev-early.rules:ACTION=="add", 
KERNEL=="[0-9]*:[0-9]*", SUBSYSTEM=="scsi", WAIT_FOR_SYSFS="ioerr_cnt"

Whatever this rule is good for...  Maybe the pattern should be 
"[0-9]+(:[0-9]+){3}" to match logical units only.
-- 
Stefan Richter
-=====-==--- -==- -=-==
http://arcgraph.de/sr/
--

From: Stefan Richter
Date: Wednesday, June 11, 2008 - 3:03 pm

Actually no; looks like these are just glob(7) patterns.  Then the 
pattern might already be OK since host devices' and target devices' 
names don't start with decimal numbers.
-- 
Stefan Richter
-=====-==--- -==- -=-==
http://arcgraph.de/sr/
--

From: Chris Clayton
Date: Friday, June 13, 2008 - 11:03 pm

For Slackware 12.0.0, the rule is:

ACTION=="add", KERNEL=="[0-9]*:[0-9]*", SUBSYSTEM=="scsi", 
WAIT_FOR_SYSFS="ioerr_cnt"

Using that rule instaed of the old one in the Slackware rules file included with 
udev appears to result in the same behaviour as with the old rule and kernel < 
2.6.26.

Unless someone warns me otherwise, I think I'll go with that on my (homebrewed) 
system, so thanks to everyone for your help.

Chris
-- 
Beauty is in the eye of the beerholder.
--

From: Alan Stern
Date: Monday, June 9, 2008 - 7:46 am

In fact the log shows that the 30-second delay occurs in the middle of
ohci1394 (Firewire) initialization, not USB initialization:

Jun  8 05:26:35 upstairs kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[22]  MMIO=[fbfff000-fbfff7ff]  Max Packet=[2048]  IR/IT contexts=[4/8] 
Jun  8 05:26:35 upstairs kernel: ohci_hcd 0000:02:0a.0: urb f71cb680 path 2 ep1in 5e160000 cc 5 --> status -62 
Jun  8 05:26:35 upstairs kernel: hub 1-2:1.0: transfer --> -62 
Jun  8 05:27:08 upstairs kernel: driver: '0000:02:0b.0': driver_bound: bound to device 'ohci1394' 
Jun  8 05:27:08 upstairs kernel: bus: 'pci': really_probe: bound device 0000:02:0b.0 to driver ohci1394 

Those -62 errors from ohci_hcd are red herrings.  To test this, try 
booting with "nousb".

Alan Stern

--

From: Stefan Richter
Date: Monday, June 9, 2008 - 12:10 pm

[Empty message]
From: Alan Stern
Date: Monday, June 9, 2008 - 12:40 pm

It's also worth noting that the first 17:25:44 message is highly 
suspicious.  It is the initial output of the ohci-hcd driver, 
indicating that the udev/hotplug system had just loaded that driver.  
This doesn't seem to have anything to do with the SCSI stack.

Whatever caused the delay, it also managed to pause the other hotplug 
task which was in the midst of loading uhci-hcd.

Alan Stern

--

From: Chris Clayton
Date: Monday, June 9, 2008 - 2:35 pm

Sorry folks, I'm unexpectedly away from the computers that these
"problems" occur on.

I should be back at home tomorrow and will answer you all as quickly
as possible then.

Thanks,




-- 
Beauty is in the eye of the beerholder.
--

From: Chris Clayton
Date: Tuesday, June 10, 2008 - 12:00 pm

Ah sorry, that's a legacy of when I had considerably less ram than 2GB. I only 
use firewire on the rare occasion that I want to grab some video from my camera 
and I use the parallel port even less frequently. So one of the last things 

When I was bisecting, I assessed good or bad based on whether I experienced the 
long pause during boot up. You may have seen my reply to James a short while 
ago that I have bisected again and arrived at the same patch.

Thanks,

Chris



-- 
Beauty is in the eye of the beerholder.
--

From: Stefan Richter
Date: Tuesday, June 10, 2008 - 12:38 pm

On an unrelated note:  If the driver was automatically loaded in a 
hotplug-like fashion (based on PCI IDs and module aliases), you could 
prevent it with "blacklist ohci1394" in one of the modprobe 
configuration files.
-- 
Stefan Richter
-=====-==--- -==- -=-=-
http://arcgraph.de/sr/
--

From: James Bottomley
Date: Friday, June 6, 2008 - 7:11 am

It already is ... as I asked in the actual bug (but no-one seems to have
noticed, so I assume the bug entry isn't set up correctly).  Does this
still persist with 2.6.26-rc5?  The reason for asking is that it looks
very much like a silent manifestation of our scsi_bus_uevent() problem
that got fixed in 2.6.26-rc5.

James


--

From: Rafael J. Wysocki
Date: Friday, June 6, 2008 - 7:42 am

The bug reporter has no Bugzilla account, so I couldn't add him to the CC list
of the bug.

Thanks,
Rafael
--

From: Chris Clayton
Date: Friday, June 6, 2008 - 10:42 am

I do have a bugzilla account, but for some reason I couldn't log in to it to 
report this bug, so I mailed LKML to get the ball rolling. I can now log in so 
I've added myself to the CC list.




-- 
Beauty is in the eye of the beerholder.
--

From: Chris Clayton
Date: Friday, June 6, 2008 - 10:35 am

Yes, I still get the boot pause and the delay in desktop icons appearing when a 
usb-storage device is plugged in with -rc5. A boot log is attached. Is there 
some sort of verbose scsi boot messaging that I can switch on? I've looked but 
I'm damned if I can find anything in 'make menuconfig'.




-- 
Beauty is in the eye of the beerholder.
From: Grant Grundler
Date: Sunday, June 8, 2008 - 9:51 pm

SCSI_CONSTANTS and SCSI_LOGGING might help.
Both live in the SCSI options menu. Read the Kconfig help on how
to use SCSI_LOGGING.

Can you please also enable CONFIG_PRINTK_TIME ?

Using the timestamps in /var/log/messages is ok but not the most
reliable place to get somewhat accurate time stamps of events.

hth,
grant
--

From: Chris Clayton
Date: Tuesday, June 10, 2008 - 11:46 am

I've enabled CONFIG_PRINTK_TIME and will look into the SCSI options.




-- 
Beauty is in the eye of the beerholder.
--

From: Rafael J. Wysocki
Date: Friday, June 6, 2008 - 3:45 am

We're tracking it as http://bugzilla.kernel.org/show_bug.cgi?id=10861 .

Thanks,
Rafael
--

Previous thread: [PATCH] drm: fix crash due to /proc registration race by Arjan van de Ven on Saturday, May 31, 2008 - 8:39 pm. (5 messages)

Next thread: [POWERPC] disabling CONFIG_BUG breaks kernel compile by Keri Harris on Saturday, May 31, 2008 - 9:23 pm. (1 message)