Re: [BUG] 2.6.23-rc3 can't see sd partitions on Alpha

Previous thread: increased number of cycles by kernel coder on Saturday, November 17, 2007 - 10:31 pm. (4 messages)

Next thread: [PATCH]new_TSC_based_delay_tsc() by Marin Mitov on Sunday, November 18, 2007 - 2:20 am. (2 messages)
From: Bob Tracy
Date: Saturday, November 17, 2007 - 10:20 pm

Completely reproducible... 2.6.24-rc3 kernel boots, and normal messages
are seen on console as far as disks found and partitions on each.  However,
once /dev is populated and the boottime scripts attempt to check filesystem
status, no partitions on either of the two disks attached to the SCSI
controller are seen.  Dropping into a single-user root shell confirms
the sudden "blindness": fdisk can't open /dev/sda.

When I reboot on 2.6.24-rc2, everything works normally.

System environment is Debian Etch.  Both 2.6.24-rc2 and -rc3 were built
from the respective unaltered kernel.org source trees, using the same
kernel configuration modulo saying "no" to CONFIG_SENSORS_I5K_AMB and
CONFIG_PID_NS in -rc3.  No problems with -rc3 on an x86 box.

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
-

From: Rafael J. Wysocki
Date: Sunday, November 25, 2007 - 5:15 am

Added to the list, http://bugzilla.kernel.org/show_bug.cgi?id=9457 .

Thanks,
Rafael
-

From: Bob Tracy
Date: Monday, November 26, 2007 - 6:48 am

I was out of town last week, and will be out this week as well.  Won't be
able to do the bisection until next week at the earliest, but I have remote
access to the box if there's anything useful to be done that doesn't require
a reboot.  No logs available for the "no sd access" case: I'd have to rig up
something to record the console output during boot if that's needed.  Here's
hoping someone else is seeing this or can replicate it in the meantime.

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
-

From: Michael Cree
Date: Friday, November 30, 2007 - 3:30 pm

Snap.

2.6.24-rc2 works fine.   2.6.24-rc3 boots on Alpha but once /dev is 
populated no partitions of the scsi sub-system are seen.  Looks like ide 
sub-system similarly affected.

Managed to get boot log.  Follows below (with output of various /proc info).

Cheerz
Michael.


Linux version 2.6.24-rc3 (mjc@alpha) (gcc version 4.1.3 20071019 (prerelease) (Debian 4.1.2-17)) #1 Mon Nov 26 19:28:58 NZDT 2007
Booting on Tsunami variation Monet using machine vector Monet from SRM
Major Options: EV67 LEGACY_START VERBOSE_MCHECK
Command line: ro root=/dev/sda3 console=ttyS0
memcluster 0, usage 1, start        0, end      215
memcluster 1, usage 0, start      215, end   131062
memcluster 2, usage 1, start   131062, end   131072
freeing pages 215:384
freeing pages 930:131062
reserving pages 930:932
4096K Bcache detected; load hit latency 21 cycles, load miss latency 127 cycles
Console graphics on hose 0
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 130167
Kernel command line: ro root=/dev/sda3 console=ttyS0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Using epoch = 2000
Turning on RTC interrupts.
Console: colour VGA+ 80x25
console [ttyS0] enabled
Dentry cache hash table entries: 131072 (order: 7, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 6, 524288 bytes)
Memory: 1030896k/1048496k available (2786k kernel code, 15216k reserved, 370k data, 168k init)
Mount-cache hash table entries: 512
net_namespace: 120 bytes
NET: Registered protocol family 16
PCI: Bridge: 0001:01:08.0
   IO window: 8000-8fff
   MEM window: 09000000-090fffff
   PREFETCH window: disabled.
SMC37c669 Super I/O Controller found @ 0x3f0
Linux Plug and Play Support v0.97 (c) Adam Belay
SCSI subsystem initialized
NET: Registered protocol family 2
IP route cache hash table entries: 8192 (order: 3, 65536 bytes)
TCP established hash table entries: 32768 (order: 6, 524288 bytes)
TCP bind hash table entries: 32768 (order: 5, 262144 bytes)
TCP: Hash tables ...
From: Andrew Morton
Date: Friday, November 30, 2007 - 3:42 pm

On Sat, 01 Dec 2007 11:30:01 +1300


I guess this is where things go bad.

scsi_id is part of udev.  Perhaps some sysfs nodes aren't being created
correctly.

Random guess: what is your setting of CONFIG_SCSI_SCAN_ASYNC and what
-

From: Michael Cree
Date: Sunday, December 2, 2007 - 1:53 pm

[Empty message]
From: Bob Tracy
Date: Sunday, December 2, 2007 - 6:17 pm

Thanks for the confirmation of the error condition.  As best I can
recall, your boot log is substantially the same as what I saw.

Finally got back in town.  Starting the git-bisect process.  I've got
a relatively slow network connection, and the PWS 433au isn't exactly
what I would call "fast" by modern standards, so bear with me while I
get things set up and crank through this.  The clone of the 2.6 tree
will take several more hours to finish downloading.  I anticipate the
best pace I'll be able to manage after that is two iterations in a
24-hour period.

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
--

From: Ingo Molnar
Date: Tuesday, December 4, 2007 - 5:16 am

once you are done with the download of the initial cloned git repository 
(which is 200MB+), all the bisection steps will be local and you'll be 
only limited by kernel rebuild speed and by bootup and testing speed, 
not by network bandwidth.

( once you have the cloned repository i'd suggest for you to keep it - 
  that way you can track subsequent kernels via "git-pull" and it uses a 
  very network-efficient delta protocol. )

	Ingo
--

From: Bob Tracy
Date: Tuesday, December 4, 2007 - 8:36 am

ACK.  Have tested two kernels in the past 24 hours, and the third is
building as I type this.  The builds seem to be taking about 3 hours
each.  First two tests good, so the offending commit is somewhere in
the last 25% (roughly) of the changes between -rc2 and -rc3: git says
82 revisions left to test.  Might have this painted into a corner in

Will do...  I'm in the fortunate position of having enough disk space
on my Alpha that I can maintain multiple trees for this kind of effort.
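
As a rough cross-check of the figures quoted above (82 revisions left,
builds of about 3 hours each), the remaining effort can be estimated.
This is only a sketch: it assumes every step halves the candidate range,
so it gives an upper bound rather than git's own "roughly N steps"
figure, and the function names are purely illustrative.

```shell
#!/bin/sh
# Estimate how many more bisection steps are needed for N candidate
# revisions. Assumption: each step halves the range, so the result is
# ceil(log2(N)), computed here with integer arithmetic only.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Rough wall-clock estimate in hours, given the hours needed for one
# build-and-test cycle.
bisect_hours() {
    echo $(( $(bisect_steps "$1") * $2 ))
}

bisect_steps 82      # steps left for 82 revisions
bisect_hours 82 3    # hours at about 3 hours per build
```

The estimate covers build time only; reboot and test time come on top.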

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
--

From: Bob Tracy
Date: Thursday, December 6, 2007 - 5:16 pm

OK.  Finally have this thing painted into a corner: git has identified
6f37ac793d6ba7b35d338f791974166f67fdd9ba as the first bad commit.

From "git bisect log", this corresponds to 

# bad: [6f37ac793d6ba7b35d338f791974166f67fdd9ba] Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

Here's the full log:

git-bisect start
# good: [9aae299f7fd1888ea3a195cfe0edef17bb647415] Linux 2.6.24-rc2
git-bisect good 9aae299f7fd1888ea3a195cfe0edef17bb647415
# bad: [f05092637dc0d9a3f2249c9b283b973e6e96b7d2] Linux 2.6.24-rc3
git-bisect bad f05092637dc0d9a3f2249c9b283b973e6e96b7d2
# good: [e6a5c27f3b0fef72e528fc35e343af4b2db790ff] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
git-bisect good e6a5c27f3b0fef72e528fc35e343af4b2db790ff
# good: [42614fcde7bfdcbe43a7b17035c167dfebc354dd] vmstat: fix section mismatch warning
git-bisect good 42614fcde7bfdcbe43a7b17035c167dfebc354dd
# bad: [a052f4473603765eb6b4c19754689977601dc1d1] Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/x86
git-bisect bad a052f4473603765eb6b4c19754689977601dc1d1
# good: [d8e5219f9f5ca7518eb820db9f3d287a1d46fcf5] CRISv10 improve and bugfix fasttimer
git-bisect good d8e5219f9f5ca7518eb820db9f3d287a1d46fcf5
# good: [d90bf5a976793edfa88d3bb2393f0231eb8ce1e5] [NET]: rt_check_expire() can take a long time, add a cond_resched()
git-bisect good d90bf5a976793edfa88d3bb2393f0231eb8ce1e5
# good: [2a113281f5cd2febbab21a93c8943f8d3eece4d3] kconfig: use $K64BIT to set 64BIT with all*config targets
git-bisect good 2a113281f5cd2febbab21a93c8943f8d3eece4d3
# good: [2e2cd8bad6e03ceea73495ee6d557044213d95de] CRISv10 memset library add lineendings to asm
git-bisect good 2e2cd8bad6e03ceea73495ee6d557044213d95de
# bad: [6f37ac793d6ba7b35d338f791974166f67fdd9ba] Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
git-bisect bad 6f37ac793d6ba7b35d338f791974166f67fdd9ba
# good: [2f1f53bdc6531696934f6ee7bbdfa2ab4f4f62a3] CRISv10 fasttimer: Scrap ...
From: Andrew Morton
Date: Thursday, December 6, 2007 - 5:33 pm

On Thu, 6 Dec 2007 18:16:12 -0600 (CST)

commit 6f37ac793d6ba7b35d338f791974166f67fdd9ba
Merge: 2f1f53b... d90bf5a...
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date:   Wed Nov 14 18:51:48 2007 -0800

    Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/n
    
    * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
      [NET]: rt_check_expire() can take a long time, add a cond_resched()
      [ISDN] sc: Really, really fix warning
      [ISDN] sc: Fix sndpkt to have the correct number of arguments
      [TCP] FRTO: Clear frto_highmark only after process_frto that uses it
      [NET]: Remove notifier block from chain when register_netdevice_notifier f
      [FS_ENET]: Fix module build.
      [TCP]: Make sure write_queue_from does not begin with NULL ptr
      [TCP]: Fix size calculation in sk_stream_alloc_pskb
      [S2IO]: Fixed memory leak when MSI-X vector allocation fails
      [BONDING]: Fix resource use after free
      [SYSCTL]: Fix warning for token-ring from sysctl checker
      [NET] random : secure_tcp_sequence_number should not assume CONFIG_KTIME_S
      [IWLWIFI]: Not correctly dealing with hotunplug.
      [TCP] FRTO: Plug potential LOST-bit leak
      [TCP] FRTO: Limit snd_cwnd if TCP was application limited
      [E1000]: Fix schedule while atomic when called from mii-tool.
      [NETX]: Fix build failure added by 2.6.24 statistics cleanup.
      [EP93xx_ETH]: Build fix after 2.6.24 NAPI changes.
      [PKT_SCHED]: Check subqueue status before calling hard_start_xmit

I'm struggling to see how any of those could have broken block device
mounting on alpha.  Are you sure you bisected right?

--

From: Bob Tracy
Date: Thursday, December 6, 2007 - 10:07 pm

Based on what's in that commit, it *does* appear something went wrong
with bisection.  If the implicated commit is the next one in time
sequence relative to

# good: [2f1f53bdc6531696934f6ee7bbdfa2ab4f4f62a3] CRISv10 fasttimer: Scrap INLINE and name timeval_cmp better

then the test of whether I bisected correctly is as simple as applying
the commit and seeing if things break, because I'm running on the
kernel corresponding to 2f1f53bdc6531696934f6ee7bbdfa2ab4f4f62a3 right
now.  Let me give that a try and I'll report back.  Worst case, I'll
have to start over and write off the past four days...

Sorry about this...

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
--

From: Andrew Morton
Date: Friday, December 7, 2007 - 3:26 am

Gad.  I trust the second time will be faster.

git-bisect _is_ very error prone.  I find one of the problems is that each
step is so far apart in time that you forget what you were doing.  Did I

Not appropriate ;)   Thanks for helping out.
--

From: Ingo Molnar
Date: Friday, December 7, 2007 - 4:37 am

i have a fully automated bootup-hang bisection script. It is based on 
"git-bisect run". I run the script, it builds and boots kernels fully 
automatically, and when the bootup fails (the script notices that via 
the serial log, which it continuously watches - or via a timeout, if the 
system does not come up within 10 minutes it's a "bad" kernel), the 
script raises my attention via a beep and i power cycle the test box. 
(yeah, i should make use of a managed power outlet to 100% automate it) 

So i don't have to make a single manual decision anytime during the bisection. 
But the scripts are very much tied to my ad-hoc test environment so it 
would not be of much general use.
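
For readers without such a setup, the log-watching part of a
"git bisect run" helper might look roughly like the sketch below. It is
not Ingo's script: the marker text, the log path and the function name
are placeholders chosen for illustration. The only part relied on from
git itself is the exit-status convention of "git bisect run": 0 means
good, 125 means skip this revision, and any other status from 1 to 127
means bad.

```shell
#!/bin/sh
# Hypothetical helper for "git bisect run". The marker string is an
# assumed sign that the boot completed; adjust it to the test box.
BOOT_OK_MARKER=${BOOT_OK_MARKER:-"login:"}

# Poll a serial-console log until the marker shows up or the timeout
# (in seconds) expires. Prints "good" or "bad" and returns 0 or 1,
# which maps directly onto what "git bisect run" expects.
wait_for_boot() {
    log=$1
    timeout=$2
    waited=0
    while :; do
        if [ -f "$log" ] && grep -qF -- "$BOOT_OK_MARKER" "$log"; then
            echo good
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo bad
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

A wrapper script would build and install the kernel first (exiting with
125 if the build itself fails, so that revision is skipped), reboot the
test box, and then call something like "wait_for_boot serial.log 600" to
match the 10 minute timeout described above.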

	Ingo
--

From: Bob Tracy
Date: Friday, December 7, 2007 - 6:39 am

Thanks for the kind words...  The above-mentioned test verified that the
bisection was/is correct: 2f1f53bdc6531696934f6ee7bbdfa2ab4f4f62a3 works,
and 6f37ac793d6ba7b35d338f791974166f67fdd9ba doesn't.  Now I've got to
figure out why.

"git diff 2f1f53bdc6531696934f6ee7bbdfa2ab4f4f62a3 6f37ac793d6ba7b35d338f791974166f67fdd9ba"
produced a relatively short patch (18,437 bytes).  The list of involved
files:

diff --git a/drivers/char/random.c b/drivers/char/random.c
diff --git a/drivers/isdn/sc/card.h b/drivers/isdn/sc/card.h
diff --git a/drivers/isdn/sc/packet.c b/drivers/isdn/sc/packet.c
diff --git a/drivers/isdn/sc/shmem.c b/drivers/isdn/sc/shmem.c
diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
diff --git a/drivers/net/fs_enet/Kconfig b/drivers/net/fs_enet/Kconfig
diff --git a/drivers/net/fs_enet/Makefile b/drivers/net/fs_enet/Makefile
diff --git a/drivers/net/netx-eth.c b/drivers/net/netx-eth.c
diff --git a/drivers/net/s2io.c b/drivers/net/s2io.c
diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
diff --git a/include/net/sock.h b/include/net/sock.h
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
diff --git a/net/core/dev.c b/net/core/dev.c
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
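
A list like the one above can be pulled out of a saved patch
mechanically. A minimal sketch, assuming the patch is in git's usual
format and is fed in on standard input (the function names are
illustrative only):

```shell
#!/bin/sh
# Print the path of every file touched by a git-format patch read from
# stdin, using the "diff --git a/<path> b/<path>" header lines.
patch_files() {
    sed -n 's|^diff --git a/\(.*\) b/.*$|\1|p'
}

# Count touched files per top-level directory, as a starting point for
# picking related groups of changes to back out together.
patch_dirs() {
    patch_files | cut -d/ -f1 | sort | uniq -c
}
```

Typical use would be along the lines of "git diff <good> <bad> | patch_files".
A single file's change can then be backed out with
"git diff <good> <bad> -- <path> | patch -p1 -R".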

Current state of the source tree is the 6f37ac... version, so I'll start
backing out the above diffs in related groups and continue until I've got
a working kernel.  For lack of an obvious target, I'll start with the
seemingly innocuous change to sysctl_check.c.  I'll report back when I've
got something.

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an ...
From: Bob Tracy
Date: Friday, December 7, 2007 - 7:55 am

That was quick :-).  Backing out the sysctl_check.c diff gives me a
working kernel.  Beats the #$%@! out of me how/why, though.

Michael Cree: could you try backing out the diff below from your
2.6.24-rc3 tree and see if things are now working for you?

Here's "uname -a", just to confirm (maybe) I'm running on what I say
works:

Linux smirkin 2.6.24-rc2-g6f37ac79-dirty #2 Fri Dec 7 08:03:12 CST 2007 alpha

Here's the diff I backed out (patch -R).  It's short...

diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index 5a2f2b2..4abc6d2 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -738,7 +738,7 @@ static struct trans_ctl_table trans_net_table[] = {
 	{ NET_ROSE,		"rose",		trans_net_rose_table },
 	{ NET_IPV6,		"ipv6",		trans_net_ipv6_table },
 	{ NET_X25,		"x25",		trans_net_x25_table },
-	{ NET_TR,		"tr",		trans_net_tr_table },
+	{ NET_TR,		"token-ring",	trans_net_tr_table },
 	{ NET_DECNET,		"decnet",	trans_net_decnet_table },
 	/*  NET_ECONET not used */
 	{ NET_SCTP,		"sctp",		trans_net_sctp_table },

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
--

From: Ingo Molnar
Date: Friday, December 7, 2007 - 8:05 am

reverting this makes the kernel image shorter by 8 bytes - so perhaps 
some alignment issue somewhere? Or something overflows? Does any of 
this actually get used by your bootup?
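
One plausible source of the 8-byte figure is the string literal itself.
A quick check, assuming the literal is the only thing that changed and
ignoring any padding the toolchain may add (the helper name is made up
for this example):

```shell
#!/bin/sh
# Size of a C string literal in bytes: its characters plus the
# terminating NUL byte.
literal_size() {
    s=$1
    echo $(( ${#s} + 1 ))
}

old_name="tr"
new_name="token-ring"
# Growth of the string data when the sysctl name is changed.
echo $(( $(literal_size "$new_name") - $(literal_size "$old_name") ))
```

A shift of that size would move whatever data follows the string, which
fits with the idea of an alignment-dependent problem elsewhere.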

	Ingo
--

From: Ingo Molnar
Date: Friday, December 7, 2007 - 11:06 am

i'm not sure how to do direct debugging on udev, so i can only guess 
about what effect on the kernel side could have caused this. One bad 
hack would be to "probe" udevd's behavior by changing the NET_TR entry 
in various ways:

  "tr" -> "token-ring"         # breaks
  "tr" -> "tr"                 # works
  "tr" -> "token-rin0"         # ?            (1)
  "tr" -> "TR"                 # ?            (2)

the question is, do tweak (1) and tweak (2) work or break?

but it would be a lot more effective i guess to get some udevd expert's 
attention on this ...

	Ingo
--

From: Kay Sievers
Date: Friday, December 7, 2007 - 11:19 am

Could we get the output of:
  ls -l /sys/block/sda/
and:
  grep . /sys/block/sda/*/dev
?

Kay

--

From: Bob Tracy
Date: Friday, December 7, 2007 - 12:36 pm

Here are the requested items for the 2.6.24-rc2-g6f37ac79-dirty kernel
(the working one with the sysctl_check.c patch reverted):

smirkin:/# ls -l /sys/block/sda
total 0
-r--r--r-- 1 root root 8192 Dec  7 08:36 capability
-r--r--r-- 1 root root 8192 Dec  7 08:36 dev
lrwxrwxrwx 1 root root    0 Dec  7 08:36 device -> ../../devices/pci0000:00/0000:00:14.0/0000:01:09.0/host0/target0:0:0/0:0:0:0
drwxr-xr-x 2 root root    0 Dec  7 08:36 holders
drwxr-xr-x 3 root root    0 Dec  7 08:36 queue
-r--r--r-- 1 root root 8192 Dec  7 08:36 range
-r--r--r-- 1 root root 8192 Dec  7 08:36 removable
drwxr-xr-x 3 root root    0 Dec  7 08:36 sda1
drwxr-xr-x 3 root root    0 Dec  7 08:36 sda2
drwxr-xr-x 3 root root    0 Dec  7 08:36 sda3
drwxr-xr-x 3 root root    0 Dec  7 08:36 sda4
drwxr-xr-x 3 root root    0 Dec  7 08:36 sda5
drwxr-xr-x 3 root root    0 Dec  7 08:36 sda6
drwxr-xr-x 3 root root    0 Dec  7 08:36 sda7
-r--r--r-- 1 root root 8192 Dec  7 08:36 size
drwxr-xr-x 2 root root    0 Dec  7 08:36 slaves
-r--r--r-- 1 root root 8192 Dec  7 08:36 stat
lrwxrwxrwx 1 root root    0 Dec  7 08:36 subsystem -> ../../block
--w------- 1 root root 8192 Dec  7 08:36 uevent
smirkin:/# grep . /sys/block/sda/*/dev
/sys/block/sda/sda1/dev:8:1
/sys/block/sda/sda2/dev:8:2
/sys/block/sda/sda3/dev:8:3
/sys/block/sda/sda4/dev:8:4
/sys/block/sda/sda5/dev:8:5
/sys/block/sda/sda6/dev:8:6
/sys/block/sda/sda7/dev:8:7

Assuming /sys/block even exists for the non-working case, I'll forward
that info in a few hours when I can get home to reboot the machine.
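
The comparison between what sysfs reports and what actually shows up
under /dev can also be scripted. A sketch, assuming the
/sys/block/<disk>/<partition>/dev layout seen in the listing above; both
directories are passed as arguments so the check can be tried on a
scratch tree, and the function name is hypothetical:

```shell
#!/bin/sh
# List partitions that sysfs knows about but that have no node under
# the given /dev directory. On a live system the arguments would be
# /sys/block and /dev.
missing_dev_nodes() {
    sys_block=$1
    dev_root=$2
    for devfile in "$sys_block"/*/*/dev; do
        [ -f "$devfile" ] || continue          # glob matched nothing
        part=$(basename "$(dirname "$devfile")")
        if [ ! -e "$dev_root/$part" ]; then
            echo "$part $(cat "$devfile")"     # name and major:minor
        fi
    done
}
```

On a working kernel "missing_dev_nodes /sys/block /dev" would be expected
to print nothing; on the broken one it should name the partitions whose
nodes never appeared.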

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
--

From: Michael Cree
Date: Friday, December 7, 2007 - 1:43 pm

Yes (conference is now over).  I backed out the sysctl_check patch from 
2.6.24-rc3 and, indeed, got a working kernel.

The working kernel (was probably 2.6.24-rc3 less sysctl_check patch, but 
might have been a 2.6.23 variant) has the following in /sys/block

alpha:~# ls -l /sys/block/
total 0
drwxr-xr-x  5 root root 0 2007-12-08 08:55 fd0
drwxr-xr-x  6 root root 0 2007-12-08 08:55 hde
drwxr-xr-x  5 root root 0 2007-12-08 08:55 hdf
drwxr-xr-x 10 root root 0 2007-12-08 08:55 sda
drwxr-xr-x  9 root root 0 2007-12-08 08:55 sdb
alpha:~# ls -l /sys/block/sda
total 0
-r--r--r-- 1 root root 8192 2007-12-08 08:55 capability
-r--r--r-- 1 root root 8192 2007-12-08 08:55 dev
lrwxrwxrwx 1 root root    0 2007-12-08 08:55 device -> ../../devices/pci0001:01/0001:01:06.0/host0/target0:0:1/0:0:1:0
drwxr-xr-x 2 root root    0 2007-12-08 08:55 holders
drwxr-xr-x 3 root root    0 2007-12-08 08:55 queue
-r--r--r-- 1 root root 8192 2007-12-08 08:55 range
-r--r--r-- 1 root root 8192 2007-12-08 08:55 removable
drwxr-xr-x 3 root root    0 2007-12-08 08:55 sda1
drwxr-xr-x 3 root root    0 2007-12-08 08:55 sda2
drwxr-xr-x 3 root root    0 2007-12-08 08:55 sda3
drwxr-xr-x 3 root root    0 2007-12-08 08:55 sda4
drwxr-xr-x 3 root root    0 2007-12-08 08:55 sda5
-r--r--r-- 1 root root 8192 2007-12-08 08:55 size
drwxr-xr-x 2 root root    0 2007-12-08 08:55 slaves
-r--r--r-- 1 root root 8192 2007-12-08 08:55 stat
lrwxrwxrwx 1 root root    0 2007-12-08 08:55 subsystem -> ../../block
--w------- 1 root root 8192 2007-12-08 08:55 uevent
alpha:~# grep . /sys/block/sda/*/dev
/sys/block/sda/sda1/dev:8:1
/sys/block/sda/sda2/dev:8:2
/sys/block/sda/sda3/dev:8:3
/sys/block/sda/sda4/dev:8:4
/sys/block/sda/sda5/dev:8:5



The broken kernel (2.6.24-rc3) has the following in /sys/block

alpha:~# ls -l /sys/block/
total 0
drwxr-xr-x  5 root root 0 Dec  8 09:22 fd0
drwxr-xr-x  6 root root 0 Dec  8 09:22 hde
drwxr-xr-x  5 root root 0 Dec  8 09:23 hdf
drwxr-xr-x 10 root root 0 Dec  8 09:22 sda
drwxr-xr-x  9 ...
From: Kay Sievers
Date: Friday, December 7, 2007 - 2:19 pm

Yeah, that looks all fine.

What distro is that, and what's the udev version?

You are booting your kernel with an initramfs?

Is the udev daemon (still) running while it fails?

If you run /sbin/udevtrigger, do the nodes appear?

Kay

--

From: Bob Tracy
Date: Friday, December 7, 2007 - 3:39 pm

Mine is Debian Etch, normally with the latest released or -rcX kernel
from kernel.org.  Updates current as of about 18 hours ago.  Udev
package version is 0.105-4.  The RELEASE-NOTES file in /usr/share/doc/udev


I can answer the above later when I'm back in front of the machine, but
even in the "not good" case, I still see the following messages from
the /etc/rcS.d/S03udev file:

	Starting the hotplug events dispatcher udevd.
	Synthesizing the initial hotplug events.

This is where udevtrigger gets called, followed by the load_input_modules
and create_dev_makedev functions, then...

	Waiting for /dev to be fully populated.

which is where udevsettle gets called.

None of the above appear to be exiting abnormally for the bad case, but
I'll definitely take a closer look at what MAKEDEV (/dev/MAKEDEV -->
/sbin/MAKEDEV) is doing.  In particular, Debian MAKEDEV is looking at
/proc/devices to decide what to do, so maybe "cat /proc/devices" would
be useful to look at for the broken case.

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
--

From: Bob Tracy
Date: Friday, December 7, 2007 - 10:05 pm

Yes, and there's something else I forgot to mention that may be
significant...  For the bad case, in addition to udevd, "ps -ef"
shows a "sh -e /lib/udev/net.agent" running with a PPID of 1.  This
process doesn't exit until I reboot.  If this is normal under the
circumstances, please disregard.

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
--

From: Kay Sievers
Date: Saturday, December 8, 2007 - 8:48 am

Does SysRq-T show where it hangs?

Kay

--

From: Michael Cree
Date: Saturday, December 8, 2007 - 5:51 pm

Ummm... No.  I didn't have the CONFIG_MAGIC_SYSRQ flag set, so I set it, 
and recompiled the kernel.  Guess what - now the system comes up 
normally without any problem.  The block devices appear in /dev.  To 
recap: without CONFIG_MAGIC_SYSRQ on the 2.6.24-rc3 kernel the missing 
block devices error in /dev occurs and the init scripts fall over on 
startup, and with CONFIG_MAGIC_SYSRQ the system comes up normally.

To answer the earlier questions about distro, and udev version, my 
system is similar to Bob's, except that I am running Debian 
testing/lenny which comes with udev version 114 (dpkg reports udev 
version 0.114-2).  I am running an EV67 variant CPU.

I do not run an initramfs - I have the necessary drivers for the various 
discs compiled into the kernel and use the root kernel option to point 
to the required root partition.

When running the broken kernel udev is running (according to 'ps') and 
executing /sbin/udevtrigger manually generates a number of errors of the 
form:

scsi_id[<pid>]: scsi_id: unable to access '/block'

The missing /dev/* entries do not appear.

Cheerz
Michael.
--

From: Ivan Kokshaysky
Date: Sunday, December 9, 2007 - 11:07 am

Incredible...

Toggling CONFIG_MAGIC_SYSRQ works for me too, so I'm finally able
to reproduce the problem (which is the main positive result so far ;-)

There are lots of possible reasons why this happens, but at the
moment I honestly have no idea.
For now I have reassigned the bug #9457 to myself and will gradually hack
into udev...

Ivan.
--

From: Bob Tracy
Date: Monday, December 10, 2007 - 8:08 am

Thanks...  Let me know if there's anything useful I can do to help.

--Bob T.
--

From: Ivan Kokshaysky
Date: Monday, December 10, 2007 - 4:12 pm

It turns out to be yet another strncpy() bug that indeed shows up only with
certain src/dst alignments and breaks kobject_get_path(). Ugh...

Hopefully I'll have a patch tomorrow.

Ivan.
--

From: Bob Tracy
Date: Thursday, December 6, 2007 - 10:42 pm

Verified that 6f37ac793d6ba7b35d338f791974166f67fdd9ba is the next
commit after the "good" kernel I'm running now.  The build is running,
and I should have an answer for us in a few hours.

-- 
------------------------------------------------------------------------
Bob Tracy          |  "They couldn't hit an elephant at this dist- "
rct@frus.com       |   - Last words of Union General John Sedgwick,
                   |  Battle of Spotsylvania Court House, U.S. Civil War
------------------------------------------------------------------------
--

From: Ingo Molnar
Date: Friday, December 7, 2007 - 2:33 am

the bisection log looks healthy so far - with nicely alternating 
good/bad bisection points. Barring the possibility that the bug is 
non-deterministic, i'd guess the bisection points are OK, at least 
judging from their statistical properties.

but ... i went over the diffs too, and i fail to see how they could 
affect the bootup path of an Alpha box, which i suspect has no 
networking dependency up to the failure point.

	Ingo
--

From: Rafael J. Wysocki
Date: Thursday, December 6, 2007 - 5:44 pm
