At some point recently (I'm not sure when; it's headless, so I rarely see early boot messages), a machine of mine (beowulf) began spewing:
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: error opening /dev/md?*: No such file or directory
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
mdadm: No devices listed in conf file were found.
All of this comes from the initrd during boot.
Curiously, the arrays appear to come up correctly: the root device (RAID1) is found and mounted, the system boots, and apart from those messages everything seems fine. /proc/mdstat looks as it should:
md0 : active raid1 sdl1[3] sdk1[2]    <----- /boot
      2095040 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdl3[3] sdk3[2]    <----- /
      105775104 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sda[0]             <----- Not currently mounted, pending disk replacement.
      117154240 blocks super 1.2 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk
Array UUIDs all match as they should. I generated a new mdadm.conf and initramfs anyway, but the problem persists...
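For the record, the regeneration went something like this (a sketch; /usr/share/mdadm/mkconf is the Debian/Devuan helper, appending the output of 'mdadm --detail --scan' by hand works just as well):

/usr/share/mdadm/mkconf > /etc/mdadm/mdadm.conf   # rebuild the config from the live arrays
update-initramfs -u -k all                        # repack the initramfs with the new config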
Dropping into the initramfs shell with break=mount reveals something rather odd:
/proc/partitions is empty. No wonder mdadm can't find any devices.
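For anyone following along, that's: add break=mount to the kernel command line, then at the (initramfs) prompt:

cat /proc/partitions    # empty here
ls /dev/md*             # no array nodes yet either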
A quick poke about in scripts/local-block/mdadm (booting with "text" on the command line reveals local-block as the source of the squealing) turns up this:
#!/bin/sh

PREREQ="multipath"

prereqs()
{
    echo "$PREREQ"
}

case $1 in
# get pre-requisites
prereqs)
    prereqs
    exit 0
    ;;
esac

. /scripts/functions

# Poor man's mdadm-last-resort@.timer
# That kicks in 2/3rds into the ROOTDELAY
if [ ! -f /run/count.mdadm.initrd ]
then
    COUNT=0
    # Unfortunately raid personalities can be registered _after_ block
    # devices have already been added, and their rules processed, try
    # triggering again. See #830770
    udevadm trigger --action=add -s block || true
    wait_for_udev 10
else
    COUNT=$(cat /run/count.mdadm.initrd)
fi
COUNT=$((COUNT + 1))
echo $COUNT > /run/count.mdadm.initrd

# Run pure assemble command, even though we default to incremental
# assembly it is supported for users to export variables via
# param.conf such as IMSM_NO_PLATFORM. See #830300
mdadm -q --assemble --scan --no-degraded || true

MAX=30
if [ ${ROOTDELAY:-0} -gt $MAX ]; then
    MAX=$ROOTDELAY
fi
MAX=$((MAX*2/3))

if [ "$COUNT" = "$MAX" ]
then
    # Poor man's mdadm-last-resort@.service for incremental devices
    mdadm -q --run /dev/md?*
    # And last try for all others
    mdadm -q --assemble --scan --run
    rm -f /run/count.mdadm.initrd
fi
exit 0
Sure enough, 'mdadm -q --assemble --scan --no-degraded' and 'mdadm -q --run /dev/md?*' spit out the same errors I'm seeing in a normal (for some definition of normal) boot.
Running 'udevadm trigger --action=add -s block' populates /proc/partitions, after which mdadm is happy and all is well.
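For the record, the full by-hand sequence at the (initramfs) prompt (the settle is my addition; it just waits for the triggered events to be processed):

udevadm trigger --action=add -s block       # re-add the block devices
udevadm settle                              # wait for the rules to run
cat /proc/partitions                        # now populated
mdadm -q --assemble --scan --no-degraded    # assembles without complaint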
I'm not particularly familiar with the initramfs scripts, but it looks to me like that command should be run before mdadm tries to scan for devices? Right?
I did try rootdelay=10, but that makes no difference, and I'm not getting anything useful from the 'net at large on this one.
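(Set the usual way, for what it's worth; and going by the script above, ROOTDELAY only widens the retry window once it exceeds the hard-coded 30 counts, so rootdelay=10 was probably never going to change anything. "quiet" below is just a stand-in for whatever is already there:)

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=10"
update-grub   # then reboot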
Any idea what's going on here, or where I should be looking?
Once is happenstance. Twice is coincidence. Three times is enemy action. Four times is Official GNOME Policy.
Maybe this bug: https://bugs.devuan.org/cgi/bugreport.cgi?bug=483
You could try adding 'sleep 1' to /etc/init.d/eudev as indicated in message #10 or #20 or apply the patch in the last message.
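Something along these lines, i.e. a short pause between starting the daemon and the coldplug trigger (a sketch only, the actual script will differ between versions):

start)
    udevd --daemon                # however the script actually starts it
    sleep 1                       # the workaround from the bug report
    udevadm trigger --action=add  # coldplug after the pause
    ;;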
You could try adding 'sleep 1' to /etc/init.d/eudev as indicated
I could, but all this is going on in the initrd before init starts or /etc/init.d is available...
I'll do the same to the initrd script that starts udevd and see what happens, but it'll have to wait a little as I can't really take this box offline right now.
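Something like this, assuming the daemon is started from scripts/init-top/udev in the initramfs (the usual initramfs-tools location; the actual contents will differ):

udevd --daemon    # wherever the initrd script starts the daemon
sleep 1           # same workaround, initrd edition
udevadm trigger
udevadm settle

...followed by update-initramfs -u to actually pick the change up.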
It occurs to me that there were also some hardware changes which might have fallen in the "I just noticed it" window, and those could probably be reverted temporarily.
When I get a window long enough to risk borking the boot process and reverting backups without undue screaming, I'll have another poke about.
Okay, I think we can close this one. It's hardware.
The counterpart to that failed drive was taking 27s to initialise and appear on the bus. It's certainly not ideal for the initrd to spam scary errors instead of backing off and retrying, but it's hardly the initrd's fault either.
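(The 27s figure comes straight from the kernel timestamps, e.g.:

dmesg | grep -i 'attached scsi disk'   # compare the [ time] column across disks

...the slow one shows up a good 27 seconds after its siblings.)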
[rant]
Moral of the story: The Crucial BX500 series SSDs are utter crap.
Of that pair of SSDs, one is dead as a doornail (doesn't register on the bus), the other is clearly dying.
Both are less than a year old and still under warranty, but I'll be throwing them in the trash rather than returning them - the last thing I want is more of the same.
Both have a history of hitting ~70°C and thermally throttling under sustained write loads, despite the enclosure and the drives directly above them never breaking 26°C.
The nasty plastic casing never even gets warm, and now that I've dissected one I see somebody at Crucial thinks putting a thermal pad on the controller is a luxury. 70°C is listed as the max operating temperature and they never exceeded it, but I'll bet a cookie heat is why they died.
Crucial, your budget SSD line gets a solid F from me. The Kingston A400 series is not only cheaper, it's also built properly and doesn't constantly try to cook itself.
I wanted budget SSDs for that filesystem, and expected budget performance. What I didn't expect was something so shitty it can't even sustain the already mediocre performance numbers without overheating.
FWIW, a BX500 reporting 70°C writes at ~7MB/s.
[/rant]
Moral of the story: The Crucial BX500 series SSDs are utter crap.
That's disappointing. I actually own a couple of Crucial MX500s, and they've yet to disappoint me in the two years I've been using them. Temperature spikes have never been an issue, and it's been a pretty bad summer in my area. The HP SSDs I've owned, however, had already developed multiple bad sectors within a few months of use. Those are the cheaply produced SSDs (not surprised, since HP is a shit brand).
I might try out a Kingston SSD next time. I do use their HyperX Fury line for DDR3 memory, which is pretty nice and fast.
I actually own a couple of Crucial MX500s, and they've yet to disappoint me in the two years I've been using them.
MX is the mid-range line, and I haven't heard of any problems with them either.
Temperature spikes have never been an issue, and it's been a pretty bad summer in my area.
This was no spike; this was pegging at 60-70°C continuously, and it only took about 50GB of sustained sequential writes to get there.
I probably got ~5x the drive's capacity in total writes before the first one croaked.
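(Reproducing the throttling is trivial, for anyone curious: a sustained direct write while polling the SMART temperature attribute, something like the below, with /mnt/bx500 and sdX standing in for the real mount point and device:

dd if=/dev/zero of=/mnt/bx500/testfile bs=1M count=51200 oflag=direct status=progress &
watch -n 5 'smartctl -A /dev/sdX | grep -i temp'

...then watch the throughput collapse as the temperature climbs.)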
HP is a shit brand
HP make pretty decent server-grade stuff, but their consumer products are total garbage, always have been.
That said, those SSDs were almost certainly OEM jobs, and good luck figuring out who really made them.
I might try out a Kingston SSD next time.
I have several (including a pair being thrashed as cache drives), and I have absolutely no problems to report.
Even the DRAM-less budget models seem to be okay, if understandably slow as molasses. And yes, even the very cheapest have metal cases and thermal pads.
Just be aware that they're one of the (several) brands known to lie about performance; if stated speeds seem too good for the price, that's exactly what they are.
Horses for courses. If a drive is rated for 70°C, I'd not want to load it so it runs at that continuously.
I have a BX500 myself in my email server, but it's very lightly, if continuously, loaded and currently sits at 28°C (max 31°C). Hopefully it will continue to provide good service. Maybe I should add disk monitoring (currently I just monitor the CPU).
Horses for courses. If a drive is rated for 70°C, I'd not want to load it so it runs at that continuously.
I fully expect any drive to be able to operate at 100% load continuously, provided its external environmental limits are maintained.
It's over to the manufacturer to ensure that internal components don't overheat in said external environment.
The manual does not say "never write more than 100GB/hr", nor does it say "use only in LN2 cooled enclosures". It just boasts about transfer speeds, and those are mostly lies anyway.
A drive that overheats under load even when more than adequately cooled is a pile of junk, end of story. Even with a plastic case, those SSDs more than likely would have survived had Crucial spent the few cents to add a thermal transfer pad.
These were in a fan-cooled aluminium drive cage, and that cage was maintaining a ~25°C environment. The case of the SSD itself couldn't have been much above that.
Other drives in the same cage, including an OCZ model from 2013 and possibly the cheapest Kingston ever made, have never exceeded 35°C under exactly the same workload.
I wasn't writing to them at anywhere near the fictitious "up to 500MB/s" specification either, not even 1/4 that in fact.
All I was doing was moving 100GB-ish batches of ~5MB files to the drives and moving them off again. Not "typical desktop usage", but certainly nothing I wouldn't expect any old drive to handle.
In short, it's Crucial's (complete lack of) internal thermal design that killed these, not my usage.
They're literally just a plastic box with a bare PCB rattling around inside. No screws, no heatsinking, nothing. Not even a bit of extra copper on the board for thermal mass.
I own USB3 pen drives that have thermal pads on the chips FFS.
Maybe I should add disk monitoring
Smartd is good for many things, temperature included.
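A one-liner in /etc/smartd.conf will do it (thresholds here are just examples):

# log temperature changes of 4°C, note at 45°C, warn and mail root at 55°C
DEVICESCAN -a -W 4,45,55 -m root

...then restart smartd.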