Discussion:
[Beowulf] RAID5 rebuild, remount with write without reboot?
mathog
2017-09-05 17:28:03 UTC
Short form:

An 8-disk (all 2 TB SATA) RAID5 on an LSI MR-USAS2 SuperMicro controller
(lspci shows "LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon]")
system was long ago configured with a small partition of one disk as
/boot and logical volumes for / (root) and /home on a single large
virtual drive on the RAID. Due to disk problems and an own goal (see
below) the array went into a degraded=1 state (as reported by megacli)
and write locked both root and home. When the failed disk was replaced
and the rebuild completed, those were both still write locked. "mount
-a" didn't help in either case. A reboot brought them up normally, but
ideally that should not have been necessary. Is there a method to
remount the logical volumes writable that does not require a reboot?

Long form:

Periodic testing of the disks inside this array turned up pending
sectors with this command:

smartctl -a /dev/sda -d sat+megaraid,7
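
A minimal sketch of such a periodic check, assuming the device IDs run 0
through 7 on this controller, is a loop like:

for n in $(seq 0 7); do
    echo "=== megaraid device ID $n ==="
    smartctl -a /dev/sda -d sat+megaraid,$n | grep -iE 'serial number|current_pending_sector'
done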

A replacement disk was obtained and the usual replacement method
applied:

megacli -pdoffline -physdrv[64:7] -a0
megacli -pdmarkmissing -physdrv[64:7] -a0
megacli -pdprprmv -physdrv[64:7] -a0
megacli -pdlocate -start -physdrv[64:7] -a0
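
Before pulling anything, the drive about to come out can be double checked
with something along these lines (assuming MegaCli's usual -PDInfo output
fields):

megacli -pdinfo -physdrv[64:7] -a0 | grep -iE 'slot|device id|firmware state|inquiry'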

The disk with the flashing light was physically swapped. The smartctl
command was run again and unfortunately its values were unchanged. I had
always assumed that the "7" in that smartctl command was a physical slot;
it turns out that it is actually the "Device ID". In my defense, the
smartctl man page does a very poor job of describing this:
megaraid,N - [Linux only] the device consists of one or more SCSI/SAS disks
connected to a MegaRAID controller. The non-negative integer N (in the
range of 0 to 127 inclusive) denotes which disk on the controller is
monitored. Use syntax such as:

In this system, unlike the others I had worked on previously, Device ID
and slots were not 1:1.
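
One way to see the actual mapping, assuming MegaCli's usual PDList field
names, is something like:

megacli -pdlist -a0 | grep -iE 'enclosure device id|slot number|^device id|firmware state|inquiry data'

which prints the enclosure/slot pair, the Device ID, the state, and the
model/serial of each disk, so the two numberings can be lined up before
anything is swapped.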

Anyway, about a nanosecond after this was discovered, the disk at Device
ID 7 was marked as Failed by the controller, whereas previously it had
been "Online, Spun Up".
Ugh. At that point the logical volumes were all set read only and the OS
became barely usable, with commands like "more" no longer functioning.
Megacli and sshd, thankfully, still worked. Figuring that I had nothing
to lose, I removed the replacement disk from slot 7 and put back the
original, hopefully still good, disk. That put the system into this
state:

slot 4 (device ID 7) failed.
slot 7 (device ID 5) is Offline.

and

megacli -PDOnline -physdrv[64:7] -a0

put it at

slot 4 (device ID 7) failed.
slot 7 (device ID 5) Online, Spun Up

The logical volumes were still read only, but "more" and most other
commands now worked again. Megacli still showed the "degraded" value as
1. I'm still not clear how the two "read only" states differed so as to
cause this change.

At that point the failed disk in slot 4 (not 7!) was replaced with the
new disk (which had been briefly in slot 7) and it immediately began to
rebuild. Something on the order of 48 hours later that rebuild
completed, and the controller set "degraded" back to 0. However, the
logical volumes were still readonly. "mount -a" didn't fix it, so the
system was rebooted, which worked.


We have two of these backup systems. They are supposed to have
identical contents but do not. Fixing that is another item on a long
todo list. RAID 6 would have been a better choice for this much
storage, but it does not look like this card supports it:

RAID0, RAID1, RAID5, RAID00, RAID10, RAID50, PRL 11, PRL 11 with
spanning,
SRL 3 supported, PRL11-RLQ0 DDF layout with no span,
PRL11-RLQ0 DDF layout with span

That rebuild is far too long for comfort. Had another disk failed in
those two days, that would have been it. Neither controller has battery
backup, and the one in question is not even on a UPS, so a power glitch
could be fatal too. Not a happy thought while record SoCal temperatures
persisted throughout the entire rebuild! The systems are in different
buildings on the same campus, sharing the same power grid. There are no
other backups for most of this data.

Even though the controller shows this system as no longer degraded,
should I believe that there was no data loss? I can run checksums on
all the files (even though it will take forever) and compare the two
systems. But as I said previously, the files were not entirely 1:1, so
there are certainly going to be some files on this system which have no
match on the other.
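
A minimal sketch of that comparison, assuming both copies live under /home
and md5sum is good enough, might be:

# build a sorted checksum list keyed by relative path (run on each system)
cd /home && find . -type f -print0 | xargs -0 md5sum | sort -k 2 > /tmp/home.md5

# copy one list across, then report entries that differ or are missing
diff /tmp/home.md5 /tmp/home.other.md5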

Regards,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
John Hearns via Beowulf
2017-09-05 17:43:14 UTC
David, I have never been in that situation. However, I have configured my
fair share of LSI controllers, so I share your pain!
(I reserve my real tears for device mapper RAID.)

How about a "mount -o remount"? Did you try that before rebooting?

I am no expert here - in the past, when I have had non-RAID systems whose
disks went read-only, the only cure was a reboot.
Someone who is more familiar with how the kernel behaves when it has
decided that a device is not writeable should please correct me.
I would guess that a rescan-scsi-bus would have no effect - the disk is
still there!
Post by mathog
Is there a method to remount the logical volumes writable that does not
require a reboot?
Andrew Latham
2017-09-05 17:52:30 UTC
Without a power cycle, updating the drive firmware would be the only
method of tricking the drives into a power cycle. Obviously very risky. A
reboot should be low risk.
Post by mathog
Is there a method to remount the logical volumes writable that does not
require a reboot?
--
- Andrew "lathama" Latham ***@gmail.com http://lathama.com <http://lathama.org> -
Peter St. John
2017-09-05 17:58:51 UTC
Aren't the drives in the RAID hot-swappable? Removing the defective drive
and installing a new one certainly cycled power on those two? But I'm weak
at hardware, and have never knowingly relied on firmware on a disk.
Post by Andrew Latham
Without a power cycle updating the drive firmware would be the only method
of tricking the drives into a power-cycle. Obviously very risky. A reboot
should be low risk.
Joe Landman
2017-09-05 18:06:54 UTC
Post by mathog
Is there a method to remount the logical volumes writable that does not
require a reboot?
Generally the FW would write lock it. A

mount -o remount,rw $path

may not clear this. I've found that I often need to do something akin to

echo "- - -" > /sys/class/scsi_host/host0/scan

for each scsi host bus (a loop over all of them is sketched below).
Another thing to try is to remove the driver and modprobe it again.
However, as your /boot and / are on it, this probably won't work well.
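
A minimal version of that rescan over every host bus, assuming the usual
sysfs layout, would be:

for scan in /sys/class/scsi_host/host*/scan; do
    echo "- - -" > "$scan"
done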

Reboot has this same effect though, so you did this sort of by default.

Regards,

Joe
--
Joe Landman
e: ***@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

mathog
2017-09-06 18:33:46 UTC
Post by Joe Landman
Post by mathog
Is there a method to remount the logical volumes writable
that does not require a reboot?
Generally the FW would write lock it. A
mount -o remount,rw $path
may not clear this. I've found that I need to often do something akin
to
echo "- - -" > /sys/class/scsi_host/host0/scan
for each scsi host bus. Another thing to try is to remove the driver
and modprobe it again. However, as your /boot and / are on it, this
probably won't work well.
/boot wasn't, only / was.

Is there a difference between "mount -o remount" and "mount -a" if the
partitions/logical volumes are already mounted "ro"? The system could read
/etc/fstab, which indicated that the mount should be rw. It isn't clear
to me from the man page what mount is supposed to do in that case.

Also, can anybody suggest what could possibly differ between

RAID degraded=1, logical volumes mounted ro, 1 disk failed, 1 disk
offline

and

RAID degraded=1, logical volumes mounted ro, 1 disk failed, all others
online

such that "cat" and "more" did not work in the former but did work in
the latter, while "ls" and "megacli" worked in both?

At this point I just have that "happy to be walking away" feeling about
the whole incident.

Thanks,

David Mathog
***@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
Christopher Samuel
2017-09-10 23:47:43 UTC
Post by mathog
Is there a difference between "mount -o remount" and "mount -a" if the
partitions/logical volumes are already mounted "ro"?
I think mount -a will only try to mount filesystems that are not already
mounted.

-a, --all
Mount all filesystems (of the given types) mentioned in fstab.

[...]

remount
Attempt to remount an already-mounted filesystem. This is
commonly used to change the mount flags for a filesystem,
especially to make a readonly filesystem writeable.
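
A minimal example of that remount, assuming /home was one of the
read-only mounts, would be:

mount -o remount,rw /home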
Post by mathog
At this point I have that just "happy to be walking away"
feeling about the whole incident.
+1 :-)

Glad to hear you survived..

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: ***@unimelb.edu.au Phone: +61 (0)3 903 55545
