Discussion:
[Beowulf] Defective Mellanox EDR Switches
Ryan Novosielski
2018-06-07 21:41:05 UTC
On Thu, 7 Jun 2018 03:12:43 +0000
One slight correction: 100% of our switches with FRU PN 00WE097/PN
00WE096Y manufactured on 2016-11-28 (quantity 3) have failed, and one
same FRU PN/PN manufactured on 2016-12-15 too. We have another switch
with FRU PN 00WE093/PN 00WE092Y that was manufactured on 2016-11-28
that has so far been OK, but I’m now suspicious of it.
Thanks for the heads up.
To make this data point more valuable, can you add total numbers? That
is, how many (similar) switches in total, how many bad/good. And for
how long did they run before exhibiting the problem.
Sure, Peter.
We only have 6 SB7890 switches currently. All were purchased through Lenovo, and all have Lenovo machine types of 0724-HD6. I don’t think this has much to do with Lenovo, though, apart from reselling them. One of the four that failed is actually a replacement for a physically damaged switch (bad port latch), so that means there is even bad replacement inventory out there. All of the 4 aforementioned 00WE096Y switches have failed, 3 manufactured on 2016-11-28 and 1 manufactured on 2016-12-15. I don’t have an exact date for the first failure or the switch installation, but the failure occurred roughly a year after the manufacturing date. My guess is that they were in service for about 9-10 months, but I can probably narrow that down with a little more effort if it matters.
The other two are FRU PN 00WE093/PN 00WE092Y (Lenovo MT 0724-HD5). So far so good on those, though I’m now suspicious of the one manufactured on 2016-11-28.
Additionally, we have two SB7800 switches — FRU PN: 00WE085/PN 00WE084Y. Too new to tell on those — only a few weeks in service. Both were manufactured on 2018-01-08.
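As a rough tally of the inventory described above, here is a minimal Python sketch that counts failures per part number. The names `Switch`, `INVENTORY`, and `failure_summary` are hypothetical and exist only for illustration. The 2016-11-28 date for the three failed units follows the quoted correction at the top of this message, and the manufacture date of the second 00WE092Y unit is not given in the thread, so it is left as `None`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Switch:
    model: str
    part_number: str
    manufactured: Optional[str]  # ISO date; None where the thread gives no date
    failed: bool


# Inventory as reported in this message (illustrative encoding, not vendor data).
INVENTORY = [
    Switch("SB7890", "00WE096Y", "2016-11-28", True),
    Switch("SB7890", "00WE096Y", "2016-11-28", True),
    Switch("SB7890", "00WE096Y", "2016-11-28", True),
    Switch("SB7890", "00WE096Y", "2016-12-15", True),
    Switch("SB7890", "00WE092Y", "2016-11-28", False),
    Switch("SB7890", "00WE092Y", None, False),  # date not stated (assumption)
    Switch("SB7800", "00WE084Y", "2018-01-08", False),
    Switch("SB7800", "00WE084Y", "2018-01-08", False),
]


def failure_summary(inventory):
    """Return {part_number: (failed_count, total_count)}."""
    totals = Counter(s.part_number for s in inventory)
    failed = Counter(s.part_number for s in inventory if s.failed)
    return {pn: (failed[pn], totals[pn]) for pn in totals}


if __name__ == "__main__":
    for pn, (bad, total) in sorted(failure_summary(INVENTORY).items()):
        print(f"{pn}: {bad}/{total} failed")
```

With a sample this small, the per-part-number counts are a data point rather than a failure rate.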
Upshot is an advance RMA on anything that has already shown symptoms (so 3 switches for us), and an on-site visit with a software fix to be applied to all other SB7800-class switches; they seem to think all are potentially affected. The sort of timeframe they gave is ~45 minutes of work on our 11 units.

Fun.

--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - ***@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'
Kilian Cavalotti
2018-06-08 05:14:27 UTC
Although I don't have the specifics at hand right now, I can confirm that
we've observed the same thing in our installation as well: a couple of SB7890
switches exhibited the same symptoms after about one year in
production.

We've also seen one SB7800 (i.e., managed) fail in the same way, and one
SB7890 that would revert all of its port links to FDR after a few hours.

As a comparison point, we had zero failures on a 48-switch FDR fabric that
has been in production for 4 years.

Cheers,
--
Kilian