cp3048 hardware issues
Closed, Duplicate · Public

Description

cp3048 is out of warranty and has a possibly broken mainboard. It's powered off and has been used as a "hardware donor" for other OOW hosts in esams.

Decom check list from wikitech:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role::spare::system if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)
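
The ordering constraint in the checklist above (everything between the START and END markers is meant to be done in one go) can be sketched as data. This is only an illustration, not the wikitech tooling: the names `Step`, `CHECKLIST`, and `safe_pause_points` are hypothetical, and the step descriptions are abbreviated from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    # True for steps listed between the START/END markers of the checklist.
    non_interruptible: bool = False


# Abbreviated copy of the checklist from the task description.
CHECKLIST = [
    Step("all system services confirmed offline from production use"),
    Step("icinga checks set to maint mode/disabled"),
    Step("removed from lvs/pybal active configuration"),
    Step("service group puppet/hiera/dsh config removed"),
    Step("removed from site.pp"),
    Step("disable puppet on host", non_interruptible=True),
    Step("remove all remaining puppet references", non_interruptible=True),
    Step("power down host", non_interruptible=True),
    Step("disable switch port", non_interruptible=True),
    Step("switch port assignment noted on task", non_interruptible=True),
    Step("remove production dns entries", non_interruptible=True),
    Step("puppet node clean, puppet node deactivate", non_interruptible=True),
    Step("system disks wiped (by onsite)"),
]


def safe_pause_points(steps):
    """Return the indexes i after which work may be paused.

    Pausing after steps[i] is treated as unsafe when both steps[i] and
    steps[i + 1] are non-interruptible, i.e. the block has been started
    but not yet finished.
    """
    points = []
    for i, step in enumerate(steps):
        nxt = steps[i + 1] if i + 1 < len(steps) else None
        inside_block = (
            step.non_interruptible
            and nxt is not None
            and nxt.non_interruptible
        )
        if not inside_block:
            points.append(i)
    return points
```

With this model, work can pause after any of the preparatory steps or after the last step of the block, but not partway through it.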

Event Timeline

BBlack triaged this task as Medium priority. Mar 24 2018, 2:52 AM
BBlack created this task.
Restricted Application added a subscriber: Aklapper.

It probably crashed today at 2018-04-12 13:31:20; the hardware logs should be checked.

As Chris confirmed, this is either due to CPU or memory.

Since it's a dual CPU machine, he suggests swapping the two CPUs and seeing whether the error follows the swap.
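
The reasoning behind the swap test can be written down as a small decision function. This is a sketch of the general diagnostic idea only; `interpret_cpu_swap` and its return strings are hypothetical and not part of any tool used on this host.

```python
def interpret_cpu_swap(failing_socket_before, failing_socket_after):
    """Interpret a CPU swap test on a dual-socket machine.

    Both arguments are the socket number (1 or 2) at which the error was
    reported, before and after the two CPUs were physically swapped.
    Pass None as the second argument if no error was seen after the swap.
    """
    if failing_socket_after is None:
        # The swap also reseats both CPUs, so a vanished error proves little.
        return "inconclusive: error did not recur"
    if failing_socket_after != failing_socket_before:
        # The error moved together with the physical CPU.
        return "error followed the CPU: suspect the CPU"
    # The error stayed at the same socket despite a different CPU sitting there.
    return "error stayed with the socket: suspect board or memory on that socket"
```

For example, `interpret_cpu_swap(1, 2)` reports that the error followed the CPU, while `interpret_cpu_swap(1, 1)` points at the board or the memory attached to that socket.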

More recently:

-------------------------------------------------------------------------------
Record:      105
Date/Time:   05/09/2018 22:48:53
Source:      system
Severity:    Critical
Description: The system board PS2 PG Fail voltage is outside of range.
-------------------------------------------------------------------------------
Record:      106
Date/Time:   05/16/2018 10:55:57
Source:      system
Severity:    Critical
Description: CPU 1 has an internal error (IERR).
-------------------------------------------------------------------------------

Which means it's most likely the mainboard that needs replacement.
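
For checking the hardware log in bulk, records in the layout pasted above can be parsed with a few lines of Python. This is a minimal sketch that assumes the log always uses the dashed separator and `Key: value` layout shown in this task; `parse_records` and `critical_records` are hypothetical helper names, and the sample text is copied from the excerpt above.

```python
import re
from datetime import datetime

SEPARATOR = re.compile(r"^-{5,}\s*$")   # a line made only of dashes
FIELD = re.compile(r"^([^:]+):\s*(.*)$")  # "Key:   value", split at the first colon

SAMPLE = """\
-------------------------------------------------------------------------------
Record:      105
Date/Time:   05/09/2018 22:48:53
Source:      system
Severity:    Critical
Description: The system board PS2 PG Fail voltage is outside of range.
-------------------------------------------------------------------------------
Record:      106
Date/Time:   05/16/2018 10:55:57
Source:      system
Severity:    Critical
Description: CPU 1 has an internal error (IERR).
-------------------------------------------------------------------------------
"""


def parse_records(text):
    """Split the log text into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            # A separator closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1).strip()] = match.group(2).strip()
    if current:
        records.append(current)
    return records


def critical_records(records):
    """Return (timestamp, description) pairs for Critical entries."""
    return [
        (datetime.strptime(r["Date/Time"], "%m/%d/%Y %H:%M:%S"), r["Description"])
        for r in records
        if r.get("Severity") == "Critical"
    ]
```

The timestamps are read as month/day/year, which the second record (05/16/2018) confirms.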

Either way, this system is out of warranty. I may sacrifice it and use it as a source of parts (SSD, DIMMs) for other broken machines...

Mentioned in SAL (#wikimedia-traffic) [2018-07-04T11:30:17Z] <ema> shutdown cp3048 and cp3034 (both already depooled) for hardware maintenance T190607 T189305

Because cp3048 is out of warranty and it's unlikely we can get it fixed, I've used it as a parts donor for other broken systems:

  • Its second drive (sdb) has been taken for use in cp3043:sdb; the drive carrier is now empty
  • Its B3 DIMM module has been swapped with cp3034 B3, and it may be bad (or may work now after essentially a reseat)
  • CPU1 is still likely having issues, which may or may not be due to the power supplied by the mainboard.

Change 444169 had a related patch set uploaded (by Ema; owner: Muehlenhoff):
[operations/puppet@production] Remove cp3048 from Hiera/conftool

https://gerrit.wikimedia.org/r/444169

Change 444169 merged by Muehlenhoff:
[operations/puppet@production] Remove cp3048 from Hiera/conftool

https://gerrit.wikimedia.org/r/444169

Change 444171 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove cp3048 from site.pp/DHCP config

https://gerrit.wikimedia.org/r/444171

Change 444171 merged by Muehlenhoff:
[operations/puppet@production] Remove cp3048 from site.pp/DHCP config

https://gerrit.wikimedia.org/r/444171

Change 444560 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/dns@master] Remove cp3048 prod DNS entries

https://gerrit.wikimedia.org/r/444560

Change 444560 merged by Muehlenhoff:
[operations/dns@master] Remove cp3048 prod DNS entries

https://gerrit.wikimedia.org/r/444560