
db2088 rebooted itself and came back sick
Closed, Resolved · Public

Description

At about 00:50:01 on Aug 26, a bunch of pages fired for db2088:

PROBLEM - Host db2088 is DOWN: PING CRITICAL - Packet loss = 100%
7:55 PM RECOVERY - Host db2088 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
7:57 PM PROBLEM - MariaDB Slave IO: s2 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect
7:57 PM PROBLEM - MariaDB Slave SQL: s1 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect
7:57 PM PROBLEM - MariaDB Slave SQL: s2 on db2088 is CRITICAL: CRITICAL slave_sql_state could not connect
7:58 PM PROBLEM - MariaDB read only s2 on db2088 is CRITICAL: Could not connect to localhost:3312
7:58 PM PROBLEM - MariaDB read only s1 on db2088 is CRITICAL: Could not connect to localhost:3311
7:58 PM PROBLEM - mysqld processes on db2088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
7:58 PM PROBLEM - MariaDB Slave IO: s1 on db2088 is CRITICAL: CRITICAL slave_io_state could not connect
8:05 PM PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect
8:05 PM PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag could not connect

It seems to have come back up, but it's in a very strange state. In particular, the syslog contains no messages at all post-crash. If this is standard systemd behavior, it's new to me.

In theory this box needs its mariadb services started, but we should figure out how broken the box is first.
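
Something like the following is probably the sequence once the box is trusted again - just a sketch, and the per-section unit names are my guess from the s1/s2 instances on this host:

# sanity-check the box first
systemctl --failed
dmesg -T | tail -n 50
systemctl status mariadb@s1 mariadb@s2

# only once we're happy with the hardware:
systemctl start mariadb@s1
systemctl start mariadb@s2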

Event Timeline

I don't see anything in the syslog to warn about the coming crash... it just stops dead at 00:50:01. Same for /var/log/messages.
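
If these hosts keep a persistent journal, the previous boot's messages can still be pulled from journald, which occasionally catches something syslog misses - a quick sketch:

journalctl --list-boots
journalctl -b -1 -e    # end of the previous boot's log, if it was persisted
journalctl -b -1 -k    # kernel messages only from the previous boot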

Since this isn't an active db server (as far as I know nothing in codfw is), I'm going to let this be in case someone else wants to investigate. I predict that after a reboot it will mysteriously start working again.

Change 455383 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2088.yaml: Disable notifications

https://gerrit.wikimedia.org/r/455383

Change 455383 merged by Marostegui:
[operations/puppet@production] db2088.yaml: Disable notifications

https://gerrit.wikimedia.org/r/455383

Change 455384 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2088

https://gerrit.wikimedia.org/r/455384

Change 455384 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2088

https://gerrit.wikimedia.org/r/455384

Mentioned in SAL (#wikimedia-operations) [2018-08-26T05:53:38Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool db2088 - T202822 (duration: 00m 54s)

Marostegui triaged this task as Medium priority.
Marostegui added a project: ops-codfw.
Marostegui moved this task from Triage to In progress on the DBA board.

Thanks a lot for triaging this @Andrew.
Unfortunately the HW logs look empty, but this crash looks really similar to the crashes we've had when the storage fails:

/admin1/system1/logs1/log1-> show record1

	properties
		CreationTimestamp = 20170330142514.000000-300
		ElementName = System Event Log Entry
		RecordData = Log cleared.
		RecordFormat = string Description
		RecordID = 1
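
For completeness, the SEL and the iDRAC's own log can also be dumped via racadm, which sometimes shows entries the SMCLP view above doesn't (the mgmt hostname below is our usual convention, not verified here):

ssh root@db2088.mgmt.codfw.wmnet racadm getsel      # System Event Log
ssh root@db2088.mgmt.codfw.wmnet racadm getraclog   # iDRAC log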

The RAID looks good though:

root@db2088:/srv/sqldata.s1# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.635 TB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 3.635 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 10
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 10

So it could well have been the RAID controller itself - not the first time we've seen that.
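
If we want to dig a bit deeper on the controller side, the per-drive error counters and the adapter's own event log can be pulled as well (a sketch - the exact megacli flags can vary between versions):

megacli -PDList -aAll | grep -E 'Slot|Firmware state|Media Error|Predictive'
megacli -AdpEventLog -GetEvents -f /root/adp-events.log -aAll
megacli -AdpBbuCmd -GetBbuStatus -aAll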

I have disabled notifications and depooled the host from codfw (it is indeed an active db server, but I have depooled it for consistency with our MW config).
Also, I have left MySQL services stopped - they'll need a data check after we start them up to make sure nothing is really broken.
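
For the data check, a generic way to spot-check a table against its master is something like the below (just a sketch with placeholder host/table names, not the exact procedure we'll use):

# run the same statement on the master and on db2088, then diff the output
# (EXTENDED does a full table scan, so it is slow on big tables)
mysql -h MASTER_HOST -e "CHECKSUM TABLE enwiki.revision EXTENDED"
mysql -h 127.0.0.1 -P 3311 -e "CHECKSUM TABLE enwiki.revision EXTENDED"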

@Papaul can we upgrade BIOS and firmware, just in case this happens again and we need to contact the vendor? The server is depooled and MySQL is stopped, so this can be done anytime you get to the DC.

Thanks!

Thanks @Andrew for taking the time, I owe you a drink of your preference next time we meet.

BIOS upgrade from version 2.4.3 to 2.8.0
IDRAC upgrade from version 2.40 to 2.60
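
The new versions can be double-checked from the OS side with something like this (a sketch, assuming dmidecode and ipmitool are installed on the host):

dmidecode -s bios-version
ipmitool mc info | grep 'Firmware Revision'   # BMC/iDRAC firmware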

Thanks Papaul.

I have upgraded the kernel and MySQL, and started it.
Once it has caught up I will do a data check before repooling it.
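
To check whether it has caught up, the usual thing is to look at the lag per instance, e.g. (connection details here are an assumption):

mysql -h 127.0.0.1 -P 3311 -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'
mysql -h 127.0.0.1 -P 3312 -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'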

I have started comparing the main tables for s1 and s2, as the server has already caught up.

s1 finished checking - all good.

Going to repool this host now.

Change 456072 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2088.yaml: Enable notifications

https://gerrit.wikimedia.org/r/456072

Change 456072 merged by Marostegui:
[operations/puppet@production] db2088.yaml: Enable notifications

https://gerrit.wikimedia.org/r/456072

Mentioned in SAL (#wikimedia-operations) [2018-08-29T05:46:36Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2088 - T202822 (duration: 00m 55s)