
db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart
Closed, Resolved, Public

Description

We have found an issue with db1061's server_id.

Right now the server_id of the running MySQL instance on db1061 differs from the one in its config file:

root@db1061:~# cat /etc/my.cnf | grep server_id
server_id  = 171974883

root@db1061:~# mysql --skip-ssl -e "select @@server_id"
+-------------+
| @@server_id |
+-------------+
|   171978766 |
+-------------+
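The same comparison can be scripted. The following is a minimal Python sketch, not part of the original task: `parse_server_id` and `has_drift` are hypothetical helper names, and the live value is passed in as a number rather than queried from MySQL.

```python
import re
from typing import Optional

# Matches "server_id = N" or "server-id = N" (MySQL accepts both spellings).
_SERVER_ID_RE = re.compile(r"^\s*server[_-]id\s*=\s*(\d+)\s*$", re.MULTILINE)


def parse_server_id(cnf_text: str) -> Optional[int]:
    """Return the first server_id set in my.cnf-style text, or None."""
    match = _SERVER_ID_RE.search(cnf_text)
    return int(match.group(1)) if match else None


def has_drift(cnf_text: str, live_server_id: int) -> bool:
    """True when the configured server_id differs from the live one."""
    configured = parse_server_id(cnf_text)
    return configured is not None and configured != live_server_id


if __name__ == "__main__":
    cnf = "server_id  = 171974883\n"  # line from /etc/my.cnf in this task
    live = 171978766                   # value returned by select @@server_id
    print(has_drift(cnf, live))        # prints True
```

A check like this only reports the mismatch; since the running instance keeps the value it started with, clearing the mismatch still needs the restart discussed in this task.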

The live server_id was generated when db1061 still had the IP 10.64.48.14 (https://gerrit.wikimedia.org/r/#/c/117358/2/templates/wmnet).

However, at some point the host was probably moved and got its current IP, which generates the server_id value that is now in my.cnf. Puppet was probably not run before MySQL was started, so MySQL came up with the old value: 171978766
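The numbers in this task are consistent with the server_id being the host's IPv4 address read as a 32-bit integer. The linked template is not quoted here, so treat that as an inference. A minimal Python sketch of the arithmetic, with hypothetical function names:

```python
import ipaddress


def server_id_from_ip(ip: str) -> int:
    """Interpret a dotted-quad IPv4 address as an unsigned 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def ip_from_server_id(server_id: int) -> str:
    """Inverse mapping: turn a 32-bit integer back into a dotted quad."""
    return str(ipaddress.IPv4Address(server_id))


if __name__ == "__main__":
    # 10*2**24 + 64*2**16 + 48*2**8 + 14
    print(server_id_from_ip("10.64.48.14"))  # prints 171978766
```

Under that assumption, any host that later receives 10.64.48.14 ends up with the same generated server_id, which is how two hosts can collide.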

The problem is that db1125 (sanitarium) now has 10.64.48.14 and therefore 171978766 as its server_id. When it tries to replicate from db1061 it cannot apply any writes, because a replica ignores events that carry its own server_id.
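The effect can be illustrated with a toy model. This is a simplified Python sketch of the filtering rule only, not MySQL's actual implementation; the `Event` class and `events_applied` function are hypothetical names, and the statements are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    server_id: int  # server_id of the server that originally wrote the event
    statement: str


def events_applied(replica_server_id: int, events: List[Event]) -> List[Event]:
    """Toy model: drop every event stamped with the replica's own server_id."""
    return [e for e in events if e.server_id != replica_server_id]


if __name__ == "__main__":
    db1125_id = 171978766
    # Events written while db1061 was running with the stale server_id.
    stream = [Event(171978766, "UPDATE ..."), Event(171978766, "INSERT ...")]
    print(len(events_applied(db1125_id, stream)))  # prints 0
    # The same events once db1061 stamps them with its configured server_id.
    fixed = [Event(171974883, e.statement) for e in stream]
    print(len(events_applied(db1125_id, fixed)))   # prints 2
```

Skipping events with a matching server_id is MySQL's default behaviour (it exists to prevent loops in circular replication), which is why the collision silently drops writes instead of raising an error.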

So during the read_only window for T187962 we should try to:

  • Stop MySQL
  • apt full-upgrade
  • Run puppet (so the new socket is picked up)
  • Start MySQL
  • Run mysql_upgrade (can be done anytime, no need to do it during the maintenance window)

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.

Change 435182 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] sanitarium_multi: Hardcode db1125 server_id

https://gerrit.wikimedia.org/r/435182

Change 435182 merged by Marostegui:
[operations/puppet@production] sanitarium_multi: Hardcode db1125 server_id

https://gerrit.wikimedia.org/r/435182

Change 435757 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1061: Upgrade socket location

https://gerrit.wikimedia.org/r/435757

Mentioned in SAL (#wikimedia-operations) [2018-05-29T09:13:00Z] <marostegui> Downtime s6 replicas for 4 hours - T195595

Change 435757 merged by Marostegui:
[operations/puppet@production] db1061: Upgrade socket location

https://gerrit.wikimedia.org/r/435757

This restart has been done, current values:

+------------+
| @@hostname |
+------------+
| db1061     |
+------------+
1 row in set (0.00 sec)

+-------------+
| @@server_id |
+-------------+
|   171974883 |
+-------------+
1 row in set (0.00 sec)

root@db1061:~# cat /etc/my.cnf | grep server_id
server_id  = 171974883

Marostegui claimed this task.

This is all done, including restarting db1125:s6 to pick up the new server_id.

Vvjjkkii renamed this task from db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart to 49baaaaaaa. Jul 1 2018, 1:07 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Marostegui as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from 49baaaaaaa to db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart. Jul 1 2018, 7:54 PM
Marostegui closed this task as Resolved.
Marostegui claimed this task.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.