We have found an issue with db1061's server_id.
Right now the server_id that MySQL is running with on db1061 differs from the one in its config file:
root@db1061:~# cat /etc/my.cnf | grep server_id
server_id = 171974883
root@db1061:~# mysql --skip-ssl -e "select @@server_id"
+-------------+
| @@server_id |
+-------------+
|   171978766 |
+-------------+
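The same comparison could be scripted so that other hosts can be checked for this kind of drift. The following is only a minimal sketch: the helper names `configured_server_id` and `server_id_mismatch` are hypothetical, and it assumes the live value has already been read separately (for example with the `select @@server_id` query above).

```python
import re


def configured_server_id(cnf_text):
    """Return the last server_id set in my.cnf-style text, or None if absent.

    MySQL option files accept both `server_id` and `server-id`, so both
    spellings are matched. The last occurrence wins.
    """
    value = None
    for line in cnf_text.splitlines():
        match = re.match(r"\s*server[-_]id\s*=\s*(\d+)\s*$", line)
        if match:
            value = int(match.group(1))
    return value


def server_id_mismatch(cnf_text, live_server_id):
    """True when the config file sets a server_id that differs from the live one."""
    configured = configured_server_id(cnf_text)
    return configured is not None and configured != live_server_id


# Values taken from the db1061 output above; the [mysqld] header is illustrative.
cnf = "[mysqld]\nserver_id = 171974883\n"
print(server_id_mismatch(cnf, 171978766))
```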
The live server_id (171978766) was generated when db1061 had the IP 10.64.48.14 (https://gerrit.wikimedia.org/r/#/c/117358/2/templates/wmnet)
However, at some point the host was probably moved and got the IP it has now, which generates the server_id value that is in my.cnf (171974883). Puppet probably wasn't run before MySQL was started, so MySQL kept the old value: 171978766
The problem is that db1125 (sanitarium) now has 10.64.48.14, and thus 171978766 as its server_id. When it tries to replicate from db1061 it cannot apply any writes, because a replica ignores events that carry its own server_id.
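The numbers above are consistent with the server_id being the host's IPv4 address read as a 32-bit integer. That is an assumption inferred from the 10.64.48.14 / 171978766 pair, not taken from the puppet template itself. A minimal sketch of the arithmetic, with hypothetical helper names:

```python
import ipaddress


def server_id_from_ip(ip):
    """IPv4 address read as a 32-bit integer (assumed server_id scheme)."""
    return int(ipaddress.IPv4Address(ip))


def ip_from_server_id(server_id):
    """Inverse mapping: 32-bit integer back to dotted-quad notation."""
    return str(ipaddress.IPv4Address(server_id))


# Old db1061 IP, now used by db1125: matches the live @@server_id on db1061.
print(server_id_from_ip("10.64.48.14"))  # 171978766

# Value currently in db1061's my.cnf, decoded under the same assumption.
print(ip_from_server_id(171974883))  # 10.64.32.227
```

Under that assumption, db1061 and db1125 end up with the same server_id until db1061 is restarted with the value from my.cnf.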
So during the read_only window for T187962 we should try to:
- Stop MySQL
- apt full-upgrade
- Run puppet (so the new socket is picked up)
- Start MySQL
- Run mysql_upgrade (can be done anytime, no need to do it during the maintenance window)
- Revert the puppet hacks that hardcode server_id: https://gerrit.wikimedia.org/r/#/c/435182/