This is example documentation for the fairytale blog post. It is for an imaginary legacy infrastructure and is not intended to be used.
Table of contents
- Table of contents
- Quick overview
- Current considerations
- Alerts
- How to
  - Fail-over
  - Connect to MySQL using the MySQL client
  - Recover from a replication failure
  - Overriding the health check
  - Promote or demote a server
  - How to reboot the slave
  - How to reboot the master
  - Check the status of the replication
  - Determine which query is currently being replicated
  - Skip the current query
Quick overview
Heading | Description |
---|---|
What | MySQL database containing users and drivers. |
Fault-tolerance | Master-slave. Slave can be promoted to master. |
Fail-over method | Manual. |
Hostnames | `(PROD\|STAG\|DEV)-DB-(M\|S)-01A` eg: `PROD-DB-M-01A` |
Serves | `(PROD\|STAG\|DEV)-APP-[0-9][0-9](A\|B)` eg: `PROD-APP-01A` |
Connect via | `(PROD\|STAG\|DEV)-DBPROXY-01(A\|B)` eg: `PROD-DBPROXY-01A` |
Upstream documentation | https://dev.mysql.com/doc/ |
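To make the overview concrete, connecting through the DB proxy might look like the sketch below. The hostname follows the "Connect via" pattern in the table; the port, user, and schema are assumptions made up for the example.

```sh
# Illustrative only: connect to production via the DB proxy.
# Hostname follows the pattern above; user, schema, and port are assumptions.
mysql -h PROD-DBPROXY-01A -P 3306 -u youruser -p yourschema
```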
Current considerations
- Fail-over being manual: this is temporary while we solve a recurring replication issue that could lead to split brain.
- Topology: Was previously master-master. Currently master-slave to simplify replication while we debug what’s wrong with it. We’ll revisit this once we know more.
Alerts
Replication lag
Possible causes
- High load.
- Normal load, but insufficient capacity.
- App bug.
- Malicious traffic.
- Broken replication.
Determining the replication problem
- Connect to the VPN.
- Connect to MySQL using the MySQL client.
- Check the replication status and see what state `Exec_Source_Log_Pos` and `Seconds_Behind_Source` are in (a sketch for sampling these follows the outcomes below).
Outcomes:
- If `Exec_Source_Log_Pos` is increasing and `Seconds_Behind_Source` is high: only high load.
- If `Exec_Source_Log_Pos` is not increasing: replication is broken.
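Under the same assumptions as the rest of this runbook (you are on the VPN and can reach the replica with your usual credentials), a rough sketch for telling the two outcomes apart is below. The hostname follows the Quick overview pattern and, like the 10-second sampling window, is an assumption for the example.

```sh
# Illustrative only: sample Exec_Source_Log_Pos twice, a few seconds apart,
# to distinguish "just high load" from "replication is broken".
# Hostname and credentials are assumptions; use whatever you normally use.
pos() {
  mysql -h PROD-DB-S-01A -e 'SHOW REPLICA STATUS\G' \
    | awk '/Exec_Source_Log_Pos/ {print $2}'
}
first=$(pos); sleep 10; second=$(pos)
if [ "$second" -gt "$first" ]; then
  echo "Exec_Source_Log_Pos is increasing: likely just high load."
else
  echo "Exec_Source_Log_Pos is not increasing: replication looks broken."
fi
```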
Fixing broken replication
See "Recover from a replication failure" under How to below.
How to
Fail-over
Connect to MySQL using the MySQL client
This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.
Recover from a replication failure
This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.
Overriding the health check
Force a failure:
- SSH to the relevant server.
- Run `touch /tmp/forceFailure`.

Allow the health check to operate normally:
- SSH to the relevant server.
- Run `rm /tmp/forceFailure`.
Background: If /tmp/forceFailure exists, the health check that HAProxy uses will fail. If you are unable to SSH to the machine, you can get similar results by manipulating HAProxy.
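The actual check script isn't documented here, but a hypothetical sketch of the behaviour described above (fail whenever the file exists, otherwise confirm MySQL answers) might look like this. The MySQL probe and the script itself are assumptions, not the real implementation:

```sh
#!/bin/sh
# Hypothetical sketch only -- not the real health check on these servers.
# Fail if the override file exists.
if [ -e /tmp/forceFailure ]; then
  echo "failed: /tmp/forceFailure present"
  exit 1
fi
# Otherwise check that MySQL answers a trivial query.
if ! mysql -e 'SELECT 1' >/dev/null 2>&1; then
  echo "failed: MySQL not responding"
  exit 1
fi
echo "ok"
```

If SSH is unavailable, HAProxy's runtime admin socket (assuming it is enabled on these proxies) offers a similar lever, e.g. issuing `disable server <backend>/<server>` over the socket.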
Promote or demote a server
This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.
How to reboot the slave
- SSH to the server.
- Run `sudo shutdown -r 0`.
How to reboot the master
NOTE: If you need to reboot the acting slave as well, do it first, to minimise the number of times you need to drain connections.
- Connect to the VPN.
- Promote the slave to master and demote the current master.
- Check that connections have moved to the original slave (see the sketch after this list).
- Follow the slave reboot steps.
- Promote the original master back to master and demote the acting master back to slave.
- Check that connections have moved back to the original master.
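One rough way to see where traffic has landed after a promote/demote is to compare connection counts on each server, as sketched below. The hostnames follow the Quick overview pattern and the credentials are assumptions.

```sh
# Illustrative only: compare client connection counts on master and slave.
# Hostnames follow the Quick overview pattern; credentials are assumptions.
for host in PROD-DB-M-01A PROD-DB-S-01A; do
  printf '%s: ' "$host"
  mysql -h "$host" -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';" \
    | awk 'NR==2 {print $2 " connections"}'
done
```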
Check the status of the replication
Run `SHOW REPLICA STATUS\G` a few times over a few seconds.
- Look at what `Exec_Source_Log_Pos` and `Seconds_Behind_Source` are doing. You’ll need this information for whatever instructions sent you here. A sketch for sampling them follows below.
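A minimal way to do the above in one go, assuming the usual VPN access, credentials, and the Quick overview hostname pattern:

```sh
# Illustrative only: sample the two fields a few times, a few seconds apart.
# Hostname and credentials are assumptions; use whatever you normally use.
for i in 1 2 3 4 5; do
  mysql -h PROD-DB-S-01A -e 'SHOW REPLICA STATUS\G' \
    | grep -E 'Exec_Source_Log_Pos|Seconds_Behind_Source'
  sleep 5
done
```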
Determine which query is currently being replicated
This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.
Skip the current query
- Check the replication status and see what state `Exec_Source_Log_Pos` and `Seconds_Behind_Source` are in.
- Run `SET GLOBAL sql_replica_skip_counter = 2;` followed by `START REPLICA;`.
- Check the replication status again over the next few minutes. At this point, `Exec_Source_Log_Pos` should have increased, and `Seconds_Behind_Source` should start decreasing over time.