This is example documentation for the fairytale blog post. It is for an imaginary legacy infrastructure and is not intended to be used.
Table of contents
- Table of contents
- Quick overview
- Current considerations
- Alerts
- How to
  - Fail-over
  - Connect to MySQL using the MySQL client
  - Recover from a replication failure
  - Overriding the health check
  - Promote or demote a server
  - How to reboot the slave
  - How to reboot the master
  - Check the status of the replication
  - Determine which query is currently being replicated
  - Skip the current query
Quick overview
Heading | Description |
---|---|
What | MySQL database containing users and drivers. |
Fault-tolerance | Master-slave. Slave can be promoted to master. |
Fail-over method | Manual. |
Hostnames | `(PROD\|STAG\|DEV)-DB-(M\|S)-01A` eg: `PROD-DB-M-01A` |
Serves | `(PROD\|STAG\|DEV)-APP-[0-9][0-9](A\|B)` eg: `PROD-APP-01A` |
Connect via | `(PROD\|STAG\|DEV)-DBPROXY-01(A\|B)` eg: `PROD-DBPROXY-01A` |
Upstream documentation | https://dev.mysql.com/doc/ |
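To make the overview concrete, connecting through the DB proxy might look like the sketch below. The hostname follows the "Connect via" pattern in the table; the port, user, and schema are assumptions made up for the example.

```sh
# Illustrative only: connect to production via the DB proxy.
# Hostname follows the pattern above; user, schema, and port are assumptions.
mysql -h PROD-DBPROXY-01A -P 3306 -u youruser -p yourschema
```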
Current considerations
- Fail-over being manual: this is temporary while we solve a recurring replication issue that could lead to split brain.
- Topology: Was previously master-master. Currently master-slave to simplify replication while we debug what’s wrong with it. We’ll revisit this once we know more.
Alerts
Replication lag
Possible causes
- High load.
- Normal load, but insufficient capacity.
- App bug.
- Malicious traffic.
- Broken replication.
Determining the replication problem
- Connect to the VPN.
- Connect to MySQL using the MySQL client.
- Check the replication status and see what state `Exec_Source_Log_Pos` and `Seconds_Behind_Source` are in (a sketch for sampling these follows the outcomes below).
Outcomes:
- If `Exec_Source_Log_Pos` is increasing and `Seconds_Behind_Source` is high: only high load.
- If `Exec_Source_Log_Pos` is not increasing: replication is broken.
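Under the same assumptions as the rest of this runbook (you are on the VPN and can reach the replica with your usual credentials), a rough sketch for telling the two outcomes apart is below. The hostname follows the Quick overview pattern and, like the 10-second sampling window, is an assumption for the example.

```sh
# Illustrative only: sample Exec_Source_Log_Pos twice, a few seconds apart,
# to distinguish "just high load" from "replication is broken".
# Hostname and credentials are assumptions; use whatever you normally use.
pos() {
  mysql -h PROD-DB-S-01A -e 'SHOW REPLICA STATUS\G' \
    | awk '/Exec_Source_Log_Pos/ {print $2}'
}
first=$(pos); sleep 10; second=$(pos)
if [ "$second" -gt "$first" ]; then
  echo "Exec_Source_Log_Pos is increasing: likely just high load."
else
  echo "Exec_Source_Log_Pos is not increasing: replication looks broken."
fi
```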
Fixing broken replication
See "Recover from a replication failure" under How to below.
How to
Fail-over
Connect to MySQL using the MySQL client
This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.
Recover from a replication failure
This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.
Overriding the health check
Force a failure:
- SSH to the relevant server.
- Run `touch /tmp/forceFailure`.

Allow the health check to operate normally:
- SSH to the relevant server.
- Run `rm /tmp/forceFailure`.
Background: If /tmp/forceFailure exists, the health check that HAProxy uses will fail. If you are unable to SSH to the machine, you can get similar results by manipulating HAProxy.
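The actual check script isn't documented here, but a hypothetical sketch of the behaviour described above (fail whenever the file exists, otherwise confirm MySQL answers) might look like this. The MySQL probe and the script itself are assumptions, not the real implementation:

```sh
#!/bin/sh
# Hypothetical sketch only -- not the real health check on these servers.
# Fail if the override file exists.
if [ -e /tmp/forceFailure ]; then
  echo "failed: /tmp/forceFailure present"
  exit 1
fi
# Otherwise check that MySQL answers a trivial query.
if ! mysql -e 'SELECT 1' >/dev/null 2>&1; then
  echo "failed: MySQL not responding"
  exit 1
fi
echo "ok"
```

If SSH is unavailable, HAProxy's runtime admin socket (assuming it is enabled on these proxies) offers a similar lever, e.g. issuing `disable server <backend>/<server>` over the socket.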
Promote or demote a server
This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.
How to reboot the slave
- SSH to the server.
- Run `sudo shutdown -r 0`.
How to reboot the master
NOTE: If you need to reboot the acting slave as well, do it first, to minimise the number of times you need to drain connections.
- Connect to the VPN.
- Promote the slave to master and demote the current master.
- Check that connections have moved to the original slave (see the sketch after this list).
- Follow the slave reboot steps.
- Promote the original master back to master and demote the acting master back to slave.
- Check that connections have moved back to the original master.
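One rough way to see where traffic has landed after a promote/demote is to compare connection counts on each server, as sketched below. The hostnames follow the Quick overview pattern and the credentials are assumptions.

```sh
# Illustrative only: compare client connection counts on master and slave.
# Hostnames follow the Quick overview pattern; credentials are assumptions.
for host in PROD-DB-M-01A PROD-DB-S-01A; do
  printf '%s: ' "$host"
  mysql -h "$host" -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';" \
    | awk 'NR==2 {print $2 " connections"}'
done
```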
Check the status of the replication
Run `SHOW REPLICA STATUS\G` a few times over a few seconds.
- Look at what `Exec_Source_Log_Pos` and `Seconds_Behind_Source` are doing. You’ll need this information for whatever instructions sent you here. A sketch for sampling them follows below.
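A minimal way to do the above in one go, assuming the usual VPN access, credentials, and the Quick overview hostname pattern:

```sh
# Illustrative only: sample the two fields a few times, a few seconds apart.
# Hostname and credentials are assumptions; use whatever you normally use.
for i in 1 2 3 4 5; do
  mysql -h PROD-DB-S-01A -e 'SHOW REPLICA STATUS\G' \
    | grep -E 'Exec_Source_Log_Pos|Seconds_Behind_Source'
  sleep 5
done
```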
Determine which query is currently being replicated
This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.
Skip the current query
- Check the replication status and see what state `Exec_Source_Log_Pos` and `Seconds_Behind_Source` are in.
- Run `SET GLOBAL sql_replica_skip_counter = 2;` followed by `START REPLICA;`.
- Check the replication status again over the next few minutes. At this point, `Exec_Source_Log_Pos` should have increased, and `Seconds_Behind_Source` should start decreasing over time.