This is example documentation for the fairytale blog post. It is for an imaginary legacy infrastructure and is not intended to be used.
We have several forms of data storage within our infrastructure, for different purposes. Originally we tried to do everything in Cassandra. And while we still use Cassandra for some purposes, like points, it simply wasn't the right tool for some of our storage needs, like driver and user information. We therefore have servers (named "DB") running MySQL for driver and user information that are separate from other storage servers, which are named differently.
We originally ran these DB servers in a master-master topology, but changed to a master-slave topology to simplify things while we try to understand an ongoing replication issue. We'll re-assess the topology once this is solved. When replication breaks, it breaks in one direction, but not necessarily the other direction. And it breaks silently. For this reason we have an alert for replication lag, which tests how far the slave is behind the master. If the lag gets too great, an alert will fire. This can also happen if the infrastructure is under heavy load, like on a Friday night.

With that in mind, it's important to check whether replication is broken or just slow by running SHOW REPLICA STATUS\G a few times. If Exec_Source_Log_Pos is increasing, it's slow. If Exec_Source_Log_Pos is not changing and Seconds_Behind_Source is increasing, then replication is broken. If it's slow, you need to look at other metrics throughout our infrastructure to determine whether it's an urgent or a non-urgent capacity issue, and make decisions accordingly. If replication is indeed stopped, you need to determine why. It could be that it has been manually stopped (check the team chat). At the moment it is more common that it is stuck on a particular row. While it's important that the data is correct, it's more important that replication doesn't fall too far behind. Therefore the current solution is to take note of the broken record, and then skip it. That way we can fix the inconsistency on the next working day, and dig into why it failed. You can do this with SET GLOBAL sql_replica_skip_counter = 1; START REPLICA; and make sure to check the status again.
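The slow-versus-broken check described above is mechanical enough to script. Here is a minimal sketch in Python of just the classification logic; it assumes you have already sampled Exec_Source_Log_Pos and Seconds_Behind_Source from SHOW REPLICA STATUS a few times (the sampling itself, via a MySQL client, is out of scope and the example data is made up):

```python
def classify_replication(samples):
    """Classify replica health from repeated SHOW REPLICA STATUS samples.

    samples: a list of (exec_source_log_pos, seconds_behind_source)
    tuples, taken a few seconds apart, oldest first.
    """
    positions = [pos for pos, _ in samples]
    lags = [lag for _, lag in samples]
    if positions[-1] > positions[0]:
        # Exec_Source_Log_Pos is increasing: the replica is still
        # applying events, just not fast enough. Slow, not broken.
        return "slow"
    if lags[-1] > lags[0]:
        # Position is stuck but the lag keeps growing: broken.
        return "broken"
    return "ok"

# Position advancing between samples -> merely slow.
print(classify_replication([(1000, 120), (1500, 130), (2100, 135)]))  # slow
# Position stuck while lag grows -> broken.
print(classify_replication([(1000, 120), (1000, 150), (1000, 180)]))  # broken
```

This mirrors the manual procedure: it only distinguishes the two cases; deciding whether a "slow" result is an urgent capacity issue still needs the wider metrics mentioned above.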
There’s much more that would need to be included to complete this, but hopefully this gives you a good feel for the type of structure I’m referring to. As it is, this document covers a tiny fraction of the information that is in the concise documentation, but is significantly more work to interpret.