Monday, April 22, 2019

MySQL Galera Split Brain

The Galera cluster option for MySQL is one advantage MySQL has over Postgres. The clustering allows high availability and good performance. However, it's not without its issues and one or two of those is the split-brain problem. There can be two different kinds of split brain: poorly configured clusters that can't achieve a quorum for the split you're worried about and, more rarely, an update issue with the nodes which is interesting if frustrating.

MySQL Galera clusters have worked well and provided good uptime. The normal configuration is to have more active nodes in the primary location or data center. In the event of the link between the primary and secondary sites failing, the primary cluster should continue to run. It will continue running provided it has the majority of quorum votes on its own. It is important, therefore, to make sure that the quorum is achievable without the secondary location. One option is to keep the number of active nodes higher in the primary location. Another is to adjust the pc.weight of each node to make sure that the weight is larger in the primary location. See the Galera docs about setting the weight of a node. Either of these options makes the primary location safe from failures of the other locations, but still presents a problem if the primary data center has an issue. The remaining option is to use a 3rd location or witness to break ties or provide a quorum - you could do that with your own software, with a full set of servers in a 3rd location or use Galera's own solution. Galera's solution is garbd, the Galera Arbitrator, which acts as a witness or voting system when you only really have two main locations.

The second split brain issue is more interesting - i.e. it isn't a simple configuration or quorum issue. In this one, a Galera cluster shuts down on its own after detecting an issue. All of the active nodes except one would report something like this:

 "Duplicate entry 'entry_value_being_inserted' for key 'Key_for_column', Error_code: 1062 "
It might include "handler error HA_ERR_FOUND_DUPP_KEY" as well.

The issue here is that the Galera replication has pushed updates to every node. The replication pushes the change synchronously, but applies asynchronously - flow control is used to prevent nodes from getting too far behind - see the Galera docs (and first sentence) ('commits asynchronously' from these Galera docs). As you'd expect, it's RBR - row-based replication, not statement-based. What happens here is a case of a node falling behind. Each of the other nodes see an inconsistency and ,in order to protect the cluster, shut themselves down. Unfortunately, this could be every node shuts down except the one lagging, inconsistent node. With only one node active, Galera will realize it dosen't have the majority of votes to maintain the cluster and shuts the remaining node down. In order to recover the cluster, you need to find the last node that was running and start it with the bootstrap option. Then start every other node as normal. This issue doesn't happen often based, but it's good to understand it and how to recover it when it does happen. By the way, there are some related issues that could do the same so see this link for how to recover: this for fixing this and related issues.

No comments:

Post a Comment