In a previous blog post I described the steps to troubleshoot a failed Percona XtraDB Cluster and restore it to a working state. This post adds some more information.
Finding the most advanced node
You bootstrap a failed cluster by starting the most advanced node first, i.e. the node with the most recent copy of the database. This node can be identified by looking at the file /var/lib/mysql/grastate.dat on each node:
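The grastate.dat file looks roughly like this (the UUID and seqno below are illustrative placeholders, not values from a real cluster):

```shell
cat /var/lib/mysql/grastate.dat
# GALERA saved state
# version: 2.1
# uuid:    <cluster-state-uuid>
# seqno:   1352
```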
The most advanced node is the one with the highest seqno (sequence number). You can therefore use that node to bootstrap the cluster after deleting the MySQL data directory on the other nodes, as explained previously.
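As a sketch, on a SysV-init system the bootstrap is typically started like this (the exact service name and init style depend on the distribution):

```shell
# Run on the most advanced node only, after clearing the MySQL
# data directory on the other nodes
service mysql bootstrap-pxc
# or, equivalently:
/etc/init.d/mysql bootstrap-pxc
```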
Note that a seqno = -1 is normal in a running, healthy cluster, as this value is changed only when the MySQL daemon stops. However, a stopped cluster with a seqno = -1 on all nodes means that the cluster crashed badly; in this case, the best thing to do is to bootstrap from any node and then perform a restore from the latest backup.
Resetting the quorum
If the bootstrap command fails, it might be because the bootstrapped node is no longer part of the Primary component. The Primary component is the group of nodes that holds the quorum after a cluster split and is therefore authorized to modify the database. In this case, start the node with the command
Then force the node as Primary by running this MySQL command:
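The commands were likely along these lines: start MySQL normally (the node will come up as non-Primary), then force it to form a new Primary component via the pc.bootstrap provider option, which is the standard Galera mechanism for resetting the quorum (the service name here is an assumption):

```shell
# Start the node without bootstrapping; it joins as non-Primary
service mysql start

# Force this node to form a new Primary component
mysql -u root -p -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=true';"
```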
EDIT (21/04/2016): On RHEL 7, the command to use to bootstrap a PXC cluster is
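On systemd-based systems such as RHEL 7, Percona ships a dedicated bootstrap unit:

```shell
systemctl start mysql@bootstrap.service
```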
It might also happen that one node is running correctly but all ClusterControl recovery jobs fail, because ClusterControl refuses to synchronize the other nodes with it. This, again, is a quorum problem: reset the quorum by forcing the running node to be Primary, as shown above.
Under normal conditions, all nodes of a cluster must be Primary. This can easily be checked with the MySQL command
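The check queries the standard Galera status variable wsrep_cluster_status, which reports "Primary" on a healthy node:

```shell
# Run on each node; a healthy node reports "Primary"
mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
```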