Split brain, with DRBD, is much less of a disaster than in conventional cluster setups employing shared storage. But, you ask, how can I protect my DRBD cluster against split brain in the first place? Here’s how.
Let’s briefly reiterate what split brain, in the DRBD sense, really means. DRBD split brain occurs when your nodes have lost their replication link due to network failure, and you make both nodes Primary after that.
When just the replication link dies, Heartbeat as the cluster manager will still be able to “see” the peer node via an alternate communication path (which you hopefully have configured, see this post). Thus, there is nothing that would keep Heartbeat from migrating resources to that DRBD-wise disconnected node if it so decides or is so instructed. That would cause precisely the DRBD split brain situation described above.
If that happens, your cluster manager will have created two diverging sets of data. When that occurs, manual intervention is, for all practical purposes, inevitable. Not a desirable situation.
Enter dopd, the DRBD outdate-peer daemon. The second dopd detects a connection failure between peer DRBD nodes, it talks to Heartbeat and instructs it to use whatever communication paths it still has available to contact the remote node. Then, dopd on the peer node will outdate the DRBD resource there (that is, set the Outdated flag in the DRBD metadata). DRBD will subsequently stubbornly refuse to become Primary on that node under any circumstances, until the network connection is re-established and DRBD is confident that the local copy of the data is UpToDate again. This effectively prevents DRBD split brain from happening, and makes sure that your cluster service will not run on a cluster node that has a bad (outdated) set of data.
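To make the sequence above concrete, here is a toy model of it in Python. It is only a sketch of the behaviour described in this post, not DRBD or dopd code: the Node class and the functions replication_link_lost and resync_finished are hypothetical names invented for illustration.

```python
class Node:
    """Toy stand-in for one DRBD node; not real DRBD or dopd code."""

    def __init__(self, name, role="Secondary", disk="UpToDate"):
        self.name = name
        self.role = role
        self.disk = disk

    def promote(self):
        # A node whose data is not UpToDate refuses to become Primary.
        if self.disk != "UpToDate":
            raise RuntimeError(
                f"{self.name}: refusing to become Primary, disk is {self.disk}"
            )
        self.role = "Primary"


def replication_link_lost(peer):
    """Model dopd's effect: the disconnected non-Primary peer is marked Outdated."""
    if peer.role != "Primary":
        peer.disk = "Outdated"


def resync_finished(peer):
    """Model the end of the re-sync after the link comes back."""
    peer.disk = "UpToDate"


if __name__ == "__main__":
    alice = Node("alice", role="Primary")
    bob = Node("bob")

    replication_link_lost(bob)      # link dies, bob gets outdated
    try:
        bob.promote()               # a switch-over attempt is refused
    except RuntimeError as err:
        print(err)

    resync_finished(bob)            # link is back, data is current again
    alice.role = "Secondary"
    bob.promote()                   # now the switch-over succeeds
    print(bob.name, "is", bob.role)
```

The point of the model is the guard in promote(): as long as a node is flagged Outdated, no instruction from the cluster manager can make it Primary.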
To enable dopd, just add these lines to your ha.cf on both nodes:
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
You may have to adjust dopd’s path according to your preferred distribution.
Afterwards, run /etc/init.d/heartbeat reload or the equivalent command for your distribution. You should now see dopd as a running process in your process table (hint: ps ax | grep dopd).
Then, add these items to your DRBD resource configuration (again, on both nodes):
common {
  handlers {
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }
  # other common settings go here
}
resource my-resource {
  disk {
    fencing resource-only;
  }
  # other resource-specific settings go here
}
Finally, issue drbdadm adjust all on both nodes to reconfigure your resources and reflect your drbd.conf changes.
Now, unplug your DRBD replication link. Observe /proc/drbd on your Secondary:
version: 8.0.5 (api:86/proto:86)
SVN Revision: 3011 build by buildsystem@barschlampe, 2007-08-03 07:44:08
0: cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown C r---
ns:0 nr:14 dw:14 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
The Secondary is now considered Outdated. If you feel like it, you may now attempt to manually switch over one of your DRBD-backed resources. The resource won’t come up on the remote node, because that node’s copy of the data is now potentially outdated.
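If you would rather check this from a script than by eye, the status line can be picked apart with a regular expression. The following is a minimal sketch that assumes the DRBD 8.0-style /proc/drbd layout shown above (cs:, st: and ds: fields on the device line); the function names and the SAMPLE text are made up for this example, and other DRBD versions may format the file differently.

```python
import re

# Matches a device status line such as:
#  0: cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown C r---
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+st:(?P<st_local>[^/\s]+)/(?P<st_peer>\S+)"
    r"\s+ds:(?P<ds_local>[^/\s]+)/(?P<ds_peer>\S+)"
)


def parse_proc_drbd(text):
    """Return one dict per device status line found in /proc/drbd-style text."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            device = match.groupdict()
            device["minor"] = int(device["minor"])
            devices.append(device)
    return devices


def is_outdated(device):
    """True if the local disk state of this device is Outdated."""
    return device["ds_local"] == "Outdated"


# Sample text copied from the output above (assumption: same layout on your system).
SAMPLE = """version: 8.0.5 (api:86/proto:86)
 0: cs:WFConnection st:Secondary/Unknown ds:Outdated/DUnknown C r---
    ns:0 nr:14 dw:14 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
"""

if __name__ == "__main__":
    for dev in parse_proc_drbd(SAMPLE):
        print(dev["minor"], dev["cs"], dev["ds_local"], is_outdated(dev))
```

On a real node you would pass the contents of /proc/drbd to parse_proc_drbd instead of SAMPLE.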
Re-plug your DRBD replication link. Your Secondary will briefly re-sync and then be in UpToDate state again. A manual Heartbeat resource switch-over should now succeed.
OK . . . Backtracking a little here. I’d still like to know where I can get some dopd docs, but apparently “drbdadm outdate all” doesn’t work unless the dopd nodes are disconnected. I tried disconnecting and it worked.
Upon closer inspection of the dopd syslog error it appears that it’s looking for drbdadm in /sbin and my distro has it in /usr/sbin. I tried copying and ln -s drbdadm, drbdsetup and drbdmeta to /sbin but neither option works and now the log shows
unknown exit code from /sbin/drbdadm outdate all: 126
instead of
unknown exit code from /sbin/drbdadm outdate all: 127
Any clues to where the path problem is?
I’ll put this on the forum as well.
Thanx
I’m looking at a setup where the only available communication between nodes would be a DSL connection. I’d like to have a dial-on-demand PPP connection available for situations where the replication link is broken, so that I can protect the data against split brain issues. Would dopd help me there?
Edward,
dopd is unlikely to help you in your scenario, as it would usually make little sense to use DRBD over a DSL connection. Except perhaps with protocol A and extremely low write load. But you’re probably better off with csync2.
I know this isn’t the “forum” but can you explain why it wouldn’t make sense to use it over a DSL link? I’ll have at least 250Kbits available for DRBD replication.
Block-level synchronous replication over a 250kbps link? Good luck. But don’t come complaining if your performance goes down the tubes.
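For a rough sense of scale, the arithmetic behind that warning can be sketched as follows. The figures are assumptions chosen for illustration, not measurements: the link is taken to be exactly 250,000 bits per second with no protocol overhead or latency, and the 4 KiB write size and 1 GiB re-sync size are arbitrary examples.

```python
# Assumption: 250 kbit/s means 250,000 bits per second, fully usable for DRBD.
LINK_BITS_PER_SECOND = 250_000


def transfer_seconds(num_bytes, bits_per_second=LINK_BITS_PER_SECOND):
    """Time to push num_bytes over the link, ignoring overhead and latency."""
    return num_bytes * 8 / bits_per_second


if __name__ == "__main__":
    one_block = transfer_seconds(4096)      # one hypothetical 4 KiB write
    one_gib = transfer_seconds(1024 ** 3)   # re-syncing 1 GiB of changed data
    print(f"4 KiB write: {one_block:.3f} s on the wire")
    print(f"1 GiB re-sync: {one_gib / 3600:.1f} hours")
```

Under these assumptions a single 4 KiB write spends about 0.13 seconds on the wire, and re-syncing 1 GiB takes roughly nine and a half hours, which is why synchronous replication over such a link is a hard sell.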