NetWare Cluster Services: The Gory details of Heartbeats, Split Brains and Poison Pills

(Last modified: 18Feb2003)

This document (10053882) is provided subject to the disclaimer at the end of this document.

goal

NetWare Cluster Services:  The Gory details of Heartbeats, Split Brains and Poison Pills

fact

Novell NetWare 5.1

Novell NetWare 6.0

Novell Cluster Services 1.01

Novell Cluster Services 1.6

fix

Introduction
Background Information
Heartbeats
Node Failure
False Node Failure
Split Brains
False Split Brains
Poison Pills
Support Guide
Problems And Solutions
Additional Information
Observable Coincidences?

 
Introduction

The purpose of this document is to provide information to aid the support, tuning and successful operation of NetWare Cluster Services clusters. The document comprises two sections. The first section presents background information defining terminology and describing algorithms. The second section is organized as a problem / solution support guide. This document is not intended to be a complete cluster operations manual. It only covers heartbeats, split brains and poison pills.

Background Information
Heartbeats

Once a node has successfully joined (CLUSTER JOIN console command) a cluster, it participates in a distributed failure detection algorithm. Node failure is detected by external monitoring of a continuous heartbeat signal. If the observed heartbeat stops, the monitoring nodes will infer that the monitored node has failed.

Heartbeats are small IP packets. Each node periodically transmits its own heartbeat packet while simultaneously monitoring the heartbeat packets received from other nodes. The heartbeat period is tunable. If a heartbeat is not observed after a tunable threshold period of time, the next phase of the failure detection algorithm is executed.
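
As a rough illustration only (not NWCS source code), the following C sketch shows how a monitoring node might timestamp the last heartbeat received from a peer and flag the peer as suspect once the tolerance threshold has elapsed. The structure, function names and use of wall-clock time are assumptions made for the example; the default values in the comments match the defaults described below.

    #include <stdio.h>
    #include <time.h>

    #define HEARTBEAT_PERIOD_SECS 1   /* default: one LAN heartbeat per second     */
    #define TOLERANCE_SECS        8   /* default threshold before failure handling */

    /* Hypothetical per-peer state kept by a monitoring node. */
    struct peer
    {
        int    node_number;
        time_t last_heartbeat;        /* when the last heartbeat packet arrived    */
    };

    /* Called whenever a heartbeat packet is received from this peer. */
    static void record_heartbeat(struct peer *p)
    {
        p->last_heartbeat = time(NULL);
    }

    /* Called periodically; returns 1 once the peer has been silent too long. */
    static int peer_is_suspect(const struct peer *p)
    {
        return difftime(time(NULL), p->last_heartbeat) >= TOLERANCE_SECS;
    }

    int main(void)
    {
        struct peer slave = { 2, 0 };

        record_heartbeat(&slave);     /* a heartbeat packet arrives */

        if (peer_is_suspect(&slave))
            printf("node %d suspect: begin the next phase of the algorithm\n",
                   slave.node_number);
        else
            printf("node %d healthy\n", slave.node_number);
        return 0;
    }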

It is not necessary for all nodes to monitor all other nodes in order to achieve total coverage. There is always one cluster master node; the non-master nodes are called slaves. The master node's heartbeat packets are broadcast to all the slaves, and each slave unicasts its heartbeat packets to the master node. The master node monitors all the slaves, and all the slave nodes monitor the master.

Every node also emits a heartbeat signal by periodically incrementing a counter value stored in a per-node sector of the shared Split Brain Detector (SBD) disk partition. Each node reads all the sectors before writing its own.
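
The disk heartbeat can be pictured with the following hypothetical C sketch: each node owns one sector on the SBD partition, reads every node's sector, then increments the counter in its own sector and writes it back. The sector layout and helper functions are illustrative assumptions, not the actual SBD on-disk format.

    #include <stdint.h>
    #include <string.h>

    #define MAX_NODES 32

    /* Assumed, illustrative layout of one node's SBD sector. */
    struct sbd_sector
    {
        uint32_t node_number;
        uint64_t tick;             /* incremented once per disk heartbeat      */
        uint32_t poison_pill;      /* non-zero if this node has been cast out  */
    };

    /* In a real cluster these would be raw reads and writes against the shared
       SBD partition; here they are stubbed with an in-memory array.            */
    static struct sbd_sector sbd[MAX_NODES];

    static void sbd_read_all(struct sbd_sector out[MAX_NODES])
    {
        memcpy(out, sbd, sizeof sbd);
    }

    static void sbd_write_own(int my_node, const struct sbd_sector *s)
    {
        sbd[my_node] = *s;
    }

    /* One disk heartbeat: read every sector, then bump and rewrite our own. */
    static void sbd_tick(int my_node)
    {
        struct sbd_sector view[MAX_NODES];

        sbd_read_all(view);        /* observe the other nodes' counters first */

        view[my_node].tick++;
        sbd_write_own(my_node, &view[my_node]);
    }

    int main(void)
    {
        sbd_tick(0);               /* node 0 performs one disk heartbeat */
        return 0;
    }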
 
The tunable parameters that control heartbeat period and failure detection thresholds are documented on page 28 of the NWCS overview and installation guide.

By default, LAN heartbeat packets are transmitted at the rate of one per second. The default threshold is eight seconds, and disk heartbeat I/Os occur at a period of half the threshold value, that is, one disk heartbeat every four seconds.

Failure detection, and the next phase of the algorithm, is triggered by the continuous loss of LAN heartbeat packets for a period of time equal to or greater than the threshold parameter. Nodes that remain in contact execute the next phase of the algorithm. Master and slave nodes communicate over the LAN in order to commit a new cluster membership view. This requires a number of packet exchanges between master and slave nodes. A new master node may be elected during this phase. The final phase of the algorithm compares heartbeat and other information found on the SBD partition.

Node Failure

When a node fails, its heartbeat stops. The surviving nodes will commit a new cluster membership view and then execute recovery steps. They will restart cluster resources like volumes, services or applications. This is done under the assumption that a failed node, because it failed, can no longer have any ownership of shared resources. It is therefore possible for the surviving nodes to assert ownership of failed resources by mounting volumes or binding secondary IP addresses, for example. This process is called failover.

Mutually exclusive ownership of cluster resources is enforced by software convention. Only one node may own any given cluster resource, and all nodes maintain consistent agreement. Failed nodes are removed from the cluster and subsequently ignored. A failed node may only participate in the cluster again by re-running the cluster join protocol. This usually happens during server boot. If a node has not executed the join protocol, it is prevented from interacting with shared resources. (Nodes are allowed to run NetWare plus Cluster Services software while not cluster members in order to support standalone server activities and maintenance.)

False Node Failure

A failed node can only access shared cluster resources again after it has rebooted and successfully rejoined the cluster. However, there are two known cases where it is possible for a node that appears to have failed to come back to life. Because other cluster nodes will have taken over its resources in the meantime, the resurrected node has the potential to corrupt shared state: it believes it is still a member of the previous cluster membership and therefore has no reason to assume it cannot access the resources it owned before it went into its deep sleep.

Once a node has been removed from the cluster, it must be prevented from returning in this manner. Surviving nodes guarantee a failed node cannot return to life by writing a special token onto its SBD sector. All nodes check their own sector once per disk heartbeat period. If a node discovers the special token in its sector, it concludes that it has been cast out of the cluster by other nodes and will voluntarily abend. This is called eating a poison pill.
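
The check described above might be sketched as follows. This is an illustrative C fragment with assumed names and an assumed token value; the real SBD format and abend path are internal to NWCS.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* Illustrative token value; the real on-disk format is not documented here. */
    #define POISON_PILL_TOKEN 0xDEADBEEFu

    struct sbd_sector
    {
        uint32_t node_number;
        uint64_t tick;
        uint32_t poison_pill;      /* written by surviving nodes to cast us out */
    };

    /* Stub: in a real node this is a raw read of our own SBD sector. */
    static void sbd_read_own(int my_node, struct sbd_sector *out)
    {
        out->node_number = (uint32_t)my_node;
        out->tick        = 0;
        out->poison_pill = 0;      /* pretend nobody has cast us out */
    }

    /* Called once per disk heartbeat period. */
    static void check_for_poison_pill(int my_node)
    {
        struct sbd_sector mine;

        sbd_read_own(my_node, &mine);

        if (mine.poison_pill == POISON_PILL_TOKEN)
        {
            /* Another node decided we failed and has taken over our resources;
               abend rather than risk corrupting shared state.                   */
            fprintf(stderr, "Ate Poison Pill: cast out by another node\n");
            abort();
        }
    }

    int main(void)
    {
        check_for_poison_pill(1);
        return 0;
    }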

In theory, false node failure detection is rare. But two scenarios are known:
 
1) A node enters and stays in real mode for a period of time equal to or greater than the threshold parameter.
2) A node is suspended by taking it into the kernel debugger, then restarted after a period of time equal to or greater than the threshold parameter.

The second case should be avoided. If kernel debugging is necessary, use the console CLUSTER DEBUG command to halt all nodes. Or use Portal's non-intrusive debugging tools. Because the NetWare floppy disk driver can execute in real mode for longer than the threshold period, avoid using the server's floppy disk drive if possible.

Split Brains

If a single node (or group of nodes) somehow becomes isolated from other nodes, a condition called Split Brain results. Consider a two-node example. If the LAN heartbeat packets from master to slave and slave to master are lost, the master will assume the slave failed. Conversely, the slave will assume the master failed. Both nodes will execute the next phase of the algorithm and both will commit a new cluster membership view that excludes the opposite node. Two independent one-node clusters will be formed. Neither will be aware of the existence of the other. If allowed to persist, each cluster will failover the resources of the other. Since both clusters retain access to shared disks, corruption will occur when both clusters mount the same volumes.

In theory, split brain conditions are rare. But they can occur if the LAN experiences a hard fault such as a NIC, hub or switch hardware failure. In these cases, split brains can be avoided by deploying dual redundant LAN segments.
 
But if a split brain condition does occur, it must not be allowed to persist. The final phase of the algorithm achieves this. Before propagating the new cluster membership view to higher layer recovery software, the failure detection algorithm running on all cluster nodes inspects the SBD partition. Only one side of the split brain survives this phase; the other side will be forced to shut down. The safest way to ensure nodes in the losing side of the split brain cannot corrupt the shared state of the winning side in the future is to have them eat a poison pill (abend). Options to force an orderly shutdown of losing nodes are limited: winning nodes must ensure losing nodes have relinquished control of shared resources before they can safely proceed, which requires communication over a LAN that has apparently failed.

The process that selects winners versus losers is called the split brain tie breaker. If there are more nodes in one side of the split than in the other, the majority side will always win. In the case of a dead heat (an equal number of nodes on both sides), the tie breaker will select the side that contains the previous master node. (Remember, in a split brain condition there are two or more clusters, and so multiple masters.) In the special case of a two-node cluster split brain, if NIC link status can be determined, the node with good LAN connectivity will win and the other node will lose.
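
The tie breaker rules described here (and the equal-split rules listed later under Poison Pills) can be summarized in a single decision function. The following C sketch is a hypothetical restatement of those documented rules, not NWCS source: majority wins; in a two-node split, NIC link status decides when it can be determined (the NWCS 1.01 two-node tie breaker patch behavior); otherwise the side holding the previous master wins; and if the old master is gone, the side containing the highest numbered node wins.

    #include <stdio.h>

    /* One side ("partition") of a split brain, as observed via the SBD. */
    struct side
    {
        int node_count;            /* number of live nodes on this side        */
        int has_old_master;        /* 1 if the previous master node is here    */
        int highest_node_number;   /* highest numbered live node on this side  */
        int lan_link_up;           /* two-node case: 1 if the NIC link is good */
    };

    /* Returns 1 if "us" survives the split brain, 0 if our side must eat
       a poison pill.  Hypothetical restatement of the documented rules.    */
    static int we_win_tie_breaker(const struct side *us, const struct side *them,
                                  int two_node_cluster, int link_status_known)
    {
        if (us->node_count != them->node_count)          /* majority rule        */
            return us->node_count > them->node_count;

        if (two_node_cluster && link_status_known &&     /* two-node patch rule  */
            us->lan_link_up != them->lan_link_up)
            return us->lan_link_up;

        if (us->has_old_master != them->has_old_master)  /* side with old master */
            return us->has_old_master;

        /* Old master failed or left: highest numbered node decides. */
        return us->highest_node_number > them->highest_node_number;
    }

    int main(void)
    {
        struct side a = { 2, 1, 4, 1 };    /* two nodes, holds the old master */
        struct side b = { 2, 0, 6, 1 };    /* two nodes, higher node number   */

        printf("side A %s\n", we_win_tie_breaker(&a, &b, 0, 0) ? "wins" : "loses");
        return 0;
    }

Swapping the order of the link-status and old-master checks would model the unpatched behavior described later, where the master node always wins whether it has good LAN connectivity or not.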

The philosophy ingrained in these algorithms is intended to increase overall service level availability. This may not be immediately apparent. In normal circumstances, split brains are caused by LAN failures. If LAN failure cannot be masked with redundant hardware, a node that loses contact with other nodes because of NIC failure, for example, has also lost contact with its clients. Clients therefore experience a service outage. It is considered a better policy to force the failover of resources to connected nodes that clients can reach, even if this means a semi-functional node will eat a poison pill.

Novell is exploring alternatives for future NWCS releases including hardware assistance from new SAN devices.

False Split Brains

LAN heartbeat packets can be lost for periods of time longer than the default threshold even in situations where the LAN hardware is functioning correctly. The result is a false split brain condition - the LAN hasn't actually failed but nodes will temporarily lose contact with each other as though it had failed. In this situation, the split brain tie breaker will usually cause a single node to be forced out of the cluster. It will eat a poison pill. If the cause of transient LAN outage is global in nature, rather than isolated to an individual (suspect) node, a larger subset of a cluster (more nodes) may eat a poison pill.

False split brains occur as a result of improper tuning or configuration. Possible factors include:

1) Insufficient number of service processes
2) Insufficient number of packet receive buffers
3) LAN switch that delays or drops broadcast packets
4) Extremely high LAN utilization / packet storms
5) Unstable LAN driver
6) Software that hogs the CPU
7) Non-optimal LAN driver load order
8) Non-optimal placement of the NIC on PCI bus
9) NIC that can transmit but not receive packets

Poison Pills

Poison pill is the term coined to describe the cluster software's intentional abends. NetWare Cluster Services offers the following flavors of poison pill. They are all equally deadly but promote overall system availability. Darwin's theory of evolution applies:

CLUSTER: Node castout, fatal SAN read error
CLUSTER: Node castout, fatal SAN write error
CLUSTER: Node castout, fatal SAN device alert

Each of these conditions is caused by a fatal I/O error or device alert signaled by the SAN device driver when invoked by the SBD. In these circumstances, a node has essentially lost its I/O path to shared disk devices. This in turn means any applications or file systems using the same I/O path will shortly experience errors as well. A node in this state is essentially useless. Furthermore, attempts to shut down applications or volumes are usually problematic; depending on the application, it is possible to hang the server. Following the philosophy of algorithms designed to increase service level availability even at the expense of physical nodes, this poison pill ensures cluster resources are quickly failed over to other nodes with functional I/O paths. Service level availability is therefore increased. It is possible to deploy dual redundant SAN hardware to immunize against these conditions.
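
In outline, the response to these errors is simply to abend as soon as the SBD is told the I/O path is gone. The following C sketch is illustrative only; the status codes and handler function are assumptions, though the console messages echo the ones listed above.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical status codes returned by the SAN device driver to the SBD. */
    enum san_status { SAN_OK, SAN_FATAL_READ, SAN_FATAL_WRITE, SAN_DEVICE_ALERT };

    static void handle_sbd_io_status(enum san_status s)
    {
        switch (s)
        {
        case SAN_OK:
            return;                /* disk heartbeat completed normally */
        case SAN_FATAL_READ:
            fprintf(stderr, "CLUSTER: Node castout, fatal SAN read error\n");
            break;
        case SAN_FATAL_WRITE:
            fprintf(stderr, "CLUSTER: Node castout, fatal SAN write error\n");
            break;
        case SAN_DEVICE_ALERT:
            fprintf(stderr, "CLUSTER: Node castout, fatal SAN device alert\n");
            break;
        }
        /* The node has lost its I/O path to shared storage; abend so that its
           resources fail over quickly to nodes whose I/O paths still work.     */
        abort();
    }

    int main(void)
    {
        handle_sbd_io_status(SAN_OK);
        printf("disk heartbeat I/O succeeded\n");
        return 0;
    }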

Ate Poison Pill in SbdProposeView given by some other node.
Ate Poison Pill in SbdWriteNodeTick given by some other node.

Both of these conditions are caused by false node failure detection.

Ate Poison Pill - link is down: other node is alive and ticking.

This is a special case two-node split brain condition. The node that eats this poison pill has lost the tie breaker because its LAN link was determined to have failed. The other node wins. This logic is available only with the NWCS 1.01 two-node tie breaker patch. Without this patch, the master node will always win whether it has good LAN connectivity or not.

This node is in the Minority partition and the node in the Majority partition is Alive.

This is a split brain condition. One side of the split brain contains a larger number of nodes than the other side. The majority side always wins. Nodes in the minority side lose and eat this poison pill.

At least one of the nodes is Alive in the old master's node partition.
This node is NOT in the old master's node partition.

This is a split brain condition. Both sides of the split brain contain an equal number of nodes. The side that contains the previous master node wins. Nodes in the other side lose and eat this poison pill.

The Alive partition with highest node numbers should survive.
This node is NOT in the Alive partition with highest node number.

This is a split brain condition. Both sides of the split brain contain an equal number of nodes but the previous master node failed or left the cluster. The side that contains the highest numbered node wins. Nodes in the other side lose and eat this poison pill.

This cluster node failed to process its self-LEAVE event
in a timely fashion and will be forced out of the cluster.

When a node voluntarily leaves the cluster (CLUSTER LEAVE or DOWN console commands), it must first shut down all of its cluster resources before other nodes can safely restart them. However, in rare or improperly configured situations, this can stall the server. Some services or applications can block their unload threads and stall indefinitely. In general, it is difficult to accurately determine whether it is safe to proceed with restarting a cluster resource on another node if it has not successfully shut down. Attempts to run threads in the context of the service or application in order to gather additional information usually also stall behind existing threads. In order to prevent the corruption that could potentially occur by having a partially operational cluster resource executing on one node and a second copy of the same resource restarting on another node, a node that stalls during cluster leave will eat this poison pill. This is another example of an algorithm designed to promote high service level availability. Other nodes are guaranteed to proceed after a fixed grace period of ten minutes, at which point the stalled node will eat its poison pill. This ensures cluster resources do not stall indefinitely; they are eventually restarted on other nodes.
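
A minimal sketch of that grace period, with hypothetical names and a simulated check, is shown below: the node notes when the self-leave began, and if its resources are still not offline after ten minutes it abends so the other nodes can restart them safely.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SELF_LEAVE_GRACE_SECS (10 * 60)   /* fixed ten-minute grace period */

    /* Called periodically while a self-LEAVE is in progress.  'started' is when
       the leave began; 'resources_offline' is 1 once every unload script has
       finished.  Names and structure are assumptions for illustration.          */
    static void self_leave_watchdog(time_t started, int resources_offline)
    {
        if (resources_offline)
            return;                            /* the leave completed in time */

        if (difftime(time(NULL), started) >= SELF_LEAVE_GRACE_SECS)
        {
            /* Other nodes proceed after the grace period regardless; abend so
               a half-stopped resource here cannot corrupt the restarted copy.  */
            fprintf(stderr, "self-LEAVE event not processed in time: abend\n");
            abort();
        }
    }

    int main(void)
    {
        time_t leave_started = time(NULL);

        self_leave_watchdog(leave_started, 0); /* still within the grace period */
        printf("still waiting for unload scripts to finish\n");
        return 0;
    }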

CRM:CRMSelfLeave: some resources went in comatose state while SelfLeave

This condition is similar to the previous one. While attempting to leave the cluster, a failure is detected when running a cluster resource unload script. In this case, it is also not possible to accurately determine the state of the resource - whether it remains partially operational. A node will eat this poison pill to allow other nodes to safely take over.
 
Support Guide

For each of the four major classes of poison pill, this section describes possible diagnostic or tuning steps. Should these steps not yield a successful resolution, requests for additional information are given in the Additional Information section.

Before analyzing cluster problems, it is good practice to complete a general LAN and server health check and inventory. This should include the following:

1) LAN
2) Server parameters and utilization
3) NDS
4) Licensing

Please see also Novell Technical Support (NTS) documents 2943356, 2943472, 10018663 and 10011290. More information may be found in the HIGHUTIL1.EXE document (located on NTS' minimal patch list web page).

Problems And Solutions

Fatal SAN Errors

Fatal SAN I/O errors can occur as a result of hardware problems such as Fibre Channel cable removal or GBIC / FC port failure. Hardware vendor specific SAN diagnostic tools offer the best method to locate faults of this nature. Visual inspection and/or cleaning of laser light components or systematic replacement of suspect components are other alternatives. Unstable SAN device drivers may also generate I/O errors. Ensure the driver is certified. Verify also driver, switch and/or storage controller firmware revision level dependencies.

False Node Failure Detection

Check the following:

1) Was the node taken into the kernel debugger then later allowed to continue?
2) Did the node enter real mode shortly before it ate a poison pill? Possible causes include loading an NLM from the server's floppy disk drive or rewinding a tape.
3) Is there a CPU hog? Set the server parameter CPU HOG TIMEOUT to a value equal to the heartbeat threshold parameter (default is eight seconds) to check for and identify potential offending NLMs or drivers. Report CPU hogs to Novell Technical Support.
 
Try increasing the cluster tolerance and slave watchdog parameters. Normally, these should be set equal. Use Novell's cluster statistics gathering tool to estimate an appropriate value: configure the tolerance and slave watchdog parameters to a value larger than the time the server may spend in real mode, or longer than the server's worst CPU hog, for example. See page 28 of the NWCS overview and installation guide. Some experimentation may be required. First, try a large number like sixty seconds, then systematically reduce it. But please note that a larger value will increase the minimum failover time in the event of an actual node failure.
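
Purely as a back-of-the-envelope aid (the inputs, margin and helper function below are assumptions, not a Novell formula), the choice can be expressed as: take the worst observed heartbeat gap, real mode excursion or CPU hog, add a safety margin, and never go below the eight second default.

    #include <stdio.h>

    #define DEFAULT_TOLERANCE_SECS 8

    /* Pick a tolerance larger than the worst stall the node is known to suffer.
       The 50% margin is an arbitrary illustrative choice, not a Novell rule.    */
    static int recommended_tolerance(int worst_heartbeat_gap_secs,
                                     int worst_real_mode_secs,
                                     int worst_cpu_hog_secs)
    {
        int worst = worst_heartbeat_gap_secs;

        if (worst_real_mode_secs > worst) worst = worst_real_mode_secs;
        if (worst_cpu_hog_secs   > worst) worst = worst_cpu_hog_secs;

        worst += worst / 2 + 1;               /* safety margin */

        return worst > DEFAULT_TOLERANCE_SECS ? worst : DEFAULT_TOLERANCE_SECS;
    }

    int main(void)
    {
        /* e.g. the statistics tool shows 12 second gaps during tape rewinds */
        printf("suggested tolerance: %d seconds\n", recommended_tolerance(12, 0, 0));
        return 0;
    }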

Split Brain Conditions

Split brain conditions can occur as a result of LAN hardware or software problems. Check the following:

1) Check LAN driver and protocol stack statistics. Is a bad NIC intermittently dropping packets?
2) Did a NIC or cable fail? Was a cable removed? Was the LAN hub or switch power cycled?
3) Check LAN switch for settings that could delay packets (spanning tree, for example).
4) Is the server configured correctly? Check packet receive buffers and service processes for evidence of exhaustion. Adjust if necessary.

Attempt to resolve LAN issues. Then try increasing the cluster tolerance and slave watchdog parameters. Use Novell's cluster statistics gathering tool or a packet trace to estimate an appropriate value: this should be larger than the period of time during which the LAN may be experiencing transient outages, for example. Consider using a private LAN for heartbeat packets if the public LAN is heavily utilized (or unstable).

Stalled Self Leave

1) Inspect the state of the CLUSTER RESOURCES screen to determine the most recent console commands.
2) Cross check cluster resource unload scripts to find a matching cluster resource and inspect the console screen and/or console log for corresponding console commands. What was the last command executed? What cluster resource stalled the server? Does the corresponding service or application stall the server if manually loaded, then unloaded? Identify problem services or applications and report them to Novell Technical Support.
3) Is there an error in an unload script? Perhaps a volume is dismounted before the application that uses it is unloaded.

 
Additional Information

To assist Novell Technical Support in diagnosing cluster problems, some or all of the following information may be requested:

1) Cluster statistics
a. Report created by the cluster statistics gathering tool.
2) Cluster configuration
a. HTML report created by the ConsoleOne cluster view.
3) SAN configuration
a. Host bus adapter, hub or switch type.
b. Device driver name and revision.
c. Storage controller type.
d. Storage controller firmware revision.
4) SBD configuration
a. Single or mirrored partition?
5) LAN configuration
a. Network interface card, hub or switch type.
b. Device driver name and revision.
c. Dedicated heartbeat or shared public LAN?
6) Server configuration
a. Type, memory and number of CPUs.
b. Software revisions and patches.
c. List of NLMs.
d. AUTOEXEC and LDNCS NCF files.
7) LAN packet trace
8) Console log files
9) Abend log files
10) Server coredump files

Observable Coincidences?

1) Are poison pills experienced at the same time of day as scheduled maintenance activities? Examples are server inventory or health checks, virus scans, backups or periodic NDS activity.

For more information on NetWare 6.0, see the NetWare 6.0 Readme Addendum [no longer available].

document

Document Title: NetWare Cluster Services:  The Gory details of Heartbeats, Split Brains and Poison Pills
Document ID: 10053882
Solution ID: NOVL14220
Creation Date: 05Jun2000
Modified Date: 18Feb2003
Novell Product Class: Groupware
NetWare

disclaimer

The Origin of this information may be internal or external to Novell. Novell makes all reasonable efforts to verify this information. However, the information provided in this document is for your information only. Novell makes no explicit or implied claims to the validity of this information.
Any trademarks referenced in this document are the property of their respective owners. Consult your product manuals for complete trademark information.