Affected versions: SUSE Linux Enterprise Server 16

Table of contents
  1. Symptom & Impact
  2. Environment & Reproduction
  3. Root Cause Analysis
  4. Quick Triage
  5. Step-by-Step Diagnosis
  6. Solution – Primary Fix
  7. Solution – Alternative Approaches
  8. Verification & Acceptance Criteria
  9. Rollback Plan
  10. Prevention & Hardening
  11. Related Errors & Cross-Refs
  12. References & Further Reading

Symptom & Impact

On each node, `crm status` reports the local node as DC and the peer as offline or unclean; resources flap between nodes and STONITH fires repeatedly.

Environment & Reproduction

Occurs on SLES 16 High Availability clusters that use a single corosync ring routed through a switch that intermittently drops packets.

Root Cause Analysis

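When corosync heartbeats are lost on the only ring, the nodes can no longer see each other, so each partition forms its own membership and claims quorum independently (a two-node cluster permits this when `two_node: 1` is set).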

Quick Triage

Run `corosync-cfgtool -s` and `crm status` on each node to see the ring (link) status and which node each side considers the DC.
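
The triage commands can be bundled into a small script, shown here as a minimal sketch. The helper names `run_if_present` and `triage` are hypothetical, and `corosync-quorumtool -s` is an addition to the two commands named above; it prints the quorum state and membership as seen by the local node.

```shell
#!/bin/sh
# Triage sketch: run on each node and compare the outputs side by side.

# Run a command only if it is installed; otherwise say so instead of failing.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "== $1: not found, skipped =="
    fi
}

triage() {
    echo "node: $(uname -n)"
    run_if_present corosync-cfgtool -s      # link (ring) status per peer
    run_if_present corosync-quorumtool -s   # quorum state and membership
    run_if_present crm status               # DC and resources as seen locally
}
```

Calling `triage > "/tmp/triage-$(uname -n).txt"` on each node gives one file per node that can be compared afterwards; the output path is only an example.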

Step-by-Step Diagnosis

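Tail `/var/log/cluster/corosync.log` and `/var/log/pacemaker/pacemaker.log` on both nodes while the event is happening, and compare the timestamps of membership changes and fencing actions.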
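
To cut the corosync log down to the lines that matter for a split-brain, a filter along these lines may help. It is a sketch that assumes corosync's usual bracketed subsystem tags (`[TOTEM ]`, `[QUORUM]`, `[KNET  ]`) appear in the log; the function name `membership_events` is hypothetical.

```shell
#!/bin/sh
# Keep only corosync lines tagged with the membership, quorum, or link subsystems.
# Reads log text on stdin, so it works with a log file or with journalctl output.
membership_events() {
    grep -E '\[(TOTEM|QUORUM|KNET) *\]'
}
```

Typical use would be `membership_events < /var/log/cluster/corosync.log`, or `journalctl -u corosync --since "-10min" | membership_events` if corosync logs to the journal.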

Solution – Primary Fix

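Add a second corosync ring (link) on an independent network path, or add a QDevice as a quorum arbiter, then restart the cluster stack with `crm cluster restart` so the change takes effect.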
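
Whether a redundant ring is actually configured can be checked by counting the distinct `ringN_addr` keys in the node list of `corosync.conf`. The following is a minimal sketch; the function name `count_rings` is hypothetical, and it only inspects the file, not the running cluster.

```shell
#!/bin/sh
# Count the distinct ringN_addr keys in a corosync.conf supplied on stdin.
# 1 means a single ring; 2 or more means redundant links are configured.
count_rings() {
    awk -F: '
        {
            key = $1
            gsub(/[ \t]/, "", key)                     # strip indentation
            if (key ~ /^ring[0-9]+_addr$/) seen[key] = 1
        }
        END {
            n = 0
            for (k in seen) n++
            print n
        }'
}
```

Run it as `count_rings < /etc/corosync/corosync.conf` on each node; commented-out entries are ignored because their key starts with `#`.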

Solution – Alternative Approaches

Use SBD fencing backed by a hardware watchdog so that a node which loses contact is fenced cleanly until heartbeats stabilise.

Verification & Acceptance Criteria

On every node, `crm_mon -1` shows the same single DC and all resources `Started` on the expected node.
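
The single-DC check can be scripted by pulling the DC name out of the `crm_mon -1` summary on each node and comparing the results. This sketch assumes the summary contains a `Current DC: <name>` line; the function name `current_dc` is hypothetical.

```shell
#!/bin/sh
# Extract the DC node name from `crm_mon -1` output supplied on stdin.
current_dc() {
    sed -n 's/.*Current DC: *\([^ ]*\).*/\1/p'
}
```

For example, `for n in node1 node2; do ssh "$n" crm_mon -1 | current_dc; done | sort -u` should print exactly one node name once the cluster has healed; `node1` and `node2` are placeholder host names.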

Rollback Plan

If recovery is slow, put the cluster in maintenance mode and restart corosync on one node at a time.
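
The rollback sequence can be rehearsed as a dry run before touching the cluster. The sketch below prints the commands by default and only executes them when `DRY_RUN=0` is set. It assumes passwordless `ssh` between nodes and uses `crm cluster restart` to restart the cluster stack (corosync and Pacemaker together) on each node; the function names `run` and `rollback` are hypothetical.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: rollback <node> [<node>...]
rollback() {
    run crm maintenance on                   # stop Pacemaker from managing resources
    for node in "$@"; do
        run ssh "$node" crm cluster restart  # restart the stack on one node at a time
        run ssh "$node" crm_mon -1           # check membership before moving on
    done
    run crm maintenance off                  # hand resources back to Pacemaker
}
```

Running `rollback node1 node2` with the default settings only lists the steps, which makes it easy to review the order before executing it.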

Prevention & Hardening

Run periodic ring health checks with `corosync-cfgtool -s` (not `-R`, which tells corosync to reload its configuration) and alert on failures through Prometheus.
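
One way to feed a ring health check into Prometheus is to turn the exit status of `corosync-cfgtool -s` into a gauge for the node_exporter textfile collector. This is a sketch under assumptions: it relies on `corosync-cfgtool -s` exiting non-zero when a link is faulty, as its man page describes (confirm this on the installed version), and the metric name `corosync_links_healthy` and the function name `emit_link_metric` are made up for illustration.

```shell
#!/bin/sh
# Emit a Prometheus-style gauge from a health command's exit status.
# Usage: emit_link_metric <command> [args...]
emit_link_metric() {
    if "$@" >/dev/null 2>&1; then
        healthy=1
    else
        healthy=0
    fi
    echo "# TYPE corosync_links_healthy gauge"
    echo "corosync_links_healthy $healthy"
}
```

From a cron job or systemd timer, `emit_link_metric corosync-cfgtool -s` can be redirected into a `.prom` file in whatever directory the textfile collector is configured to read, and an alert rule can then fire when the gauge drops to 0.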

Related Errors & Cross-Refs

This problem is often linked to `sbd` fencing alarms raised when watchdog timeouts trip.

References & Further Reading

SUSE Linux Enterprise High Availability Extension administration guide.
