Understanding high availability (HA) on SUSE Linux

High availability and business continuity are crucial to keep applications and services always operational.
High availability clusters allow critical services to keep running, even if servers or hardware components fail.
SUSE Linux offers a robust set of tools for creating and managing these clusters.
In this article, we explore the current state of clustering in SUSE Linux, with a focus on key technologies such as Pacemaker, Corosync, DRBD and others.
With minor differences, these tools are available on both x86 and ppc64le.

Pacemaker: the brain of the cluster

Pacemaker is the engine that manages high availability clusters in SUSE Linux.
Its main function is to manage cluster resources, ensuring that critical services are operational and recover quickly in case of failure. Pacemaker continuously monitors resources (databases, web services, file systems, etc.) and, if it detects a problem, migrates those resources to other nodes in the cluster to keep them up and running.
Pacemaker stands out for its flexibility and ability to manage a wide variety of resources.
From simple services to more complex distributed systems, it is capable of handling most high-availability scenarios that a company may need.
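
To see this behavior first-hand, here is a minimal sketch using crmsh and the ocf:pacemaker:Dummy agent (a no-op agent shipped with Pacemaker precisely for this kind of experiment); the resource name and monitor interval are illustrative:

    # Define a trivial resource to watch Pacemaker's monitor-and-recover
    # loop without touching a real service
    crm configure primitive test_dummy ocf:pacemaker:Dummy \
        op monitor interval=10s

    # Observe the cluster reacting in real time (Ctrl+C to exit)
    crm_mon

Putting a node in standby with crm node standby then lets you watch Pacemaker move the resource to another node.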

Corosync: the cluster’s nervous system

Corosync is responsible for communication between cluster nodes.
It ensures that all nodes have the same view of the cluster status at all times, which is essential for coordinated decision making.
It also manages quorum, which determines whether there are enough active nodes for the cluster to operate safely.
If quorum is lost, the cluster can take protective measures, such as stopping or freezing resources, to avoid split-brain situations and the data loss they can cause.
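
As an illustrative, not production-ready, example, a minimal /etc/corosync/corosync.conf for a two-node cluster could look roughly like this; the cluster name, network addresses and node IDs are placeholders:

    totem {
        version: 2
        cluster_name: hacluster
        transport: udpu                  # unicast UDP between the nodes
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0     # network address, not a host address
        }
    }

    nodelist {
        node {
            ring0_addr: 192.168.1.11
            nodeid: 1
        }
        node {
            ring0_addr: 192.168.1.12
            nodeid: 2
        }
    }

    quorum {
        provider: corosync_votequorum
        two_node: 1                      # relaxed quorum rules for two nodes
    }

Once Corosync and Pacemaker are running, corosync-quorumtool -s shows the current membership and quorum state.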

DRBD: the backbone of the data

DRBD (Distributed Replicated Block Device) is a block-level storage replication solution that replicates data between nodes in real time.
With DRBD, data from one server is replicated to another server almost instantaneously, creating an exact copy.
This is especially useful in scenarios where it is crucial that critical data is always available, even if a node fails.
Combined with Pacemaker, DRBD allows services to continue operating with access to the same data, even if they are on different nodes.
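
For illustration, a simple DRBD resource definition (for example in /etc/drbd.d/r0.res) might look like the following; the hostnames, backing disk and addresses are placeholders:

    resource r0 {
        device    /dev/drbd0;
        disk      /dev/sdb1;             # backing device on each node
        meta-disk internal;

        on node1 {
            address 192.168.1.11:7789;
        }
        on node2 {
            address 192.168.1.12:7789;
        }
    }

After drbdadm create-md r0 and drbdadm up r0 on both nodes, drbdadm status r0 shows the replication state; under Pacemaker the device is typically managed through the ocf:linbit:drbd resource agent so that promotion follows failover.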

Other key technologies in SUSE Linux clusters

In addition to Pacemaker, Corosync and DRBD, there are other essential technologies for building robust clusters on SUSE Linux:

  • SBD (STONITH Block Device, also expanded as Storage-Based Death): SBD is a fencing tool that isolates a misbehaving node before it can cause problems in the rest of the cluster.
    It achieves this through a shared storage device that the nodes use to exchange state messages (see the sketch after this list).
  • OCF (Open Cluster Framework): OCF scripts are the basis of the resources managed by Pacemaker.
    They define how to start, stop and check the status of a resource, providing the flexibility needed to integrate a wide range of services into the cluster.
  • Csync2: A tool for synchronizing files between nodes in a cluster.
    It ensures that configuration files and other critical data are always up to date on all nodes.
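
As a hedged sketch of a typical SBD setup (the device path is a placeholder; check the sbd man page and the SUSE documentation for your release):

    # Write SBD metadata to the shared device (this destroys its contents!)
    sbd -d /dev/disk/by-id/scsi-EXAMPLE create

    # Inspect the header and the per-node message slots
    sbd -d /dev/disk/by-id/scsi-EXAMPLE dump
    sbd -d /dev/disk/by-id/scsi-EXAMPLE list

    # Tell the sbd daemon which device to watch, then wire it into Pacemaker
    echo 'SBD_DEVICE="/dev/disk/by-id/scsi-EXAMPLE"' >> /etc/sysconfig/sbd
    crm configure primitive stonith-sbd stonith:external/sbd
    crm configure property stonith-enabled=true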

Current status and future trends

Clusters in SUSE Linux have matured and are adapting to new business demands.
With the growing adoption of containerized environments and of deployments spread across different clouds, clusters in SUSE Linux are evolving to integrate better with them.
This includes improved support for container orchestration and for distributed applications whose availability needs go beyond replicating a pair of disks with DRBD and keeping a virtual IP alive :) Still, today the combination of Pacemaker, Corosync, DRBD and the other tools described here provides a solid foundation for building high availability clusters that can scale and adapt to the needs of SAP HANA and other solutions that demand high, if not total, availability. If you need help, at SIXE we can help you.

Cheatsheet for creating and managing clusters with Pacemaker on SUSE Linux

Here is a modest cheatsheet to help you create and manage clusters with Pacemaker on SUSE Linux.
Sharing is caring!

Package installation
  • Install Pacemaker, Corosync and the crm shell: zypper install -y pacemaker corosync crmsh

Basic configuration
  • Configure Corosync: edit /etc/corosync/corosync.conf to define the transport, interfaces and network.
  • Start the services: systemctl start corosync && systemctl start pacemaker
  • Enable the services at boot: systemctl enable corosync && systemctl enable pacemaker

Cluster management
  • View cluster status: crm status
  • List cluster nodes: crm_node -l
  • Add a new node: run crm cluster join -c <existing_node> on the node that joins the cluster
  • Remove a node: crm node delete <node_name>
  • View cluster history and logs: crm history log

Resource configuration
  • Create a resource: crm configure primitive <resource_name> <agent_type> params <parameters>
  • Delete a resource: crm configure delete <resource_name>
  • Modify a resource: crm configure edit <resource_name>
  • Show the complete cluster configuration: crm configure show

Groups and sets
  • Create a resource group: crm configure group <group_name> <resource1> <resource2> ...
  • Create a colocation set: crm configure colocation <set_name> inf: <resource1> <resource2>
  • Create a start order: crm configure order <order_name> Mandatory: <resource1> <resource2>

Constraints and placement
  • Create a colocation constraint: crm configure colocation <constraint_name> inf: <resource1> <resource2>
  • Create a location constraint: crm configure location <location_name> <resource> <score>: <node>

Failover and recovery
  • Force migration of a resource: crm resource migrate <resource_name> <node_name>
  • Clean up the status of a resource: crm resource cleanup <resource_name>
  • Temporarily stop managing a resource: crm resource unmanage <resource_name>
  • Manage a resource again after unmanaging it: crm resource manage <resource_name>

Advanced configuration
  • Configure the quorum policy: crm configure property no-quorum-policy=<freeze|stop|ignore>
  • Configure fencing: crm configure primitive stonith-sbd stonith:external/sbd params pcmk_delay_max=<time>
  • Configure a resource's start timeout: crm configure primitive <resource_name> <agent_type> op start timeout=<time> interval=<interval>

Validation and testing
  • Validate the cluster configuration: crm_verify --live-check
  • Simulate the cluster's reaction to events: crm_simulate --run

Policy management
  • Configure the recovery (stickiness) policy: crm configure rsc_defaults resource-stickiness=<value>
  • Configure the default resource priority: crm configure rsc_defaults priority=<value>

Stopping and starting the cluster
  • Stop the entire cluster: crm cluster stop --all
  • Start the entire cluster: crm cluster start --all
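
To see several of these commands working together, here is a hedged end-to-end sketch for a floating IP plus Apache on a two-node cluster; the resource names, IP address, configuration path and node names are all illustrative:

    # A virtual IP and a web server, grouped so they always move together
    crm configure primitive vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.100 cidr_netmask=24 \
        op monitor interval=10s
    crm configure primitive web ocf:heartbeat:apache \
        params configfile=/etc/apache2/httpd.conf \
        op monitor interval=30s
    crm configure group web_group vip web

    # Prefer node1, but stay put after a failover instead of failing back
    crm configure location web_loc web_group 100: node1
    crm configure rsc_defaults resource-stickiness=200

    # Review the result, then rehearse a failover by hand
    crm configure show
    crm resource migrate web_group node2
    crm resource cleanup web_group

Note that crm resource migrate works by adding a temporary location constraint; crm resource unmigrate removes it once the test is over.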

 

SIXE