Understanding high availability (HA) on SUSE Linux

High availability and business continuity are crucial to keep applications and services always operational.
High availability clusters allow critical services to keep running, even if servers or hardware components fail.
SUSE Linux offers a robust set of tools for creating and managing these clusters.
In this article, we explore the current state of clustering in SUSE Linux, with a focus on key technologies such as Pacemaker, Corosync, DRBD and others.
These technologies are available, with minor differences, on both x86_64 and ppc64le.

Pacemaker: the brain of the cluster

Pacemaker is the engine that manages high availability clusters in SUSE Linux.
Its main function is to manage cluster resources, ensuring that critical services are operational and recover quickly in case of failure. Pacemaker continuously monitors resources (databases, web services, file systems, etc.) and, if it detects a problem, migrates those resources to other nodes in the cluster to keep them up and running.
Pacemaker stands out for its flexibility and ability to manage a wide variety of resources.
From simple services to more complex distributed systems, it is capable of handling most high-availability scenarios that a company may need.
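As a minimal sketch of what this looks like in practice, the following crmsh commands define and check a single resource; the resource name and configuration path are hypothetical examples, not values from any particular setup:

    # Define an Apache web server as a cluster resource (names are examples)
    crm configure primitive web-server ocf:heartbeat:apache \
        params configfile=/etc/apache2/httpd.conf \
        op monitor interval=10s timeout=20s
    # Check which node is running the resource
    crm status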

Corosync: the cluster’s nervous system

Corosync is responsible for communication between cluster nodes.
It ensures that all nodes have the same view of the cluster status at all times, which is essential for coordinated decision making.
It also manages quorum, which determines whether there are enough active nodes for the cluster to operate safely.
If quorum is lost, measures can be taken to prevent data loss or even service downtime.
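For reference, a trimmed-down corosync.conf sketch for a two-node cluster; the addresses and names are illustrative, and on SUSE this file is normally generated by the cluster bootstrap tooling rather than written by hand:

    # /etc/corosync/corosync.conf (illustrative excerpt)
    totem {
        version: 2
        cluster_name: hacluster
        transport: udpu            # unicast UDP between the nodes
    }
    nodelist {
        node {
            ring0_addr: 192.168.1.11
            nodeid: 1
        }
        node {
            ring0_addr: 192.168.1.12
            nodeid: 2
        }
    }
    quorum {
        provider: corosync_votequorum
        two_node: 1                # relaxed quorum rules for two-node clusters
    }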

DRBD: the backbone of the data

DRBD (Distributed Replicated Block Device) is a block-level storage replication solution that replicates data between nodes in real time.
With DRBD, data from one server is replicated to another server almost instantaneously, creating an exact copy.
This is especially useful in scenarios where it is crucial that critical data is always available, even if a node fails.
Combined with Pacemaker, DRBD allows services to continue operating with access to the same data, even if they are on different nodes.
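As an illustration, a DRBD resource definition might look like the sketch below; the resource name, hostnames, disks and addresses are placeholders for your environment:

    # /etc/drbd.d/r0.res (illustrative; hostnames and disks are examples)
    resource r0 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        meta-disk internal;
        net {
            protocol C;            # fully synchronous replication
        }
        on node1 {
            address 192.168.1.11:7789;
        }
        on node2 {
            address 192.168.1.12:7789;
        }
    }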

Other key technologies in SUSE Linux clusters

In addition to Pacemaker, Corosync and DRBD, there are other essential technologies for building robust clusters on SUSE Linux:

  • SBD (STONITH Block Device): a fencing tool that isolates a misbehaving node so that it cannot cause problems in the cluster.
    It achieves this through a shared storage device that the nodes use to communicate their state (see the SBD sketch after this list).
  • OCF (Open Cluster Framework): OCF scripts are the basis of the resources managed by Pacemaker.
    They define how to start, stop and check the status of a resource, providing the flexibility needed to integrate a wide range of services into the cluster (a minimal agent skeleton follows below).
  • Csync2: a tool for synchronizing files between the nodes of a cluster.
    It ensures that configuration files and other critical data are always up to date on all nodes.
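Here is a sketch of SBD setup under the assumption of a dedicated shared disk; the device path below is a placeholder:

    # Initialize the SBD metadata on a shared disk (device path is an example)
    sbd -d /dev/disk/by-id/scsi-EXAMPLE create
    # Inspect the per-node message slots on the device
    sbd -d /dev/disk/by-id/scsi-EXAMPLE list
    # Tell the SBD daemon which device to watch, then enable it
    echo 'SBD_DEVICE="/dev/disk/by-id/scsi-EXAMPLE"' >> /etc/sysconfig/sbd
    systemctl enable sbd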
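And a minimal skeleton of an OCF resource agent, showing only the actions Pacemaker invokes; a real agent would also declare its parameters in the meta-data XML:

    #!/bin/sh
    # Skeleton OCF resource agent (illustrative only)
    case "$1" in
        start)
            # Start the service here; return 0 on success
            exit 0 ;;
        stop)
            # Stop the service here; return 0 on success
            exit 0 ;;
        monitor)
            # Return 0 if running, 7 (OCF_NOT_RUNNING) if cleanly stopped
            exit 7 ;;
        meta-data)
            # Print the XML that describes the agent and its parameters
            echo '<resource-agent name="example"/>'
            exit 0 ;;
        *)
            exit 3 ;;              # OCF_ERR_UNIMPLEMENTED
    esac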

Current status and future trends

Clusters in SUSE Linux have matured and are adapting to new business demands.
With the growing adoption of containerized environments and of workloads spread across different clouds, clusters in SUSE Linux are evolving to integrate better with them.
This includes improved support for container orchestration and for distributed applications whose availability needs go beyond replicating a pair of disks with DRBD and keeping a virtual IP alive :)
Still, today, the combination of Pacemaker, Corosync, DRBD and other tools provides a solid foundation for building high availability clusters that can scale and adapt to the needs of SAP HANA and other solutions that require high, if not total, availability.
If you need help, at SIXE we can help you.
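To make that classic pattern concrete, here is a sketch of how Pacemaker can tie DRBD and a virtual IP together in crmsh; every resource name, device and address below is illustrative:

    # Promotable DRBD resource (names, devices and IP are examples)
    crm configure primitive drbd-r0 ocf:linbit:drbd \
        params drbd_resource=r0 op monitor interval=15s
    crm configure ms ms-drbd-r0 drbd-r0 \
        meta master-max=1 clone-max=2 notify=true
    # Filesystem and virtual IP that must follow the DRBD primary
    crm configure primitive fs-r0 ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/srv/data fstype=xfs
    crm configure primitive vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.100 cidr_netmask=24
    crm configure group g-services fs-r0 vip
    crm configure colocation col-services inf: g-services ms-drbd-r0:Master
    crm configure order ord-services Mandatory: ms-drbd-r0:promote g-services:start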

Cheatsheet for creating and managing clusters with Pacemaker on SUSE Linux

Here is a modest cheatsheet to help you create and manage clusters with Pacemaker on SUSE Linux.
Sharing is caring!

Package installation

  • Install Pacemaker and Corosync: zypper install -y pacemaker corosync crmsh

Basic configuration

  • Configure Corosync: edit /etc/corosync/corosync.conf to define the transport, interfaces and network.
  • Start the services: systemctl start corosync && systemctl start pacemaker
  • Enable the services at boot: systemctl enable corosync && systemctl enable pacemaker

Cluster management

  • View cluster status: crm status
  • List the cluster nodes: crm_node -l
  • Add a new node (run on the node that is joining): crm cluster join
  • Remove a node: crm cluster remove <node_name>
  • View cluster history and logs: crm history log

Resource configuration

  • Create a resource: crm configure primitive <resource_name> <agent_type> params <parameters>
  • Delete a resource: crm configure delete <resource_name>
  • Modify a resource: crm configure edit <resource_name>
  • Show the complete cluster configuration: crm configure show

Groups and ordering

  • Create a resource group: crm configure group <group_name> <resource1> <resource2> ...
  • Create an execution order: crm configure order <order_name> Mandatory: <resource1> <resource2>

Constraints and placement

  • Create a colocation constraint: crm configure colocation <constraint_name> inf: <resource1> <resource2>
  • Create a location constraint: crm configure location <location_name> <resource> <score>: <node>

Failover and recovery

  • Force migration of a resource: crm resource migrate <resource_name> <node_name>
  • Clean up the status of a resource: crm resource cleanup <resource_name>
  • Set a resource to unmanaged (the cluster stops acting on it): crm resource unmanage <resource_name>
  • Return a resource to managed mode: crm resource manage <resource_name>

Advanced configuration

  • Configure the quorum policy: crm configure property no-quorum-policy=<freeze|stop|ignore|suicide>
  • Configure fencing: crm configure primitive stonith-sbd stonith:external/sbd params pcmk_delay_max=<time>
  • Configure operation timeouts for a resource: crm configure primitive <resource_name> <agent_type> op start timeout=<time> interval=<interval>

Validation and testing

  • Validate the cluster configuration: crm_verify --live-check
  • Simulate a failure: crm_simulate --run

Policy management

  • Configure default resource stickiness (how strongly resources stay put): crm configure rsc_defaults resource-stickiness=<value>
  • Configure a resource's priority: crm resource meta <resource_name> set priority <value>

Stopping and starting the cluster

  • Stop the entire cluster: crm cluster stop --all
  • Start the entire cluster: crm cluster start --all
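As a closing usage example, a minimal two-node bootstrap combining the commands above might look like this; the hostname and address are placeholders, and crm cluster init / crm cluster join are the crmsh bootstrap commands that generate the Corosync configuration for you:

    # On the first node: install the packages and initialize the cluster
    zypper install -y pacemaker corosync crmsh
    crm cluster init
    # On the second node: join the cluster created above (node1 is an example hostname)
    crm cluster join -c node1
    # On any node: create a test resource and verify the cluster state
    crm configure primitive vip ocf:heartbeat:IPaddr2 params ip=192.168.1.100
    crm status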

 

SiXe Ingeniería