Understanding high availability (HA) on SUSE Linux
High availability and business continuity are crucial to keep applications and services always operational.
High availability clusters allow critical services to keep running, even if servers or hardware components fail.
SUSE Linux offers a robust set of tools for creating and managing these clusters.
In this article, we explore the current state of clustering in SUSE Linux, focusing on key technologies such as Pacemaker, Corosync and DRBD.
These tools are available, with minor differences, on both x86_64 and ppc64le.
Pacemaker: the brain of the cluster
Pacemaker is the engine that manages high availability clusters in SUSE Linux.
Its main function is to manage cluster resources, ensuring that critical services are operational and recover quickly in case of failure. Pacemaker continuously monitors resources (databases, web services, file systems, etc.) and, if it detects a problem, migrates those resources to other nodes in the cluster to keep them up and running.
Pacemaker stands out for its flexibility and ability to manage a wide variety of resources.
From simple services to more complex distributed systems, it is capable of handling most high-availability scenarios that a company may need.
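As a minimal sketch, assuming the crmsh shell that ships with SUSE Linux, this is how a floating IP address could be defined as a Pacemaker-managed resource; the resource name, address and intervals here are illustrative, not prescriptive:

```bash
# Minimal sketch: define a floating IP as a Pacemaker resource with crmsh.
# The resource name (vip), the address and the intervals are examples only.
crm configure primitive vip ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.100 cidr_netmask=24 \
    op monitor interval=10s timeout=20s

# Pacemaker now checks the address every 10 seconds; if the node running
# it fails, the address is brought up on another node in the cluster.
crm status
```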
Corosync: the cluster’s nervous system
Corosync is responsible for communication between cluster nodes.
It ensures that all nodes have the same view of the cluster status at all times, which is essential for coordinated decision making.
It also manages quorum, which determines whether there are enough active nodes for the cluster to operate safely.
If quorum is lost, the cluster can take protective measures, such as freezing or stopping resources, to prevent the data corruption a split-brain scenario would cause.
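For illustration, here is a hedged sketch of what the relevant parts of /etc/corosync/corosync.conf might look like for a two-node cluster; the cluster name, node names and addresses are placeholders:

```
# Sketch of /etc/corosync/corosync.conf for a two-node cluster.
# Cluster name, node names and addresses are examples only.
totem {
    version: 2
    cluster_name: hacluster
    transport: udpu          # unicast UDP; newer stacks may use knet
}

nodelist {
    node {
        ring0_addr: 192.168.1.10
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.11
        name: node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1              # special quorum handling for two-node clusters
}
```

The two_node setting relaxes quorum so that a single surviving node can keep running, which is exactly why fencing (see SBD below) becomes essential in such setups.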
DRBD: the data backbone
DRBD (Distributed Replicated Block Device) is a block-level storage replication solution that replicates data between nodes in real time.
With DRBD, data from one server is replicated to another server almost instantaneously, creating an exact copy.
This is especially useful in scenarios where it is crucial that critical data is always available, even if a node fails.
Combined with Pacemaker, DRBD allows services to continue operating with access to the same data, even if they are on different nodes.
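As a sketch of how this replication is described, a DRBD resource definition (for example in /etc/drbd.d/r0.res) might look like the following; host names, devices and addresses are assumptions for illustration:

```
# Sketch of a DRBD resource definition, e.g. /etc/drbd.d/r0.res.
# Host names, device/disk paths and addresses are examples only.
resource r0 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;

    net {
        protocol C;          # synchronous: writes confirmed on both nodes
    }

    on node1 {
        address 192.168.1.10:7789;
    }
    on node2 {
        address 192.168.1.11:7789;
    }
}
```

Under Pacemaker, such a resource is typically managed with the ocf:linbit:drbd agent, so the replica is promoted automatically when a failover occurs.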
Other key technologies in SUSE Linux clusters
In addition to Pacemaker, Corosync and DRBD, there are other essential technologies for building robust clusters on SUSE Linux:
- SBD (STONITH Block Device): SBD is a fencing tool that isolates a misbehaving node so it cannot cause problems in the cluster. It achieves this through a shared storage device that the nodes use to exchange state and fencing messages (see the sketch after this list).
- OCF (Open Cluster Framework): OCF resource agents are the basis of the resources managed by Pacemaker. They define how to start, stop and check the status of a resource, providing the flexibility needed to integrate a wide range of services into the cluster.
- Csync2: a tool for synchronizing files between nodes in a cluster. It ensures that configuration files and other critical data are always up to date on all nodes.
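To make the SBD item concrete, here is a hedged sketch of initializing SBD on a shared disk and wiring it into Pacemaker; the device path is an assumption:

```bash
# Sketch: initialize SBD fencing on a shared disk and declare it to
# Pacemaker. The device path below is an example only.
SBD_DEV=/dev/disk/by-id/scsi-example-shared-disk

# Write SBD metadata to the shared device (destroys existing data on it)
sbd -d "$SBD_DEV" create

# Verify the node slots on the device
sbd -d "$SBD_DEV" list

# Point the sbd daemon at the device (read by the sbd systemd service)
echo "SBD_DEVICE=$SBD_DEV" >> /etc/sysconfig/sbd

# Declare the fencing resource and enable STONITH in Pacemaker
crm configure primitive stonith-sbd stonith:external/sbd
crm configure property stonith-enabled=true
```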
Current status and future trends
Clusters in SUSE Linux have matured and are adapting to new business demands.
With the growing adoption of containerized environments and of hybrid deployments spanning multiple clouds, clusters in SUSE Linux are evolving to integrate better with them. This includes improved support for container orchestration and for distributed applications whose availability needs go beyond replicating two disks with DRBD and keeping a virtual IP alive :)

Still, today the combination of Pacemaker, Corosync, DRBD and other tools provides a solid foundation for building high availability clusters that can scale and adapt to the needs of SAP HANA and other solutions that require high, if not total, availability. If you need help, at SIXE we can help you.
Cheatsheet for creating and managing clusters with Pacemaker on SUSE Linux
Here is a modest cheatsheet to help you create and manage clusters with Pacemaker on SUSE Linux.
Sharing is caring!
| Task | Command / Description |
|---|---|
| **Package installation** | |
| Install Pacemaker and Corosync | `zypper install -y pacemaker corosync crmsh` |
| **Basic configuration** | |
| Configure Corosync | Edit `/etc/corosync/corosync.conf` to define the transport, interfaces and network. |
| Start services | `systemctl start corosync && systemctl start pacemaker` |
| Enable services at boot | `systemctl enable corosync && systemctl enable pacemaker` |
| **Cluster management** | |
| View cluster status | `crm status` |
| List cluster nodes | `crm_node -l` |
| Add a new node | Run `crm cluster join -c <existing_node>` on the new node. |
| Remove a node | `crm cluster remove <node_name>` |
| View cluster logs | `journalctl -u pacemaker`, or `crm report` to collect a log bundle |
| **Resource configuration** | |
| Create a resource | `crm configure primitive <resource_name> <agent_type> params <parameters>` |
| Delete a resource | `crm configure delete <resource_name>` |
| Modify a resource | `crm configure edit <resource_name>` |
| Show the complete cluster configuration | `crm configure show` |
| **Groups and ordering** | |
| Create a resource group | `crm configure group <group_name> <resource1> <resource2> ...` |
| Create an execution order | `crm configure order <order_name> Mandatory: <resource1> <resource2>` |
| **Constraints and placement** | |
| Create a colocation constraint | `crm configure colocation <constraint_name> inf: <resource1> <resource2>` |
| Create a location constraint | `crm configure location <location_name> <resource> <score>: <node>` |
| **Failover and recovery** | |
| Force migration of a resource | `crm resource migrate <resource_name> <node_name>` |
| Clear the status of a resource | `crm resource cleanup <resource_name>` |
| Temporarily stop managing a resource | `crm resource unmanage <resource_name>` |
| Manage a resource again | `crm resource manage <resource_name>` |
| **Advanced configuration** | |
| Configure the quorum policy | `crm configure property no-quorum-policy=<freeze\|stop\|ignore\|suicide>` |
| Configure fencing | `crm configure primitive stonith-sbd stonith:external/sbd params pcmk_delay_max=<time>` |
| Configure resource timeouts | `crm configure primitive <resource_name> <agent_type> op start timeout=<time> op monitor interval=<interval>` |
| **Validation and testing** | |
| Validate the cluster configuration | `crm_verify --live-check` |
| Simulate cluster events | `crm_simulate --run` |
| **Policy management** | |
| Configure resource stickiness (recovery behavior) | `crm configure rsc_defaults resource-stickiness=<value>` |
| Configure resource priority | `crm resource meta <resource_name> set priority <value>` |
| **Stopping and starting the cluster** | |
| Stop the entire cluster | `crm cluster stop --all` |
| Start the entire cluster | `crm cluster start --all` |
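To tie the cheatsheet together, here is a hedged end-to-end sketch that bootstraps a two-node cluster and adds a web service with a floating IP. Node names, addresses and resource names are assumptions, and the Corosync configuration from earlier in the article is taken as already in place:

```bash
# End-to-end sketch: two-node cluster with a web service and floating IP.
# Node names, IPs and resource names are examples only; assumes
# /etc/corosync/corosync.conf is already configured (see above).

# On both nodes: install the HA packages and start the stack
zypper install -y pacemaker corosync crmsh
systemctl enable --now corosync pacemaker

# Define the resources: a virtual IP and an Apache instance
crm configure primitive vip ocf:heartbeat:IPaddr2 \
    params ip=192.168.1.100 cidr_netmask=24 \
    op monitor interval=10s
crm configure primitive web ocf:heartbeat:apache \
    params configfile=/etc/apache2/httpd.conf \
    op monitor interval=30s

# Keep the IP and the web server together, and start the IP first
crm configure colocation web-with-vip inf: web vip
crm configure order vip-before-web Mandatory: vip web

# Check the result and test a failover
crm status
crm resource migrate web node2
```

The colocation and order constraints are what turn two independent resources into one coherent service: the web server always follows its address, and never starts before it.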