Understanding high availability (HA) on SUSE Linux

High availability and business continuity are crucial for keeping applications and services running without interruption.
High availability clusters allow critical services to keep running, even if servers or hardware components fail.
SUSE Linux offers a robust set of tools for creating and managing these clusters.
In this article, we explore the current state of clustering in SUSE Linux, with a focus on key technologies such as Pacemaker, Corosync, DRBD and others.
These are available, with minor differences, on x86 and ppc64le.

Pacemaker: the brain of the cluster

Pacemaker is the engine that manages high availability clusters in SUSE Linux.
Its main function is to manage cluster resources, ensuring that critical services are operational and recover quickly in case of failure. Pacemaker continuously monitors resources (databases, web services, file systems, etc.) and, if it detects a problem, restarts them or migrates them to other nodes in the cluster to keep them up and running.
Pacemaker stands out for its flexibility and ability to manage a wide variety of resources.
From simple services to more complex distributed systems, it is capable of handling most high-availability scenarios that a company may need.
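As a rough illustration of what "monitoring a resource" looks like in practice, the sketch below builds the crmsh command for a virtual IP with a periodic health check. The resource name `vip_web`, the address and the `run` helper are assumptions made for this example; by default the script only prints the command, so it can be tried away from a cluster node.

```shell
#!/bin/sh
# Dry-run helper: print each command instead of executing it.
# Set DRY_RUN=0 on a real cluster node to actually run the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Define a virtual IP resource that Pacemaker checks every 10 seconds.
# $1 = resource name (hypothetical), $2 = IP address (hypothetical).
create_vip() {
    run crm configure primitive "$1" ocf:heartbeat:IPaddr2 \
        params ip="$2" cidr_netmask=24 \
        op monitor interval=10s timeout=20s
}

create_vip vip_web 192.168.1.100
```

The `op monitor` part is what tells Pacemaker how often to check the resource; the netmask and the timings are placeholders to adapt to your network.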

Corosync: the cluster’s nervous system

Corosync is responsible for communication between cluster nodes.
It ensures that all nodes have the same view of the cluster status at all times, which is essential for coordinated decision making.
It also manages quorum, which determines whether there are enough active nodes for the cluster to operate safely.
If quorum is lost, the cluster can stop or freeze the resources in the partition that has no quorum, which prevents split-brain situations and the data loss they can cause.
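The quorum rule itself is simple majority arithmetic. The sketch below assumes one vote per node and ignores special cases; the function names are made up for this example.

```shell
#!/bin/sh
# Votes needed for quorum with one vote per node: floor(n / 2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) when the active nodes form a majority of the total.
# $1 = total nodes in the cluster, $2 = nodes currently active.
has_quorum() {
    [ "$2" -ge "$(quorum_votes "$1")" ]
}

for total in 2 3 5; do
    echo "nodes=$total quorum=$(quorum_votes "$total")"
done
```

Note that with plain majority a two-node cluster loses quorum as soon as one node fails, which is why Corosync's votequorum service offers a dedicated `two_node` option for that layout.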

DRBD: the backbone of the data

DRBD (Distributed Replicated Block Device) is a block-level storage replication solution that replicates data between nodes in real time.
With DRBD, data from one server is replicated to another server almost instantaneously, creating an exact copy.
This is especially useful in scenarios where it is crucial that critical data is always available, even if a node fails.
Combined with Pacemaker, DRBD allows services to continue operating with access to the same data, even if they are on different nodes.
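To make this more concrete, here is a minimal sketch of a two-node DRBD resource definition, written to a scratch file rather than to the live configuration. It uses the classic DRBD 8.4-style syntax; newer DRBD 9 setups typically also declare a `node-id` per host. The resource name `r0`, the host names, the backing disk and the addresses are all assumptions for the example.

```shell
#!/bin/sh
# Write an example DRBD resource definition to the file given as $1.
# Hypothetical values: resource "r0", hosts node1/node2, backing disk /dev/sdb1.
write_drbd_resource() {
    cat > "$1" <<'EOF'
resource r0 {
  device    /dev/drbd0;
  disk      /dev/sdb1;
  meta-disk internal;
  on node1 {
    address 192.168.1.11:7789;
  }
  on node2 {
    address 192.168.1.12:7789;
  }
}
EOF
}

conf="${TMPDIR:-/tmp}/r0.res.example"
write_drbd_resource "$conf"
cat "$conf"
```

On a real pair of nodes, a file like this would normally live under `/etc/drbd.d/` on both hosts, and the host names in the `on` sections must match each node's actual host name.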

Other key technologies in SUSE Linux clusters

In addition to Pacemaker, Corosync and DRBD, there are other essential technologies for building robust clusters on SUSE Linux:

SBD (Storage-Based Death): SBD is a fencing tool that isolates a misbehaving node so that it cannot cause problems in the cluster.
It relies on a shared storage device through which nodes exchange messages, including the order for a failed node to shut itself down.
OCF (Open Cluster Framework): OCF scripts are the basis of the resources managed by Pacemaker.
They define how to start, stop and check the status of a resource, providing the flexibility needed to integrate a wide range of services into the cluster.
Csync2: A tool for synchronizing files between nodes in a cluster.
It ensures that configuration files and other critical data are always up to date on all nodes.
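Since OCF resource agents are ordinary scripts, the start/stop/monitor contract described above can be sketched in a few lines of shell. This is a toy agent, not a production one: the "service" is just a state file, the function names and the state file path are assumptions, and a real agent would also implement actions such as `meta-data` and `validate-all`.

```shell
#!/bin/sh
# Minimal OCF-style resource agent sketch.
# The "service" is only a state file; a real agent would manage a real daemon.

# Standard OCF exit codes that Pacemaker interprets.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Hypothetical state file location, overridable through the environment.
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/demo-agent.state}"

agent_monitor() {
    [ -f "$STATE_FILE" ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

agent_start() {
    agent_monitor && return $OCF_SUCCESS      # already running: start is idempotent
    touch "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

agent_stop() {
    rm -f "$STATE_FILE"                        # stopping a stopped resource also succeeds
    return $OCF_SUCCESS
}

agent_main() {
    case "$1" in
        start)   agent_start ;;
        stop)    agent_stop ;;
        monitor) agent_monitor ;;
        *)       return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# Run one action and print its exit code.
show() {
    rc=0
    agent_main "$1" || rc=$?
    echo "$1 -> $rc"
}

show start
show monitor
show stop
show monitor
```

The last `monitor` call reports exit code 7 (not running), which is how Pacemaker tells a cleanly stopped resource apart from a failed one.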

Current status and future trends

Clusters in SUSE Linux have matured and are adapting to new business demands.
With the growing adoption of containerized environments and of deployments spread across different clouds, clusters in SUSE Linux are evolving to integrate better with them.
This includes improved support for container orchestration and for distributed applications that require high availability beyond replicating two disks with DRBD and keeping a virtual IP alive :)
Still, today, the combination of Pacemaker, Corosync, DRBD and other tools provides a solid foundation for creating high availability clusters that can scale and adapt to the needs of SAP HANA and other solutions that require high, if not total, availability.
If you need help, we at SIXE can assist you.

Cheatsheet for creating and managing clusters with Pacemaker on SUSE Linux

Here is a modest cheatsheet to help you create and manage clusters with Pacemaker on SUSE Linux.
Sharing is caring!

Task	Command / Description
Package installation
Installing Pacemaker and Corosync	`zypper install -y pacemaker corosync crmsh`
Basic configuration
Configure the Corosync file	Edit `/etc/corosync/corosync.conf` to define the transport, interfaces and network.
Start services	`systemctl start corosync && systemctl start pacemaker`
Enable services at startup	`systemctl enable corosync && systemctl enable pacemaker`
Cluster management
View cluster status	`crm status`
See node details	`crm_node -l`
Add a new node	`crm cluster add <node_name>`
Remove a node	`crm cluster remove <node_name>`
View cluster logs	`journalctl -u corosync -u pacemaker`
Resource configuration
Create a resource	`crm configure primitive <resource_name> <agent_type> params <parameters>`
Delete a resource	`crm configure delete <resource_name>`
Modify a resource	`crm configure edit <resource_name>`
Show complete cluster configuration	`crm configure show`
Configuration of groups and sets
Create a resource group	`crm configure group <group_name> <resource1> <resource2> ...`
Create a colocated set (resources kept on the same node)	`crm configure colocation <set_name> inf: <resource1> <resource2>`
Create an execution order	`crm configure order <order_name> Mandatory: <resource1> <resource2>`
Constraints and placement
Create a colocation constraint	`crm configure colocation <constraint_name> inf: <resource1> <resource2>`
Create a location constraint	`crm configure location <location_name> <resource> <score>: <node>`
Failover and recovery
Force migration of a resource	`crm resource migrate <resource_name> <node_name>`
Clear the failure status of a resource	`crm resource cleanup <resource_name>`
Temporarily stop managing a resource (it keeps its current state)	`crm resource unmanage <resource_name>`
Resume management of a resource	`crm resource manage <resource_name>`
Advanced configuration
Configure the quorum policy	`crm configure property no-quorum-policy=<policy>` (for example `freeze`)
Configure fencing	`crm configure primitive stonith-sbd stonith:external/sbd params pcmk_delay_max=<time>`
Configure the timeout of a resource operation	`crm configure primitive <resource_name> <agent_type> op start timeout=<time> interval=<interval>`
Validation and testing
Validate cluster configuration	`crm_verify --live-check`
Simulate a node failure	`crm_simulate --live-check --simulate --node-down=<node_name>`
Policy management
Configure default resource stickiness	`crm configure rsc_defaults resource-stickiness=<value>`
Configure resource priority	`crm resource meta <resource_name> set priority <value>`
Stopping and starting the cluster
Stop the entire cluster	`crm cluster stop --all`
Start up the entire cluster	`crm cluster start --all`
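
For the "Configure the Corosync file" step in the table, the sketch below generates a minimal two-node `corosync.conf` in a scratch location so it can be reviewed before anything is copied to `/etc/corosync/`. The cluster name, node addresses and the choice of `udpu` as transport are assumptions for this example; the right transport depends on your Corosync version and network.

```shell
#!/bin/sh
# Generate a minimal two-node corosync.conf in the file given as $1.
# Hypothetical values: cluster "hacluster", nodes at 192.168.1.11 and 192.168.1.12.
write_corosync_conf() {
    cat > "$1" <<'EOF'
totem {
    version: 2
    cluster_name: hacluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 192.168.1.11
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.12
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_syslog: yes
}
EOF
}

conf="${TMPDIR:-/tmp}/corosync.conf.example"
write_corosync_conf "$conf"
echo "Example configuration written to $conf"
```

The same file has to be present on every node before Corosync and Pacemaker are started.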

Published by sixe, 12 months ago
