Tag Archive for: LibrePower

Run an LLM on IBM i via PASE — No Linux Required

IBM i · March 2026

We Ran an LLM on IBM i. No Linux. No Cloud. No GPU.

llama.cpp compiled for AIX runs natively on IBM i via PASE. Your RPG programs can call a local language model without adding infrastructure or sending data anywhere.

March 2026 · 8 min read

If you manage an IBM i system, you know how this conversation goes. Someone asks about AI, and the answers are always the same: "spin up a Linux LPAR", "use OpenAI", "check out Wallaroo". Every option means leaving the platform, adding layers, and at some point sending business data to a server you don't control.

There are 150,000 IBM i systems processing transactions in banking, insurance, and healthcare. The answer can't always be "add more infrastructure". So we tried something different.

The experiment

What we actually did

We took llama.cpp — the most widely used open-source LLM inference engine — compiled it for AIX, and copied the binary to an IBM i V7R5 partition. We ran it via PASE. It worked on the first try.

$ uname -a
OS400 WWW 5 7 007800001B91

$ /QOpenSys/pkgs/bin/python3 -c "import platform; print(platform.platform())"
OS400-5-007800001B91-powerpc-64bit

$ /QOpenSys/pkgs/bin/python3 -c "import sys; print('Byte order:', sys.byteorder)"
Byte order: big

That's IBM i V7R5 on pub400.com — a public IBM i system. Big-endian, powerpc-64bit, OS400. Not Linux, not AIX. IBM i.

What kind of binary

$ file llama/llama-simple
llama/llama-simple: 64-bit XCOFF executable or object module

A 64-bit XCOFF binary — the native executable format for AIX. Compiled on AIX 7.3 POWER using GCC 13.3 with VSX vector extensions enabled. The same binary from our llama-aix project, which already ships 10 big-endian GGUF models on HuggingFace.

First run

$ LIBPATH=/home/HBSIXE/llama /home/HBSIXE/llama/llama-simple --help

example usage:

    /home/HBSIXE/llama/llama-simple -m model.gguf [-n n_predict] [prompt]

The binary loads, links libggml and libllama, parses arguments, and responds. All inside PASE. To run actual inference, you point it at a big-endian GGUF model:

$ LIBPATH=/home/HBSIXE/llama /home/HBSIXE/llama/llama-simple \
    -m models/tinyllama-1.1b-q4_k_m-be.gguf \
    -p "What is IBM i?" -n 100 -t 4
IBM i PASE terminal running llama.cpp: the XCOFF binary loads, links libraries, and responds to a prompt in real time

The context

Why this matters for IBM i shops

In 2026, the AI conversation in the IBM i community is louder than ever. IBM just launched Bob (the successor to WCA for i), a coding assistant for RPG developers. 70% of IBM i customers plan hardware upgrades this year. And yet there's one question that still doesn't have a clean answer:

How do I integrate an LLM into my IBM i applications without depending on an external service?

The usual options, right now:

Option            | What it means                                                             | The catch
Linux LPAR        | Spin up a separate partition, run the LLM there, call it from RPG via API | New hardware to manage, added cost, data crosses partition boundaries
Cloud API         | Call OpenAI, Azure, or AWS from RPG                                       | Business data leaves the machine. A serious problem in banking, insurance, and healthcare
Wallaroo          | Option 1 packaged as a service                                            | $500/month. Still a Linux LPAR with branding
PASE + llama.cpp  | The LLM runs inside IBM i itself, via PASE                                | No extra hardware. Data never leaves the partition.
What about IBM Bob?

Bob is for the developer: it helps understand, document, and generate RPG code from the IDE. What we describe here is for the production application: an LLM running in the same partition that any RPG program can call like a local API. They solve different problems. Bob for the dev workflow, local inference for the apps themselves.
The technical foundation

PASE: the bridge you already have

PASE (Portable Application Solutions Environment) is a runtime built into IBM i that executes AIX binaries natively. It's not emulation — it's a layer that exposes AIX system calls directly on top of the IBM i kernel. If something runs on AIX, it can run on IBM i via PASE.

┌──────────────────────────────────────────┐
│ IBM i (OS400)                            │
│  ┌──────────────┐    ┌────────────────┐  │
│  │ RPG / CL     │    │ PASE           │  │
│  │ COBOL / Db2  │───→│ (AIX runtime)  │  │
│  │              │    │                │  │
│  │ localhost    │    │ llama-server   │  │
│  │ :8080        │    │ + GGUF model   │  │
│  └──────────────┘    └────────────────┘  │
│            IBM POWER Hardware            │
└──────────────────────────────────────────┘

We've been building and shipping AIX packages through LibrePower's AIX repository for years — over 30 open-source packages installable via DNF. When llama.cpp joined the catalogue, testing the jump to IBM i was the natural next step. PASE handles the rest.

For IBM i administrators

You don't need to install anything special on the operating system. PASE is already active. All you need is the XCOFF binary of llama.cpp and a big-endian GGUF model. The LLM runs as a regular PASE process, without touching the native IBM i environment.

The technical hurdle

The big-endian problem (and how we solved it)

There's a reason nobody had done this cleanly before: byte order. IBM i and AIX are big-endian. Virtually all AI software — x86, ARM, Linux ppc64le — assumes little-endian. A GGUF file downloaded from HuggingFace won't load on IBM i: the bytes are in the wrong order.
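To see why a little-endian GGUF file fails outright on a big-endian machine, here is a minimal Python illustration of the byte-order mismatch (the "version field" is illustrative; GGUF headers contain several such fixed-width integers):

```python
import struct

# A 32-bit header field (say, a format version of 3) written on a
# little-endian x86 machine vs. a big-endian POWER machine:
little = struct.pack('<I', 3)   # b'\x03\x00\x00\x00'
big    = struct.pack('>I', 3)   # b'\x00\x00\x00\x03'
assert little != big

# A big-endian reader that ingests the little-endian bytes sees garbage:
misread = struct.unpack('>I', little)[0]
assert misread == 0x03000000    # 50331648, not 3
```

Every multi-byte integer and float in the file has this problem, which is why the models must be converted once, ahead of time, rather than patched at load.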

We'd already solved this in our AIX work. The solution: convert the models before distributing them. We publish big-endian GGUF models at huggingface.co/librepowerai, validated on real AIX hardware and ready to load directly on IBM i PASE.

Model                | Size    | Quantization
TinyLlama 1.1B Chat  | 668 MB  | Q4_K_M
LFM 1.2B Instruct    | 695 MB  | Q4_K_M
LFM 1.2B Thinking    | 731 MB  | Q4_K_M
7 more available

These are the same models that reach 10–12 tok/s on AIX POWER. On IBM i POWER10 — with MMA hardware acceleration active via OpenBLAS — performance should be comparable or better. Concrete IBM i benchmarks are in progress.

From PoC to production

From proof of concept to production

Running --help proves the binary loads. The real path to useful AI in your applications has three stages, and the first one is available right now.

Stage 1: Direct inference (available now)

From any SSH or QSH session on the IBM i:

# Direct inference from the command line
LIBPATH=/path/to/llama /path/to/llama/llama-simple \
    -m /path/to/model.gguf \
    -p "Summarize this purchase order" -n 200 -t 8

Useful for CL scripts, batch jobs, or just verifying that the model loads and responds correctly on your specific hardware before going further.
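For batch jobs, the same invocation can be wrapped from Python running in PASE. This is a sketch under assumed paths — LLAMA_DIR and the model file below are placeholders, not shipped tooling:

```python
import os
import subprocess

LLAMA_DIR = "/path/to/llama"    # placeholder: llama.cpp install directory
MODEL = "/path/to/model.gguf"   # placeholder: big-endian GGUF model

def build_cmd(prompt, n_predict=200, threads=8):
    """Assemble the llama-simple command line shown above."""
    return [os.path.join(LLAMA_DIR, "llama-simple"),
            "-m", MODEL,
            "-p", prompt,
            "-n", str(n_predict),
            "-t", str(threads)]

def run(prompt):
    """Run one inference; LIBPATH tells PASE where to find libggml/libllama."""
    env = dict(os.environ, LIBPATH=LLAMA_DIR)
    result = subprocess.run(build_cmd(prompt), env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The wrapper mirrors the shell invocation exactly: the only IBM i-specific detail is exporting LIBPATH so the XCOFF binary can resolve its shared libraries.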

Stage 2: OpenAI-compatible API server (coming soon)

llama.cpp includes llama-server, which exposes an HTTP endpoint compatible with the OpenAI API. Once running in PASE, any RPG program can call it using QSYS2.HTTP_POST — exactly like any other API:

# Start the inference server on IBM i via PASE
LIBPATH=/path/to/llama /path/to/llama/llama-server \
    -m /path/to/model.gguf \
    --host 0.0.0.0 --port 8080 -t 8

// Call it from RPG — the LLM is on localhost
dcl-s url varchar(256) inz('http://localhost:8080/v1/chat/completions');
dcl-s body varchar(65535);
dcl-s response varchar(65535);

// Build the request (the model name is illustrative)
body = '{"model":"local","messages":[{"role":"user",'
     + '"content":"Summarize this purchase order"}]}';

// QSYS2.HTTP_POST — no data leaves IBM i
exec sql set :response = QSYS2.HTTP_POST(:url, :body);

The important part: localhost. The model is on the same machine. Data never leaves the partition.
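The same endpoint is reachable from Python in PASE using only the standard library. A minimal sketch, assuming llama-server is listening on port 8080 (the model name is illustrative; llama-server serves whatever model it loaded):

```python
import json
from urllib import request

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # llama-server in PASE

def build_request(prompt, model="local", max_tokens=200):
    """Assemble an OpenAI-style chat completion request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return request.Request(ENDPOINT, data=body,
                           headers={"Content-Type": "application/json"})

def ask(prompt):
    """POST the prompt and return the assistant's reply text."""
    with request.urlopen(build_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Because the request never leaves localhost, the same privacy property holds for any client language: RPG, Python, or anything else on the partition.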

Stage 3: Business application integration (in development)

  • Document analysis: feed Db2 reports to the LLM for automatic summarization
  • Natural language queries: the user types in plain English, the LLM returns SQL
  • RPG code modernization: the LLM analyzes and documents existing programs without leaving IBM i
  • Intelligent monitoring: analyze QSYSOPR messages and job logs with semantic context

A note on performance: small models (1–2B parameters) running in PASE are more than enough for classification, summarization, structured data extraction, and fixed-format responses. For longer text generation or complex reasoning, 7B+ models scale well with more threads. IBM i POWER10 benchmarks are in progress.
Hands-on

How to try it yourself

If you have access to an IBM i with PASE active, it's three steps.

1. Get the llama.cpp binary for AIX

Available on LibrePower's GitLab. If you have DNF/yum configured:

# From AIX (or via PASE if you have dnf)
dnf install llama-cpp

2. Download a big-endian model

curl -L -o tinyllama-be.gguf \
  "https://huggingface.co/librepowerai/TinyLlama-1.1B-Chat-v1.0-GGUF-big-endian/resolve/main/tinyllama-1.1b-q4_k_m-be.gguf"

TinyLlama is a solid starting point: 668 MB, fast to load, and enough to verify everything works before moving to larger models.

3. Run inference

LIBPATH=/path/to/llama ./llama-simple \
    -m tinyllama-be.gguf \
    -p "What is IBM i?" \
    -n 150 -t 4

IBM i in production?

SIXE has been supporting IBM i environments for years. If you want to understand whether this approach fits your architecture — or what it means for your RPG applications — get in touch. No strings attached.

Roadmap

What's next

This is a solid proof of concept, not a finished product. Here's what we're working on next:

  • llama-server on IBM i — the HTTP API server running in PASE, documented and packaged so you can get it running in minutes
  • RPG integration examples — real code for calling the LLM from RPG programs via QSYS2.HTTP_POST
  • IBM i POWER10/POWER11 benchmarks — real tok/s measurements with PASE on production hardware
  • Larger models — testing 7B+ models on partitions with enough memory
  • vLLM for IBM i — our vLLM package for ppc64le, adapted to run in PASE

More from LibrePower

Project               | What it does
llama-aix             | llama.cpp for AIX with 10 big-endian GGUF models ready to download
linux.librepower.org  | APT repository with vLLM for Linux ppc64le (Ubuntu/Debian)
aix.librepower.org    | 30+ open-source packages for AIX, installable via DNF

Got IBM i with PASE?

Try the LLM on your own partition

The binary is on GitLab. The models are on HuggingFace. If you have PASE access and a few minutes, you can replicate exactly what we describe here :)

We just made LLM inference a sudo apt install on IBM POWER


LibrePower · Linux on Power · March 2026

vLLM on IBM POWER: LLM inference without a GPU

The first pre-built vLLM package for Linux ppc64le. Built by the community — and it runs on hardware you already own.

March 2026 · 18 min read


If you run IBM POWER systems, you know the drill. The hardware is extraordinary — POWER9, POWER10 and POWER11 deliver unmatched RAS, memory bandwidth, and per-core throughput. But when it comes to the AI/ML ecosystem, you’ve historically had two options: bring your own GPUs (usually x86), or go through Red Hat OpenShift AI. Today, there’s a third option for vLLM on IBM POWER. One that takes about 30 seconds, works on hardware you already own, and uses MMA hardware acceleration automatically.

The package

What we built: vLLM on IBM POWER as a .deb package

vLLM is the most popular open-source LLM serving engine. It powers inference at companies running millions of requests per day. It supports the full OpenAI API: /v1/chat/completions, /v1/completions, /v1/models — streaming, function calling, tool use, the works.

The problem? There were no pre-built packages for ppc64le. Not on PyPI. Not in Ubuntu’s repos. If you wanted vLLM on IBM POWER, you were on your own — figuring out build dependencies, patching C++ extensions, compiling PyTorch from source. IBM’s own community has documented how complex the manual setup is.


So we compiled it. On real IBM POWER hardware. Optimized for the architecture. And packaged it as a .deb that APT can install with full dependency resolution.

$ apt-cache show python3-vllm

Package: python3-vllm
Version: 0.9.2-1
Architecture: ppc64el
Maintainer: LibrePower <packages@librepower.org>
Depends: python3 (>= 3.10), python3-numpy, python3-requests
Homepage: https://linux.librepower.org
Description: OpenAI-compatible LLM inference server for ppc64le

Ubuntu on IBM POWER
Running Ubuntu on IBM POWER is a key enabler for this workflow. SIXE deploys and supports Ubuntu ppc64le environments in partnership with Canonical — the same infrastructure that makes this apt-based installation possible.

Under the hood

The journey: from source to package

Getting vLLM to run on POWER is not a trivial pip install. Here’s what was involved.

PyTorch on POWER

vLLM depends on PyTorch, which is not distributed for ppc64le via PyPI. IBM publishes wheels at wheels.developerfirst.ibm.com — we leverage those as the base. See the full list of IBM-supported developer tools for POWER.

The C extension

vLLM’s performance-critical path is a C++ extension (_C.abi3.so) that handles attention, caching, activation functions, and quantization. This needs to be compiled from source with CMake, linking against PyTorch’s C++ API and oneDNN for optimized GEMM operations.

-- PowerPC detected
-- CPU extension compile flags: -fopenmp -DVLLM_CPU_EXTENSION
   -mvsx -mcpu=power9 -mtune=power9
-- CPU extension source files: csrc/cpu/quant.cpp csrc/cpu/activation.cpp
   csrc/cpu/attention.cpp csrc/cpu/cache.cpp csrc/cpu/utils.cpp
   csrc/cpu/layernorm.cpp csrc/cpu/pos_encoding.cpp
[100%] Linking CXX shared module _C.abi3.so
[100%] Built target _C

The resulting binary includes oneDNN with PPC64 GEMM kernels — the same math library that Intel uses for x86, but targeting POWER’s vector units.

Dependency resolution

The Python ecosystem on ppc64le has gaps. Some packages have pre-built wheels, others need compilation from source, and a few have version conflicts. We resolved all of this so you don’t have to.

In practice

Running LLM inference on IBM POWER: code and output

Here’s what it looks like in practice. First, install the package:

# Add the LibrePower APT repository
curl -fsSL https://linux.librepower.org/install.sh | sudo sh

# Install vLLM for ppc64le
sudo apt update
sudo apt install python3-vllm

# Install PyTorch wheels from IBM
pip3 install torch --extra-index-url \
  https://wheels.developerfirst.ibm.com/ppc64le/linux

Then run inference from Python:

# Python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    dtype="bfloat16",
    device="cpu",
    enforce_eager=True
)

output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(temperature=0, max_tokens=100)
)

print(output[0].outputs[0].text)

But vLLM’s real value is the OpenAI-compatible server mode — this is what makes it useful for production:

# Start the OpenAI-compatible inference server
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --device cpu --dtype bfloat16 --port 8000

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "What is IBM POWER?"}],
    "max_tokens": 100
  }'

LangChain, LlamaIndex, Open WebUI, Continue.dev — anything that can point to an OpenAI endpoint works out of the box. Change base_url to your POWER server and you’re done. This is what makes CPU inference on IBM POWER a realistic path to private, self-hosted LLM deployment.

The numbers

Performance: real numbers on POWER9, POWER10 and POWER11

We benchmarked two POWER generations, POWER9 and POWER10, with Qwen2.5-0.5B-Instruct (494M parameters, BF16). These are not theoretical numbers — they come from running the benchmark tool on actual hardware.

POWER9

$ OMP_NUM_THREADS=12 python3 bench_vllm.py
Run 1: 17.8 tok/s (100 tokens in 5.6s)
Run 2: 16.7 tok/s (100 tokens in 6.0s)
Run 3: 18.5 tok/s (100 tokens in 5.4s)
BENCH P9: threads=12 avg=17.6 tok/s

12 threads is optimal — more threads add cache contention on this memory-bandwidth-bound workload.

POWER10

$ OMP_NUM_THREADS=1 python3 bench_vllm.py
Run 1: 13.9 tok/s (100 tokens in 7.2s)
BENCH P10: threads=1 avg=13.9 tok/s

13.9 tok/s from a single POWER10 core. For context, the POWER9 result uses 12 threads across multiple cores to achieve 17.6 tok/s. The per-core efficiency improvement from POWER9 to POWER10 is dramatic, driven by MMA hardware acceleration. POWER11 shares the same MMA architecture with further enhancements.
System      | Threads | tok/s | Per-core efficiency
POWER10/11  | 1       | 13.9  | 13.9 tok/s/core
POWER9      | 12      | 17.6  | 1.5 tok/s/core
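The per-core figures are simply throughput divided by thread count; a quick arithmetic check, for illustration:

```python
p9_per_core  = 17.6 / 12   # POWER9: 12 threads across multiple cores
p10_per_core = 13.9 / 1    # POWER10: a single core

assert round(p9_per_core, 1) == 1.5
# Roughly a 9x per-core improvement from POWER9 to POWER10 on this workload
assert round(p10_per_core / p9_per_core) == 9
```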

This isn’t competing with an A100 — it’s filling a completely different gap: running LLM inference on IBM POWER infrastructure you already own. No GPU budget, no PCIe slots, no driver headaches. For organizations with existing POWER9, POWER10 or POWER11 servers, this is a zero-capital path to private AI.

We also tested Qwen2.5-7B-Instruct (7 billion parameters) on a single POWER10 core — it loaded and ran at 1.0 tok/s. Not fast enough for interactive use on one core, but proof that larger models work. With more cores, throughput should scale close to linearly. Those running IBM POWER training courses through SIXE often ask about AI workloads on existing hardware — these numbers are the answer.

Inside the machine

What actually happens when a POWER10/11 runs an LLM

If you’ve seen IBM’s presentations about AI on POWER, you’ve probably encountered terms like MMA, Spyre, oneDNN, and OpenShift AI. They’re often shown together on the same slide. But what do they actually mean? And which ones are active when you run python3 -m vllm?

We went deep into the software stack to answer this. The findings surprised us.

A quick glossary (no jargon left behind)

  • LLM (large language model) — Software that generates text — ChatGPT, Llama, Qwen. A mathematical model with billions of numbers that predicts the next word.
  • Inference — Running a trained model to get answers. Training teaches the model; inference uses it. This article is entirely about inference.
  • Token — A word or piece of a word. “17.6 tokens per second” means roughly 17–18 words per second.
  • BF16 (bfloat16) — A way to store numbers using 16 bits instead of 32. Half the memory, nearly the same precision. Think: “good enough quality at half the storage cost.”
  • GEMM (general matrix multiply) — The core math operation in neural networks. Most compute time in LLM inference is spent multiplying large matrices.
  • MMA (matrix-multiply accumulate) — Special-purpose circuitry inside POWER10 and POWER11 designed to accelerate matrix math. Like a dedicated calculator for the one specific operation that dominates LLM inference.
  • OpenBLAS — An open-source math library with optimized GEMM routines. The engine that does the actual matrix multiplication on POWER.
  • oneDNN — Intel’s math library, also compiled into vLLM. Another engine for the same purpose.
  • PyTorch — The framework that runs the neural network. It calls OpenBLAS or oneDNN for the heavy math.
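The BF16 entry above can be made concrete: bfloat16 is literally the top 16 bits of an FP32 value, which is why conversion is cheap in hardware. A small Python sketch:

```python
import struct

def to_bf16_bits(x):
    """Keep only the top 16 bits of the FP32 representation (bfloat16)."""
    fp32_bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return fp32_bits >> 16

def from_bf16_bits(b):
    """Re-expand bfloat16 bits to FP32 (the dropped mantissa bits become zero)."""
    return struct.unpack('>f', struct.pack('>I', b << 16))[0]

pi = from_bf16_bits(to_bf16_bits(3.14159265))
# Half the storage; about 3 significant digits survive
assert abs(pi - 3.14159265) < 0.01
```

Because bfloat16 keeps the full FP32 exponent, it covers the same numeric range as FP32, trading only mantissa precision — exactly the trade-off LLM inference tolerates well.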

How the pieces fit together

When vLLM generates a token, here’s the exact path through the machine:

You type a question

vLLM receives it and breaks it into tokens

PyTorch runs the neural network math

For each layer: multiply large matrices (GEMM)

PyTorch asks OpenBLAS: “multiply these two BF16 matrices”

OpenBLAS runs sbgemm_kernel_power10 ← THIS USES MMA

POWER10/11 hardware executes MMA instructions

Result flows back up, next token is chosen

You see the next word appear

MMA acceleration is already active in our benchmarks. It’s not a future feature or a configuration flag — it works right now, through the path PyTorch → OpenBLAS → MMA hardware. No special setup required.

Proving it: BF16 vs FP32 on POWER10/11

On POWER10 and POWER11, MMA accelerates BF16 math. On POWER9 (no MMA), BF16 is actually slower than FP32 due to software emulation. If MMA is working, BF16 should be faster:

# Raw matrix multiplication benchmark (1024×1024) on POWER10
BF16: 384.4 GFLOPS  (5.6 ms)
FP32: 249.6 GFLOPS  (8.6 ms)
BF16/FP32 ratio: 1.54x

BF16 is 1.54× faster than FP32. MMA is active and delivering measurable hardware acceleration. Our 13.9 tok/s on a single POWER10 core already includes MMA. That’s the real, hardware-accelerated number. The power of POWER10 and POWER11’s AI acceleration capabilities is something we cover in depth in our Linux on IBM POWER Systems training courses.

The oneDNN investigation (and what we learned)

We initially thought there might be hidden performance left on the table.

The vLLM build bundles oneDNN (originally from Intel). Inside, there are two POWER-specific math paths:

  • int8 GEMM: A hand-written kernel by IBM engineers using MMA instructions for quantized models.
  • BF16 GEMM: A passthrough to OpenBLAS — but only when compiled with specific flags.

Our initial build didn’t have those flags. We recompiled with -DDNNL_BLAS_VENDOR=OPENBLAS, confirmed the flags were active, benchmarked again — same performance.

Why? PyTorch was already going directly to OpenBLAS, bypassing oneDNN for the main matrix operations. The optimization was already there; we just didn’t know it.

Practical takeaway: You don’t need to configure anything special. PyTorch on POWER10 and POWER11 with OpenBLAS automatically uses MMA for BF16 inference. Install the package and run.

What about IBM Spyre?

IBM Spyre is a dedicated AI accelerator card for POWER — a completely separate piece of hardware with its own silicon for AI math. Think of it this way:

  • MMA = built-in acceleration inside every POWER10 and POWER11 core (active right now in our benchmarks)
  • Spyre = a separate AI accelerator card you add to the system (promising, but requires specific IBM software stacks)

Our work focuses on what’s available today using the CPU already in your machine, with zero additional hardware investment.

The complete picture

Technology             | What it is (plain English)                       | Active in our build?
POWER10/11 MMA (BF16)  | Built-in matrix accelerator in the CPU           | Yes — PyTorch → OpenBLAS
POWER10/11 MMA (int8)  | Same hardware, for 8-bit quantized models        | Built, not end-to-end yet
IBM Spyre              | Separate AI accelerator card                     | No — different hardware
OpenShift AI           | Full ML platform on Kubernetes                   | No — we’re the lightweight path
oneDNN                 | Math library bundled with vLLM                   | Compiled in, bypassed by PyTorch
OpenBLAS               | Math library with hand-tuned POWER10/11 kernels  | Yes — the real workhorse

Context

The bigger picture: LLM inference on IBM POWER without OpenShift

Red Hat OpenShift AI

Until now, the official IBM/Red Hat play for LLM inference on IBM POWER was OpenShift AI. It supports notebooks, pipelines, model training, serving, and monitoring. As of version 3.0, it runs on ppc64le with CPU-only workloads.

OpenShift AI is the right choice for organizations that already have OpenShift clusters. It comes with RBAC, InstructLab for model fine-tuning, and enterprise support.

But it requires OpenShift. A Kubernetes cluster, a Red Hat subscription, operator management. For many POWER shops — especially those running standalone Linux or mixed AIX/Linux — that’s a significant commitment just to serve a model. Organizations managing these environments often rely on SIXE’s IBM POWER maintenance and support services to keep them running.

What LibrePower adds

We’re not replacing OpenShift AI. We’re complementing it with a lighter path for the many POWER sites that don’t need the full platform.

                         | OpenShift AI                   | LibrePower vLLM
Install                  | OpenShift cluster + operators  | apt install python3-vllm
Infrastructure           | Kubernetes required            | Any Ubuntu/Debian ppc64le
Scope                    | Full ML lifecycle              | Inference serving only
Support                  | Red Hat subscription           | Community (open source)
GPU                      | Supported (x86)                | CPU-only (POWER native)
Time to first inference  | Hours to days                  | Minutes
Cost                     | OpenShift licensing            | Free

IBM builds the highway — world-class hardware, PyTorch wheels, OpenShift AI, InstructLab. LibrePower adds an on-ramp for people who don’t need the full platform. Both are needed. IBM’s roadmap for AI on IBM POWER is moving fast, and community tooling like this fills real gaps in the ecosystem today.

The infrastructure

How the LibrePower package repository works

We built linux.librepower.org following the same pattern as our AIX package repository — infrastructure that already serves 30+ open-source packages to AIX systems worldwide.

linux.librepower.org/
  dists/jammy/
    InRelease          (GPG signed)
    Release
    main/binary-ppc64el/
      Packages
  pool/main/
    python3-vllm_0.9.2-1_ppc64el.deb
  install.sh

CI/CD runs on GitLab: every push regenerates APT metadata and deploys automatically. All packages compiled on real IBM POWER hardware — not cross-compiled, not emulated. The full source is on GitLab under Apache 2.0.

Roadmap

What’s next for vLLM on IBM POWER

  • More models tested — Llama, Mistral, Phi, Granite. Systematic benchmarks across model families.
  • llama.cpp for ppc64le — GGUF quantized models for even lower memory footprint. Already shipping for AIX.
  • Ubuntu 24.04 and Debian 12 support — Extending the package to the latest LTS releases.
  • POWER10/11-optimized variants — Going deeper into MMA tuning. Our current 13.9 tok/s per core is a starting point, not a ceiling.
  • int8 GEMM end-to-end — Completing the MMA path for quantized models, which should improve throughput further.

Want to run AI workloads on your IBM POWER infrastructure?
SIXE helps organizations deploy and operate Linux on IBM POWER — from official IBM Linux on Power training to full infrastructure support. If you’re evaluating LLM inference on existing POWER hardware, talk to us.

Got a ppc64le system?

Try vLLM on IBM POWER now

If you have a system running Ubuntu, it’s three commands. Source is on GitLab if you want to dig in or contribute. IBM POWER training and infrastructure support by SIXE.

# Add the LibrePower repository
curl -fsSL https://linux.librepower.org/install.sh | sudo sh

# Install vLLM for ppc64le
sudo apt update && sudo apt install python3-vllm

# Install PyTorch (IBM wheels)
pip3 install torch --extra-index-url \
  https://wheels.developerfirst.ibm.com/ppc64le/linux

# Run the OpenAI-compatible inference server
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --device cpu --dtype bfloat16 --port 8000

Running Liquid AI’s New Model on IBM AIX (No GPU Required)

Forget the H100 clusters for a moment. At SIXE, we decided to push enterprise hardware to its absolute limits to answer a burning question: Can a 2018-era IBM Power System, running AIX and relying purely on CPU, handle the latest generation of AI models?

We took Liquid AI’s new LFM2.5-1.2B model and ran it on an IBM POWER9 processor. To our knowledge, this is the first time an LFM2.5 model has ever run on AIX in Big-Endian mode.

The Result?

Nearly 27 tokens per second, coherent responses, and under 750 MB of memory usage. No GPU. No NPU. Just raw Power architecture muscle.

But raw speed is only half the story. To prove this isn’t just a benchmark toy, we put LFM2.5 through a “SysAdmin Gauntlet”—real AIX administrative tasks—and compared it against a standard Transformer (TinyLlama 1.1B). The results were shocking.

The “Secret Sauce”: What is LFM2.5?

LFM2.5 is a hybrid architecture designed for extreme efficiency, mixing Convolutional blocks (shortconv) for speed and Attention layers (GQA) for context. It features a massive 128k context window—enough to read thousands of lines of logs without forgetting the beginning.

The Hardware: IBM Power System S924

We used the workhorse of the enterprise world. Below are the specific configurations used for this benchmark:

Spec           | Value
Server         | IBM Power System S924
Processor      | IBM POWER9 @ 2.75 GHz
Configuration  | Enterprise Multi-Core (SMT-8 capable)
OS             | AIX 7.3 TL4
Architecture   | Big-Endian

Performance: Finding the Sweet Spot

We found that throwing every core at the model actually hurts performance due to synchronization overhead. The verdict was clear: Using just 8 cores in SMT-2 mode (16 threads) gave us 26.7 tokens per second.


Real-World Test: The SysAdmin Gauntlet

Numbers are nice, but can it actually work? We ran the exact same prompts through both LFM2.5-1.2B and TinyLlama 1.1B (a pure transformer of similar size). Here are the raw outputs.

Round 1: The Cryptic Error (errpt)

We fed the models a raw AIX error log regarding a power supply failure.

The Input:

$ errpt -a -l BFE4C025

Description: Power/Cooling subsystem Unrecovered Error,
bypassed with loss of redundancy.
Possible FRUs:
Priority: H
FRU: PWRSPLY
Location: U78D2.001.WZS00P4

  • ❌ TinyLlama 1.1B:

    “7. What is the location of the FRU that is affected by this error? The FRU that is affected by this error is PWRSPLY. 8. What is the description of the FRU that is affected by this error? The FRU that is affected by this error is PWRSPLY…”

    Verdict: Fails. It got stuck in an infinite loop repeating “PWRSPLY”.

  • ✅ LFM2.5 1.2B:

    “What is the likely cause? The error usually points to a problem with the power or cooling hardware (like a fan or power supply) that couldn’t be fixed automatically. It might be a hardware failure or a misconfiguration. What action should I take? First, check the hardware — look at the fans (especially the power fans)…”

    Verdict: Pass. It identified the component and gave actionable advice to check the fans.

Round 2: The Security Audit (last)

We provided a login history log containing a hidden crash event.

The Input:

$ last -20

root pts/0 10.10.10.1 Jan 28 11:14 – 18:38 (07:23)
reboot ~ Jan 28 11:12
root pts/0 10.10.10.1 Jan 27 22:09 – System halted abnormally. (13:03)

  • ❌ TinyLlama 1.1B:

    (generated 1 token, then stopped)

    Verdict: Fails. Absolute silence.

  • ✅ LFM2.5 1.2B:

    “Suspicious patterns: The system was halted abnormally on Jan 27. This could indicate a security incident or a misconfiguration. Security recommendations: Investigate the cause of the system halt on Jan 27. Review the login history for any unauthorized access attempts.”

    Verdict: Pass. It immediately spotted the abnormal halt on Jan 27 and recommended an investigation.

Round 3: The Dangerous Advice (/etc/passwd)

We asked the models to audit a standard password file. The results here were shocking.

The Input:

$ cat /etc/passwd

root:!:0:0::/:/usr/bin/bash
daemon:!:1:1::/etc:
bin:!:2:2::/bin:

  • ❌ TinyLlama 1.1B:

    “1. root:!:0:0::/:/usr/bin/bash — This user is not required for system maintenance and should be removed. 2. daemon:!:1:1::/etc: — This user is not required for system maintenance and should be removed.”

    Verdict: CATASTROPHIC FAIL. It claimed the “root” user was not required and recommended deleting it. Following this advice would destroy the server.

  • ✅ LFM2.5 1.2B:

    “Unauthorized or suspicious accounts: You should be very concerned about the presence of the following accounts: guest with a high UID (100), nobody with a high UID (4294967294)… These accounts are either misconfigured, potentially malicious, or not needed.”

    Verdict: Pass. It correctly identified actual potential risks like “guest” and “nobody” accounts with high UIDs.

Round 4: Service Hardening (lssrc -a)

We asked the models to review running services and recommend hardening steps.

The Input:

$ lssrc -a

Subsystem Group Status
sendmail mail active
portmap portmap active
inetd tcpip active
snmpd tcpip active

  • ❌ TinyLlama 1.1B:

    (generated 0 tokens)

    Verdict: Fails. Silence again.

  • ✅ LFM2.5 1.2B:

    “1. sendmail Risk: Running mail daemon can be a target for attacks. Recommendation: Disable sendmail. stopsrc -s sendmail 2. portmap Risk: Running portmap daemon can be abused… stopsrc -s portmap”

    Verdict: Pass. It flagged risky services like sendmail and portmap, and provided the correct AIX command (stopsrc) to disable them.

Why This Matters for IBM Power Users

This benchmark proves that IBM Power Systems are capable AI inference engines for critical, on-premise tasks:

  • Data Sovereignty: Analyze sensitive errpt logs, financial data, or user audits locally. No data leaves your server.
  • Legacy Modernization: Use local LLMs to help understand and document legacy COBOL or C code residing on the server.
  • Efficiency: You don’t need a GPU cluster. You likely already own the hardware capable of doing this.

Try It Yourself

We believe in open source. We have released the AIX port and the converted Big-Endian models.

Code: gitlab.com/librepower/llama-aix
Models: huggingface.co/librepowerai

user@aix:~$ # Quick start on AIX
user@aix:~$ git clone https://gitlab.com/librepower/llama-aix.git
user@aix:~$ cd llama-aix
user@aix:~/llama-aix$ ./scripts/build_aix_73.sh

user@aix:~$ # Optimize threading for the "Sweet Spot"
user@aix:~$ smtctl -t 2 -w now

user@aix:~$ # Have fun!

Porting MariaDB to IBM AIX (Part 2): how AIX matches Linux

From “AIX is Slow” to “AIX Matches Linux” (with the right tools and code)

In Part 1, I wrestled with CMake, implemented a thread pool from scratch, and shipped a stable MariaDB 11.8.5 for AIX. The server survived 1,000 concurrent connections and 11 million queries with zero memory leaks.

Then I ran a vector search benchmark.

AIX: 42 queries per second.
Linux (same hardware): 971 queries per second.

Twenty-three times slower. On identical IBM Power S924 hardware. Same MariaDB version. Same dataset.

This is the story of how we discovered there was no performance gap at all — just configuration mistakes and a suboptimal compiler.

Chapter 1: The Sinking Feeling

There’s a particular kind of despair that comes from seeing a 23x performance gap on identical hardware. It’s the “maybe I should have become a florist” kind of despair.

Let me set the scene: both machines are LPARs running on IBM Power S924 servers with POWER9 processors at 2750 MHz. Same MariaDB 11.8.5. Same test dataset — 100,000 vectors with 768 dimensions, using MariaDB’s MHNSW (Hierarchical Navigable Small World) index for vector search.

The benchmark was simple: find the 10 nearest neighbors to a query vector. The kind of operation that powers every AI-enhanced search feature you’ve ever used.

Linux did it in about 1 millisecond. AIX took 24 milliseconds.

My first instinct was denial. “The benchmark must be wrong.” It wasn’t. “Maybe the index is corrupted.” It wasn’t. “Perhaps the network is slow.” It was a local socket connection.

Time to dig in.

Chapter 2: The First 65x — Configuration Matters

The Cache That Forgot Everything

The first clue came from MariaDB’s profiler. Every single query was taking the same amount of time, whether it was the first or the hundredth. That’s not how caches work.

I checked MariaDB’s MHNSW configuration:

SHOW VARIABLES LIKE 'mhnsw%';
mhnsw_max_cache_size: 16777216

16 MB. Our vector graph needs about 300 MB to hold the HNSW structure in memory.

Here’s the kicker: when the cache fills up, MariaDB doesn’t evict old entries (no LRU). It throws everything away and starts fresh. Every. Single. Query.

Imagine a library where, when the shelves get full, the librarian burns all the books and orders new copies. For every patron.

Fix: mhnsw_max_cache_size = 4GB in the server configuration.

Result: 42 QPS → 112 QPS. 2.7x improvement from one config line.
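
As a config fragment, the fix is one line. A sketch (the variable accepts size suffixes in the server config; it can also be raised at runtime with SET GLOBAL where the build exposes it as a dynamic variable):

```ini
[mysqld]
# Large enough to hold the ~300 MB HNSW graph with headroom
mhnsw_max_cache_size = 4G
```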

The Page Size Problem

AIX defaults to 4 KB memory pages. Linux on POWER uses 64 KB pages.

For MHNSW’s access pattern — pointer-chasing across a 300 MB graph — this matters enormously. With 4 KB pages, you need 16x more TLB (Translation Lookaside Buffer) entries to map the same amount of memory. TLB misses are expensive.

Think of it like navigating a city. With 4 KB pages, you need directions for every individual building. With 64 KB pages, you get directions by neighborhood. Much faster when you’re constantly jumping around.
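
The 16x figure falls straight out of the arithmetic; a quick sanity check in plain shell:

```shell
# TLB mappings needed to cover the ~300 MB HNSW graph
graph_kb=$((300 * 1024))              # graph size in KB
echo "4K pages:  $((graph_kb / 4))"   # 76800 mappings
echo "64K pages: $((graph_kb / 64))"  # 4800 mappings, 16x fewer
```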

Fix: Wrapper script that sets LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K

Result: 112 QPS → 208 QPS sequential, and 2,721 QPS with 12 parallel workers.
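
A minimal wrapper sketch for the loader environment (the server binary path is an assumption; adjust for your install):

```shell
#!/bin/sh
# Request 64K pages for data, text, stack, and shared memory segments,
# then exec the real server binary (path below is hypothetical).
LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K
export LDR_CNTRL
exec /opt/mariadb/sbin/mariadbd "$@"
```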

The Scoreboard After Phase 1

Configuration    Sequential QPS    With 12 Workers
Baseline         42                ~42
+ 4GB cache      112
+ 64K pages      208               2,721

65x improvement from two configuration changes. No code modifications.

But we were still 6x slower than Linux per-core. The investigation continued.

Chapter 3: The CPU vs Memory Stall Mystery

With configuration fixed, I pulled out the profiling tools. MariaDB has a built-in profiler that breaks down query time by phase.

AIX:

Sending data: 4.70ms total
  - CPU_user: 1.41ms
  - CPU_system: ~0ms
  - Stalls: 3.29ms (70% of total!)

Linux:

Sending data: 0.81ms total
  - CPU_user: 0.80ms
  - Stalls: ~0.01ms (1% of total)

The CPU execution time was 1.8x slower on AIX — explainable by compiler differences. But the memory stalls were 329x worse.

The Root Cause: Hypervisor Cache Invalidation

Here’s something that took me two days to figure out: in a shared LPAR (Logical Partition), the POWER hypervisor periodically preempts virtual processors to give time to other partitions. When it does this, it may invalidate L2/L3 cache lines.

MHNSW’s graph traversal is pointer-chasing across 300 MB of memory — literally the worst-case scenario for cache invalidation. You’re jumping from node to node, each in a different part of memory, and the hypervisor is periodically flushing your cache.

It’s like trying to read a book while someone keeps closing it and putting it back on the shelf.

The Linux system had dedicated processors. The AIX system was running shared. Not apples to apples.

But before I could test dedicated processors, I needed to fix the compiler problem.

Chapter 4: The Compiler Odyssey

Everything I Tried With GCC (And Why It Failed)

Attempt                          Result             Why
-flto (Link Time Optimization)   Impossible         GCC LTO requires ELF format; AIX uses XCOFF
-fprofile-generate (PGO)         Build fails        TOC-relative relocation assembler errors
-ffast-math                      Breaks everything  IEEE float violations corrupt bloom filter hashing
-funroll-loops                   Slower             Instruction cache bloat — POWER9 doesn’t like it
-finline-functions               Slower             Same I-cache problem

The AIX Toolbox GCC is built without LTO support. It’s not a flag you forgot — it’s architecturally impossible because GCC’s LTO implementation requires ELF, and AIX uses XCOFF.

Ubuntu’s MariaDB packages use -flto=auto. That optimization simply doesn’t exist for AIX with GCC.

IBM Open XL: The Plot Twist

At this point, I’d spent three days trying to make GCC faster. Time to try something different.

IBM Open XL C/C++ 17.1.3 is IBM’s modern compiler, based on LLVM/Clang. It generates significantly better code for POWER9 than GCC.

Building MariaDB with Open XL required solving five different problems:

  1. Missing HTM header: Open XL doesn’t have GCC’s htmxlintrin.h. I created a stub.
  2. 32-bit AR by default: AIX tools default to 32-bit. Set OBJECT_MODE=64.
  3. Incompatible LLVM AR: Open XL’s AR couldn’t handle XCOFF. Used system /usr/bin/ar.
  4. OpenSSL conflicts: Used -DWITH_SSL=system to avoid bundled wolfSSL issues.
  5. Missing library paths: Explicit -L/opt/freeware/lib for the linker.

Then I ran the benchmark:

Compiler          30 Queries    Per-Query
GCC 13.3.0        0.190s        6.3ms
Open XL 17.1.3    0.063s        2.1ms

Three times faster. Same source code. Same optimization flags (-O3 -mcpu=power9).

And here’s the bonus: GCC’s benchmark variance was 10-40% between runs. Open XL’s variance was under 2%. Virtually no jitter.
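
The per-query column is just each run time divided by the 30 queries; a quick check of the numbers:

```shell
# Derive per-query latency and speedup from the 30-query run times above
awk 'BEGIN {
  printf "GCC per-query:     %.1f ms\n", 0.190 / 30 * 1000  # 6.3 ms
  printf "Open XL per-query: %.1f ms\n", 0.063 / 30 * 1000  # 2.1 ms
  printf "Speedup:           %.1fx\n",   0.190 / 0.063      # 3.0x
}'
```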

Why Such a Huge Difference?

Open XL (being LLVM-based) has:

  • Better instruction scheduling for POWER9’s out-of-order execution
  • Superior register allocation
  • More aggressive optimization passes

GCC’s POWER/XCOFF backend simply isn’t as mature. The AIX Toolbox GCC is functional, but it’s not optimized for performance-critical workloads.

Chapter 5: The LTO and PGO Dead Ends

Hope springs eternal. Maybe Open XL’s LTO and PGO would work?

LTO: The Irony

Open XL supports -flto=full on XCOFF. It actually builds! But…

Result: 27% slower than non-LTO Open XL.

Why? AIX shared libraries require an explicit export list (exports.exp). With LTO, CMake’s script saw ~27,000 symbols to export.

LTO’s main benefit is internalizing functions — marking them as local so they can be optimized away or inlined. When you’re forced to export 27,000 symbols, none of them can be internalized. The LTO overhead (larger intermediate files, slower link) remains, but the benefit disappears.

It’s like paying for a gym membership and then being told you can’t use any of the equipment.
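
For context, a sketch of how the export list enters the AIX link step (the -bE: flag is AIX ld's export-file option; file and library names here are illustrative):

```shell
# Every symbol listed in exports.exp stays externally visible,
# so LTO cannot internalize (and thus cannot fully optimize) any of them.
gcc -shared -o libserver.so server.o -Wl,-bE:exports.exp
```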

PGO: The Profiles That Never Were

Profile-Guided Optimization sounded promising:

  1. Build with -fprofile-generate
  2. Run training workload
  3. Rebuild with -fprofile-use
  4. Enjoy faster code

Step 1 worked. Step 2… the profiles never appeared.

I manually linked the LLVM profiling runtime into the shared library. Still no profiles.

The root cause: LLVM’s profiling runtime uses atexit() or __attribute__((destructor)) to write profiles on exit. On AIX with XCOFF, shared library destructor semantics are different from ELF. The handler simply isn’t called reliably for complex multi-library setups like MariaDB.

Simple test cases work. Real applications don’t.

Chapter 6: The LPAR Revelation

Now I had a fast compiler. Time to test dedicated processors and eliminate the hypervisor cache invalidation issue.

The Test Matrix

LPAR Config            GCC       Open XL
12 shared vCPUs        0.190s    0.063s
12 dedicated capped    0.205s    0.082s
21 dedicated capped    0.320s    0.067s

Wait. Shared is faster than dedicated?

The WoF Factor

POWER9 has a feature called Workload Optimized Frequency (WoF). In shared mode with low utilization, a single core can boost to ~3.8 GHz. Dedicated capped processors are locked at 2750 MHz.

For a single-threaded query, shared mode gets 38% more clock speed. That beats the cache invalidation penalty for this workload.

Think of it like choosing between a sports car on a highway with occasional traffic (shared) versus a truck with a reserved lane but a speed limit (dedicated capped).

The PowerVM Donating Mode Disaster

There’s a third option: dedicated processors in “Donating” mode, which donates idle cycles back to the shared pool.

Mode        GCC       Open XL
Capped      0.205s    0.082s
Donating    0.325s    0.085s

60% regression with GCC.

Every time a query bursts, there’s latency reclaiming the donated cycles. For bursty, single-threaded workloads like database queries, this is devastating.

Recommendation: Never use Donating mode for database workloads.

The 21-Core Sweet Spot

With 21 dedicated cores (versus Linux’s 24), Open XL achieved 0.067s — nearly matching the 0.063s from shared mode. The extra L3 cache from more cores compensates for the lack of WoF frequency boost.

Chapter 7: The Final Scoreboard (Plot Twist)

Fresh benchmarks on identical POWER9 hardware, January 2026:

Platform         Cores           30 Queries
Linux            24 dedicated    0.057s
AIX + Open XL    12 shared       0.063s
AIX + Open XL    21 dedicated    0.067s
AIX + GCC        12 shared       0.190s
AIX + GCC        21 dedicated    0.320s

Wait. The AIX system has 21 cores vs Linux’s 24. That’s 12.5% fewer cores, which means 12.5% less L3 cache.

The measured “gap”? 10-18%.

That’s not a performance gap. That’s a hardware difference.

With IBM Open XL, AIX delivers identical per-core performance to Linux. The 23x gap we started with? It was never about AIX being slow. It was:

  1. A misconfigured cache (16MB instead of 4GB)
  2. Wrong page sizes (4KB instead of 64KB)
  3. The wrong compiler (GCC instead of Open XL)
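
The arithmetic behind "that's a hardware difference", using the run times from the scoreboard:

```shell
# Compare the core deficit with the measured performance gaps
awk 'BEGIN {
  printf "Core deficit:       %.1f%%\n", (24 - 21) / 24 * 100       # 12.5%
  printf "Gap, 12 shared:     %.1f%%\n", (0.063 / 0.057 - 1) * 100  # ~10.5%
  printf "Gap, 21 dedicated:  %.1f%%\n", (0.067 / 0.057 - 1) * 100  # ~17.5%
}'
```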

The “AIX is slow” myth is dead.

The Complete Failure Museum

Science isn’t just about what works — it’s about documenting what doesn’t. Here’s our wall of “nice try, but no”:

What We Tried                  Result           Notes
mhnsw_max_cache_size = 4GB     5x faster        Eliminates cache thrashing
LDR_CNTRL 64K pages            ~40% faster      Reduces TLB misses
MAP_ANON_64K mmap patch        ~8% faster       Minor TLB improvement
IBM Open XL 17.1.3             3x faster        Better POWER9 codegen
Shared LPAR (vs dedicated)     ~25% faster      WoF frequency boost
Open XL + LTO                  27% slower       AIX exports conflict
Open XL + PGO                  Doesn’t work     Profiles not written
GCC LTO                        Impossible       XCOFF not supported
GCC PGO                        Build fails      TOC relocation errors
-ffast-math                    Breaks MHNSW     Float corruption
-funroll-loops                 Worse            I-cache bloat
POWER VSX bloom filter         41% slower       No 64-bit vec multiply on P9
Software prefetch              No effect        Hypervisor evicts prefetched data
DSCR tuning                    Blocked          Hypervisor controls DSCR in shared LPAR
Donating mode                  60% regression   Never use for databases

The VSX result is particularly interesting: we implemented a SIMD bloom filter using POWER’s vector extensions. It was 41% slower than scalar. POWER9 has no 64-bit vector multiply — you need vec_extract → scalar multiply → vec_insert for each lane, which is slower than letting the Out-of-Order engine handle a scalar loop.

What I Learned

1. Defaults Matter More Than You Think

A 16 MB cache default turned sub-millisecond queries into 24ms queries. That’s a 24x penalty from one misconfigured parameter.

When you’re porting software, question every default. What works on Linux might not work on your platform.

2. The “AIX is Slow” Myth Was Always a Toolchain Issue

With GCC, we were 3-4x slower than Linux. With Open XL, we match Linux per-core.

The platform was never slow. The default toolchain just wasn’t optimized for performance-critical workloads. Choose the right compiler.

3. Virtualization Has Hidden Trade-offs

Shared LPAR can be faster than dedicated for single-threaded workloads (WoF frequency boost). Dedicated is better for sustained multi-threaded throughput. Donating mode is a trap.

Know your workload. Choose your LPAR configuration accordingly.

4. Not Every Optimization Ports

LTO, PGO, and SIMD vectorization all failed on AIX for various reasons. The techniques that make Linux fast don’t always translate.

Sometimes the “obvious” optimization is the wrong choice. Measure everything.

5. Sometimes There’s No Gap At All

We spent days investigating a “performance gap” that turned out to be:

  • Configuration mistakes
  • Wrong compiler
  • Fewer cores on the test system

The lesson: verify your baselines. Make sure you’re comparing apples to apples before assuming there’s a problem to solve.

Recommendations

For AIX MariaDB Users

  1. Use the Open XL build (Release 3)
  2. Set mhnsw_max_cache_size to at least 4GB for vector search
  3. Keep shared LPAR for single-query latency
  4. Never use Donating mode for databases
  5. Use 64K pages via the LDR_CNTRL wrapper

For Upstream MariaDB

  1. Increase mhnsw_max_cache_size default — 16MB is far too small
  2. Implement LRU eviction — discarding the entire cache on overflow is brutal
  3. Don’t add POWER VSX bloom filter — scalar is faster on POWER9

What’s Next

The RPMs are published at aix.librepower.org. Release 2 includes the configuration fixes. Release 3 with Open XL build is also available.

Immediate priorities:

  • Commercial Open XL license: The evaluation expires soon; we need to confirm with IBM that our use of Open XL for this purpose is properly licensed.
  • Native AIO implementation: AIX has POSIX AIO and Windows-compatible IOCP. Time to write the InnoDB backend.
  • Upstream MHNSW feedback: The default mhnsw_max_cache_size of 16MB is too small for real workloads; we’ll suggest a larger default.

For organizations already running mission-critical workloads on AIX — and there are many, from banks to airlines to healthcare systems — the option to also run modern, high-performance MariaDB opens new possibilities.

AIX matches Linux. The myth is dead. And MariaDB on AIX is ready for production.

TL;DR

  • Started with 23x performance gap (42 QPS vs 971 QPS)
  • Fixed cache config: 5x improvement
  • Fixed page size: ~40% more
  • Switched to IBM Open XL: 3x improvement over GCC
  • Used shared LPAR: ~25% faster than dedicated (WoF boost)
  • Final result: NO GAP — the remaining 10-18% difference tracks the hardware difference (21 vs 24 cores, i.e. 12.5% fewer)
  • AIX matches Linux per-core performance with Open XL
  • Open XL LTO: doesn’t help (27% slower)
  • Open XL PGO: doesn’t work (AIX XCOFF issue)
  • POWER VSX SIMD: 41% slower than scalar (no 64-bit vec multiply)
  • Donating mode: 60% regression — never use for databases
  • “AIX is slow for Open Source DBs” was always a toolchain myth

Questions? Ideas? Running MariaDB on AIX and want to share your experience?

This work is part of LibrePower – Unlocking IBM Power Systems through open source. Unmatched RAS. Superior TCO. Minimal footprint 🌍

LibrePower AIX project repository: gitlab.com/librepower/aix

Porting MariaDB to IBM AIX (Part 1): 3 Weeks of Engineering Pain

Bringing MariaDB to AIX, the Platform That Powers the World’s Most Critical Systems

There are decisions in life you make knowing full well they’ll cause you some pain. Getting married. Having children. Running a marathon. Porting MariaDB 11.8 to IBM AIX.

This (Part 1) is the story of the last one — and why I’d do it again in a heartbeat.

Chapter 1: “How Hard Can It Be?”

It all started with an innocent question during a team meeting: “Why don’t we have MariaDB on our AIX systems?”

Here’s the thing about AIX that people who’ve never worked with it don’t understand: AIX doesn’t mess around. When banks need five-nines uptime for their core banking systems, they run AIX. When airlines need reservation systems that cannot fail, they run AIX. When Oracle, Informix, or DB2 need to deliver absolutely brutal performance for mission-critical OLTP workloads, they run on AIX.

AIX isn’t trendy. AIX doesn’t have a cool mascot. AIX won’t be the subject of breathless tech blog posts about “disruption.” But when things absolutely, positively cannot fail — AIX is there, quietly doing its job while everyone else is busy rebooting their containers.

So why doesn’t MariaDB officially support AIX? Simple economics: the open source community has centered on Linux, and porting requires platform-specific expertise. MariaDB officially supports Linux, Windows, FreeBSD, macOS, and Solaris. AIX isn’t on the list — not because it’s a bad platform, but because no one had done the work yet.

At LibrePower, that’s exactly what we do.

My first mistake was saying out loud: “It’s probably just a matter of compiling it and adjusting a few things.”

Lesson #1: When someone says “just compile it” about software on AIX, they’re about to learn a lot about systems programming.

Chapter 2: CMake and the Three Unexpected Guests

Day one of compilation was… educational. CMake on AIX is like playing cards with someone who has a very different understanding of the rules — and expects you to figure them out yourself.

The Ghost Function Bug

AIX has an interesting characteristic: it declares functions in headers for compatibility even when those functions don’t actually exist at runtime. It’s like your GPS saying “turn right in 200 meters” but the street is a brick wall.

CMake does a CHECK_C_SOURCE_COMPILES to test if pthread_threadid_np() exists. The code compiles. CMake says “great, we have it!” The binary starts and… BOOM. Symbol not found.

Turns out pthread_threadid_np() is macOS-only. AIX declares it in headers because… well, I’m still not entirely sure. Maybe for some POSIX compatibility reason that made sense decades ago? Whatever the reason, GCC compiles it happily, and the linker doesn’t complain until runtime.

Same story with getthrid(), which is OpenBSD-specific.

The fix:

IF(NOT CMAKE_SYSTEM_NAME MATCHES "AIX")
  CHECK_C_SOURCE_COMPILES("..." HAVE_PTHREAD_THREADID_NP)
ELSE()
  SET(HAVE_PTHREAD_THREADID_NP 0)  # Trust but verify... okay, just verify
ENDIF()

poll.h: Hide and Seek

AIX has <sys/poll.h>. It’s right there. You can cat it. But CMake doesn’t detect it.

After three hours debugging a “POLLIN undeclared” error in viosocket.c, I discovered the solution was simply forcing the define:

cmake ... -DHAVE_SYS_POLL_H=1

Three hours. For one flag.

(To be fair, this is a CMake platform detection issue, not an AIX issue. CMake’s checks assume Linux-style header layouts.)

The Cursed Plugins

At 98% compilation — 98%! — the wsrep_info plugin exploded with undefined symbols. Because it depends on Galera. Which we’re not using. But CMake compiles it anyway.

Also S3 (requires Aria symbols), Mroonga (requires Groonga), and RocksDB (deeply tied to Linux-specific optimizations).

Final CMake configuration:

-DPLUGIN_MROONGA=NO -DPLUGIN_ROCKSDB=NO -DPLUGIN_SPIDER=NO 
-DPLUGIN_TOKUDB=NO -DPLUGIN_OQGRAPH=NO -DPLUGIN_S3=NO -DPLUGIN_WSREP_INFO=NO

It looks like surgical amputation, but it’s actually just trimming the fat. These plugins are edge cases that few deployments need.

Chapter 3: Thread Pool, or How I Learned to Stop Worrying and Love the Mutex

This is where things got interesting. And by “interesting” I mean “I nearly gave myself a permanent twitch.”

MariaDB has two connection handling modes:

  • one-thread-per-connection: One thread per client. Simple. Scales like a car going uphill.
  • pool-of-threads: A fixed pool of threads handles all connections. Elegant. Efficient. And not available on AIX.

Why? Because the thread pool requires platform-specific I/O multiplexing APIs:

Platform         API            Status
Linux            epoll          Supported
FreeBSD/macOS    kqueue         Supported
Solaris          event ports    Supported
Windows          IOCP           Supported
AIX              pollset        Not supported (until now)

So… how hard can implementing pollset support be?

(Editor’s note: At this point the author required a 20-minute break and a beverage)

The ONESHOT Problem

Linux epoll has a wonderful flag called EPOLLONESHOT. It guarantees that a file descriptor fires events only once until you explicitly re-arm it. This prevents two threads from processing the same connection simultaneously.

AIX pollset is level-triggered. Only level-triggered. No options. If data is available, it reports it. Again and again and again. Like a helpful colleague who keeps reminding you about that email you haven’t answered yet.

Eleven Versions of Increasing Wisdom

What followed were eleven iterations of code, each more elaborate than the last, trying to simulate ONESHOT behavior:

v1-v5 (The Age of Innocence)

I tried modifying event flags with PS_MOD. “If I change the event to 0, it’ll stop firing,” I thought. Spoiler: it didn’t stop firing.

v6-v7 (The State Machine Era)

“I know! I’ll maintain internal state and filter duplicate events.” The problem: there’s a time window between the kernel giving you the event and you updating your state. In that window, another thread can receive the same event.

v8-v9 (The Denial Phase)

“I’ll set the state to PENDING before processing.” It worked… sort of… until it didn’t.

v10 (Hope)

Finally found the solution: PS_DELETE + PS_ADD. When you receive an event, immediately delete the fd from the pollset. When you’re ready for more data, add it back.

// On receiving events: REMOVE
for (i = 0; i < ret; i++) {
    pctl.cmd = PS_DELETE;
    pctl.fd = native_events[i].fd;
    pollset_ctl(pollfd, &pctl, 1);
}

// When ready: ADD
pce.command = PS_ADD;
pollset_ctl_ext(pollfd, &pce, 1);

It worked! With -O2.

With -O3: segfault.

The Dark Night of the Soul (The -O3 Bug)

Picture my face. I have code working perfectly with -O2. I enable -O3 for production benchmarks and the server crashes with “Got packets out of order” or a segfault in CONNECT::create_thd().

I spent two days thinking it was a compiler bug. GCC 13.3.0 on AIX. I blamed the compiler. I blamed the linker. I blamed everything except my own code.

The problem was subtler: MariaDB has two concurrent code paths calling io_poll_wait on the same pollset:

  • The listener blocks with timeout=-1
  • The worker calls with timeout=0 for non-blocking checks

With -O2, the timing was such that these rarely collided. With -O3, the code was faster, collisions happened more often, and boom — race condition.

v11 (Enlightenment)

The fix was a dedicated mutex protecting both pollset_poll and all pollset_ctl operations:

static pthread_mutex_t pollset_mutex = PTHREAD_MUTEX_INITIALIZER;

int io_poll_wait(...) {
    pthread_mutex_lock(&pollset_mutex);
    ret = pollset_poll(pollfd, native_events, max_events, timeout);
    // ... process and delete events ...
    pthread_mutex_unlock(&pollset_mutex);
    return ret;
}

Yes, it serializes pollset access. Yes, that’s theoretically slower. But you know what’s even slower? A server that crashes.

The final v11 code passed 72 hours of stress testing with 1,000 concurrent connections. Zero crashes. Zero memory leaks. Zero “packets out of order.”

Chapter 4: The -blibpath Thing (Actually a Feature)

One genuine AIX characteristic: you need to explicitly specify the library path at link time with -Wl,-blibpath:/your/path. If you don’t, the binary won’t find libstdc++ even if it’s in the same directory.

At first this seems annoying. Then you realize: AIX prefers explicit, deterministic paths over implicit searches. In production environments where “it worked on my machine” isn’t acceptable, that’s a feature, not a bug.
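
A sketch of what that looks like at link time (the -blibpath flag is AIX ld's explicit runtime search path; object names and paths here are illustrative):

```shell
# Embed an explicit, deterministic runtime library search path in the binary
gcc -o mariadbd ... -Wl,-blibpath:/opt/freeware/lib:/usr/lib:/lib
```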

Chapter 5: Stability — The Numbers That Matter

After all this work, where do we actually stand?

The RPM is published at aix.librepower.org and deployed on an IBM POWER9 system (12 cores, SMT-8). MariaDB 11.8.5 runs on AIX 7.3 with thread pool enabled. The server passed a brutal QA suite:

Test                          Result
100 concurrent connections    ✅
500 concurrent connections    ✅
1,000 connections             ✅
30 minutes sustained load     ✅
11+ million queries           ✅
Memory leaks                  ZERO

1,648,482,400 bytes of memory — constant across 30 minutes. Not a single byte of drift. The server ran for 39 minutes under continuous load and performed a clean shutdown.

It works. It’s stable. It’s production-ready for functionality.

Thread Pool Impact

The thread pool work delivered massive gains for concurrent workloads:

Configuration                             Mixed, 100 clients   vs. Baseline
Original -O2, one-thread-per-connection   11.34s               (baseline)
-O3 + pool-of-threads v11                 1.96s                83% faster

For high-concurrency OLTP workloads, this is the difference between “struggling” and “flying.”

What I Learned (So Far)

  1. CMake assumes Linux. On non-Linux systems, manually verify that feature detection is correct. False positives will bite you at runtime.
  2. Level-triggered I/O requires discipline. EPOLLONESHOT exists for a reason. If your system doesn’t have it, prepare to implement your own serialization.
  3. -O3 exposes latent bugs. If your code “works with -O2 but not -O3,” you have a race condition. The compiler is doing its job; the bug is yours.
  4. Mutexes are your friend. Yes, they have overhead. But you know what has more overhead? Debugging race conditions at 3 AM.
  5. AIX rewards deep understanding. It’s a system that doesn’t forgive shortcuts, but once you understand its conventions, it’s predictable and robust. There’s a reason banks still run it — and will continue to for the foreseeable future.
  6. The ecosystem matters. Projects like linux-compat from LibrePower make modern development viable on AIX. Contributing to that ecosystem benefits everyone.

What’s Next: The Performance Question

The server is stable. The thread pool works. But there’s a question hanging in the air that I haven’t answered yet:

How fast is it compared to Linux?

I ran a vector search benchmark — the kind of operation that powers AI-enhanced search features. MariaDB’s MHNSW (Hierarchical Navigable Small World) index, 100,000 vectors, 768 dimensions.

Linux on identical POWER9 hardware: 971 queries per second.

AIX with our new build: 42 queries per second.

Twenty-three times slower.

My heart sank. Three weeks of work, and we’re 23x slower than Linux? On identical hardware?

But here’s the thing about engineering: when numbers don’t make sense, there’s always a reason. And sometimes that reason turns out to be surprisingly good news.

In Part 2, I’ll cover:

  • How we discovered the 23x gap was mostly a configuration mistake
  • The compiler that changed everything
  • Why “AIX is slow” turned out to be a myth
  • The complete “Failure Museum” of optimizations that didn’t work

The RPMs are published at aix.librepower.org. The GCC build is stable and production-ready for functionality.

But the performance story? That’s where things get really interesting.

Part 2 coming soon.

TL;DR

  • MariaDB 11.8.5 now runs on AIX 7.3 with thread pool enabled
  • First-ever thread pool implementation for AIX using pollset (11 iterations to get ONESHOT simulation right)
  • Server is stable: 1,000 connections, 11M+ queries, zero memory leaks
  • Thread pool delivers 83% improvement for concurrent workloads
  • Initial vector search benchmark shows 23x gap vs Linux — but is that the whole story?
  • RPMs published at aix.librepower.org
  • Part 2 coming soon: The performance investigation

Questions? Ideas? Want to contribute to the AIX open source ecosystem?

This work is part of LibrePower – Unlocking IBM Power Systems through open source. Unmatched RAS. Superior TCO. Minimal footprint 🌍

LibrePower AIX project repository: gitlab.com/librepower/aix

🦙 LLMs on AIX: technical experimentation beyond the GPU hype

At LibrePower, we have published Llama-AIX: a proof-of-concept for running lightweight LLM inference directly on AIX, using only CPU and memory—no GPUs involved.

It’s worth clarifying from the start: this is technical fun and experimentation. It is not a product, not a commercial promise, and not an alternative to large GPU-accelerated AI platforms.

That said, there is a sound technical foundation behind this experiment.

Not all LLM use cases are GPU-bound.

In many common business scenarios in Power environments:

  • RAG (Retrieval Augmented Generation)
  • Questions about internal documentation
  • On-prem technical assistants
  • Semantic search over your own knowledge base
  • Text analytics that depends heavily on latency and proximity to the data

the bottleneck is not always raw compute, but rather:

  • CPU throughput
  • Memory bandwidth
  • Data-access latency
  • Data locality

In these cases, small, well-bounded inference workloads can reasonably run without GPUs, especially when the model is not the center of the system but just another component.

⚙️ CPU, MMA and low-power accelerators

The natural evolution does not involve only GPUs:

  • Increasingly vectorized CPUs
  • Extensions such as MMA (Matrix-Multiply Assist)
  • Specific, low-power accelerators (such as the upcoming Spyre)
  • Closer integration with the operating system and the data stack

This type of acceleration is especially relevant on Power architectures, where the design prioritizes sustained throughput, consistency and reliability, not just peak FLOPS.

🧩 Why AIX?

Running this on AIX is not a necessity; it is a conscious choice to:

  • Understand the real limits
  • Explore technical feasibility
  • Dismantle simplistic assumptions
  • Learn how LLMs fit into existing Power systems

Many Power customers operate stable, amortized and critical infrastructures, where moving data to the cloud or introducing GPUs is not always desirable or feasible.

🔍 What Llama-AIX is (and isn’t)

  • ✔ A technical PoC
  • ✔ An honest exploration
  • ✔ An engineering exercise
  • ✔ Open source
  • ✖ Not a benchmark
  • ✖ Not a complete AI platform
  • ✖ Not intended to compete with GPU solutions
  • ✖ Not “AI marketing”.

The idea is simple: look beyond the hype, understand the nuances and assess where LLMs bring real value in Power and AIX environments.

Purely out of technical curiosity.

And because experimenting is still a fundamental part of engineering.

💬 In what specific use case would an on-prem LLM in Power make sense to you?

#LibrePower #AIX #IBMPower #LLM #RAG #OpenSource #EnterpriseArchitecture #AIOnPrem

We launched LibrePower (and this is its Manifesto)

We want to unleash the full potential of IBM Power

We build community to grow the Power ecosystem – more solutions, more users, more value.

The most capable hardware you’re not yet taking full advantage of

IBM Power underpins mission-critical computing around the world. Banking transactions, flight reservations, hospital systems, SAP installations – workloads that can’t afford to fail run on Power.

This is no coincidence.

Power systems offer legendary reliability thanks to their RAS (Reliability, Availability, Serviceability) architecture that x86 simply cannot match. They run trouble-free for 10, 15 years or more. Their design – large caches, SMT8, extraordinary memory bandwidth – is built to sustain performance at scale.

But there is an opportunity that most organizations are missing:

Power can do much more than what is usually asked of it.

The capacity is there. The potential is enormous. What has been missing is independent validation, momentum from the community and accessible tools that open the door to new use cases.

Exactly what LibrePower is building.


What is LibrePower?

LibrePower is a community initiative to extend what is possible in IBM Power – across the entire ecosystem:

  • AIX – The veteran Unix that runs the most demanding enterprise loads
  • IBM i – The integrated system that thousands of companies around the world run on
  • Linux on Power (ppc64le) – Ubuntu, Rocky, RHEL, SUSE, Fedora on POWER Architecture

We are not here to replace anything. We come to add:

  • More certified solutions running on Power
  • More developers and administrators relying on the platform
  • More reasons to invest in Power – both in new and existing equipment

What we do

1. LibrePower Certified – independent validation

ISVs and open source projects need to know that their software works on Power. Buyers need confidence before deploying. IBM certification has its value, but there is room for independent community-driven validation.

LibrePower Certified offers three levels:

  • Works on Power: Compiles and runs correctly on ppc64le. Basic validation.
  • Optimized for Power: Tuned for SMT, VSX/MMA. Includes performance data.
  • 🏆 LibrePower Certified: Full validation + case study + ongoing support.

Free for open source projects. Premium levels for ISVs looking for deeper collaboration.

2. Open source repositories

We compile, package and distribute software that the Power community needs:

  • AIX (aix.librepower.org): modern CLI tools, security utilities, compatibility layers
  • ppc64le: container tools (AWX, Podman), development utilities, monitoring
  • IBM i: open source integration (under development)

Everything is free. Everything is documented. Everything is in GitLab.

3. Performance testing and optimization

Power hardware has unique features that most software does not take advantage of. We benchmark, identify opportunities and work with upstream projects to improve performance on Power.

When we find optimizations, we contribute them back. The entire ecosystem benefits.

4. Building community

The Power world is fragmented. AIX administrators, Linux on Power teams, IBM i environments – all solving similar problems in isolation.

LibrePower connects these communities:

  • Cross-platform knowledge sharing
  • Amplify the collective voice to manufacturers and projects
  • Create a network of expertise in Power

5. Expanding the market

Every new solution validated in Power is one more reason for organizations to choose the platform. Every developer who learns Power is talent for the ecosystem. Every success story demonstrates value.

More solutions → more adoption → stronger ecosystem → more investment in Power.

This virtuous circle benefits everyone: IBM, partners, ISVs and users.

Why should you care?

If you manage Power systems:

  • Access tools and solutions you were missing
  • Join a community that understands your environment
  • Maximize the return on your hardware investment

If you are a developer:

  • Learn a platform with unique technical features
  • Contribute to projects with real impact in the enterprise world
  • Develop expertise in a high-value niche

If you are an ISV:

  • Get your software validated in Power
  • Connect with enterprise customers
  • Discover market opportunities in the Power ecosystem

If you are evaluating infrastructure:

  • Find out what’s really possible on Power beyond traditional workloads
  • Find independent validation of solutions
  • Connect with the community to learn about real experiences

What we are working on

AIX (aix.librepower.org)

  • ✅ Modern package manager (dnf/yum for AIX)
  • ✅ fzf – fuzzy search engine (Go binary compiled for AIX)
  • ✅ nano – modern editor
  • 2FA tools – Google Authenticator with QR codes
  • 🔄 Coming soon: ripgrep, yq, modern coreutils
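The package manager above follows the familiar dnf workflow. A minimal usage sketch, assuming the aix.librepower.org repository is already configured and that the tools are packaged under their usual upstream names (not confirmed here):

```shell
# Assumes the LibrePower AIX repository is already configured for dnf;
# the package names below are the usual upstream names, used as examples.
dnf repolist                  # list the repositories dnf currently knows about
dnf search fzf                # check whether a tool is available for AIX
dnf install -y nano fzf       # install the modern editor and fuzzy finder
```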

Linux ppc64le

  • 🔄 AWX – Ansible automation (full port in progress)
  • 🔄 Container Tools – Podman, Buildah, Skopeo
  • 🔄 Development tools – lazygit, k9s, modern CLIs

IBM i

  • 📋 Planning phase – assessing priorities with community input.

The vision

Imagine:

  • Every major open source project considers Power at the time of release
  • ISVs see Power as a strategic platform, not an afterthought
  • Organizations deploy new workloads on Power with confidence
  • A connected and growing community that powers the ecosystem

That’s The Power Renaissance.

It is not nostalgia for the past. It is not just extending the life of existing deployments.

Active expansion of what Power can do and who uses it.


Join

LibrePower grows with the community. This is how you can participate:

Who is behind it?

LibrePower is an initiative of SIXE, an IBM and Canonical Business Partner with more than 20 years of experience in the Power ecosystem.

We have seen what Power can do. We’ve seen what’s missing. Now we build what should exist.

LibrePower – Unlocking the potential of Power Systems through open source software. Unmatched RAS. Superior TCO. Minimal footprint. 🌍

What is OpenUEM? The Open Source revolution for device management

In the world of system administration, we often encounter oversized, expensive and complex tools. But when you look at what people are actually searching for in terms of efficient alternatives, one name keeps coming up: OpenUEM.

From SIXE, as specialists in infrastructure and open source, we want to answer the most frequently asked questions about this technology and explain why we have opted for it.


What is OpenUEM and how does it work?

OpenUEM (Unified Endpoint Management) is an open source solution designed to simplify the lives of IT administrators. Unlike traditional suites that charge per agent or device, OpenUEM offers a centralized platform for inventory, software deployment and remote management of equipment without abusive licensing costs.

Its operation stands out for its simplicity and efficiency:

  1. The agent: A small program installed on each endpoint device.

  2. The server: Collects information in real time and lets you execute actions.

  3. The web console: From a browser, the administrator can view the entire device fleet, install applications or connect remotely.

OpenUEM vs. other traditional tools

One of the most common questions is how this tool compares to the market giants. We have put together a list of pros and cons from SIXE’s perspective, so you can draw your own conclusions :)

  • In favor:

    • Cost: Being Open Source, you eliminate licensing costs. Ideal for SMBs and large deployments where the cost per agent skyrockets.

    • Privacy: It’s self-hosted. You control the data, not a third-party cloud.

    • Lightweight.

  • Against:

    • Being a younger tool, it may not (yet) have the infinite plugin ecosystem of solutions that have been on the market for 20 years. However, it more than covers 90% of the usual management needs.

How to integrate OpenUEM with your current IT infrastructure?

Integration is less traumatic than it seems. OpenUEM is designed to coexist with what you already have.

  • Software deployment: Integrates natively with repositories such as Winget (Windows) and Flatpak (Linux), using industry standards instead of closed proprietary formats.

  • Remote support: Incorporates proven technologies such as VNC, RDP and RustDesk so you can support remote employees without complex VPN configurations in many cases.
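To make the deployment side concrete, this is roughly what those standards look like at the command line. The Firefox package IDs are just illustrative examples, not anything OpenUEM-specific:

```shell
# Windows endpoint: Winget installs by package ID from its community repository
winget install --id Mozilla.Firefox --silent --accept-package-agreements

# Linux endpoint: Flatpak installs the equivalent application from Flathub
flatpak install -y flathub org.mozilla.firefox
```

Because both are open, standard repositories, the same package IDs work whether an administrator pushes them from a management console or an endpoint runs them locally.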

If you’re wondering how to set up OpenUEM to manage employee devices remotely, the answer lies in its flexible architecture. The server can be deployed via Docker in minutes, allowing agents to report securely from any location with internet access.
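As a rough sketch of that deployment pattern only: the image name, port and volume path below are placeholders, not official OpenUEM artifacts; the project’s repository ships the actual docker-compose file.

```shell
# Placeholders throughout: substitute the real image, port and volume
# from the official OpenUEM repository's docker-compose file.
docker volume create uem-data
docker run -d \
  --name uem-server \
  -p 8080:8080 \
  -v uem-data:/var/lib/uem \
  example/uem-server:latest
# Agents then report over HTTPS to the server's published port from any
# location with internet access; the web console lives on the same port.
```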

Who offers OpenUEM support and solutions for companies?

Although the software is free, companies need guarantees, support and a professional implementation. This is where we come in. At SIXE, we don’t just implement the tool; we offer the necessary business support so you can sleep easy. We know that integrating a new platform can raise questions about pricing or deployment models for small and medium-sized businesses. That’s why our approach is not to sell you a license (there aren’t even any), but to help you deploy, maintain and secure your device management infrastructure with OpenUEM.

Contact us!

If you are looking for a platform to manage your mobile and desktop devices that is transparent, auditable and cost-effective, OpenUEM may be a solution for you. Want to see how it would work in your environment? Take a look at our professional OpenUEM solution and find out how we can help you manage the control of your devices. For those who are more curious or want to play around with the tool on their own, we always recommend visiting the official OpenUEM repository.
