

AI Infrastructure · March 2026

What is an AI Factory and how to build one with open source in your own data centre

The AI Factory concept has been everywhere for two years — but few organisations actually understand what it takes to build one, or how to do it without being locked into a cloud provider. Here’s a straight-talking breakdown, with the specific stack we use in production.

March 2026 · 20 min read

An AI factory is not a server with a GPU and a model downloaded from Hugging Face. It is a distributed compute infrastructure designed to run language and vision models in production continuously, at scale, and under full organisational control. The good news: building one is no longer the exclusive privilege of hyperscalers. The open source technology powering the Barcelona Supercomputing Center’s AI Factory, and the sovereign AI infrastructure programmes across Europe, is available to any organisation with its own data centre. What follows is a practical guide to what you need, what you don’t, and how to decide whether it makes sense for you.

Just a bit of context

What exactly is an AI factory?

The term “AI Factory” was popularised by NVIDIA’s Jensen Huang in 2023 to describe what data centres are becoming: machines that produce intelligence continuously, the way a factory produces goods. The metaphor isn’t poetic — it’s technically precise.

A classic AI factory has four distinct components: a storage system for model weights and datasets (which run to tens or hundreds of gigabytes), a GPU compute layer for running inference, an orchestrator managing which model runs on which hardware, and an API that exposes models to the rest of the organisation. When those four components work efficiently together, you have an AI factory.

What differentiates it from “having an LLM running on a server” is scale, reliability, and management. An AI factory serves multiple models in parallel, handles request queuing, guarantees availability, and monitors resource usage. It’s production infrastructure — not a test environment.

Worth knowing

The European Commission has committed over €1.5 billion to building AI Factories distributed across member states under the EuroHPC programme. The explicit goal is for Europe to have sovereign AI infrastructure that doesn’t depend on US or Asian providers. Spain participates through the BSC. The same technology stack they use can be deployed in your data centre.

Why bring AI inference in-house?

Why organisations are moving their AI workloads on-premises

Three arguments come up in every conversation we have with clients evaluating on-premise AI infrastructure. These aren’t marketing talking points — they’re operational and financial realities.

💸

Predictable costs

GPU bills in public cloud can swing 30–40% between billing cycles depending on demand. With your own infrastructure, the cost is fixed, depreciable, and entirely predictable. At medium inference volumes, the investment typically pays back in 12–18 months compared to cloud.

🔓

Zero vendor lock-in

Proprietary APIs, closed formats, fine-tuned models living in someone else’s infrastructure. With an open source stack, your models and your data are yours — always portable, no exit negotiations, no lock-in contracts.

🏛️

Regulatory compliance

GDPR and the EU AI Act require knowing precisely where data is processed. If your inference touches patient data, citizen records, or financial data, you need full control over the infrastructure. On-premises is the only architecturally sound answer.

The question is no longer whether to build your own AI infrastructure, but when and how to do it without repeating the mistakes of the cloud rush a decade ago: velocity without architecture.
— SIXE Technical Team

That said, on-premise AI infrastructure isn’t right for everyone. If you’re running ten inference requests a day and have no strict regulatory requirements, cloud is probably the right answer right now. On-premises starts making sense when volumes are sustained, when data is sensitive, or when you need to run proprietary fine-tuned models without exposing weights to third parties.

OK so… how do I actually build this?

The open source stack: three technologies, zero proprietary dependencies

A combination of three technologies has emerged as the de facto standard for building on-premise AI factories in European enterprise environments. The same stack the BSC uses. The same technologies driving sovereign AI infrastructure in France, Germany, and Italy. And the same stack we deploy at SIXE.

Ceph: distributed storage built for AI workloads

Language models are heavy. Llama 3 70B occupies roughly 140 GB in float16 precision (around 40 GB once quantised to 4-bit). Mixtral 8x7B sits around 90 GB. A reasonable model catalogue for a mid-sized organisation can easily exceed 500 GB — before accounting for fine-tuning datasets or inference logs.
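Those figures are easy to sanity-check: the weight footprint is just parameter count times bytes per parameter. A back-of-the-envelope sketch (the quantisation figure is an approximation, not a vendor spec):

```python
def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights alone (no KV cache, no activations)."""
    return params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

# float16 / bfloat16 store 2 bytes per parameter
print(weight_footprint_gb(70, 2))    # Llama 3 70B in fp16 -> 140.0 GB
print(weight_footprint_gb(46.7, 2))  # Mixtral 8x7B (~46.7B total params) -> ~93 GB
print(weight_footprint_gb(70, 0.5))  # the same 70B model, 4-bit quantised -> 35.0 GB
```

Add serving overhead on top and you see why a modest catalogue crosses the 500 GB mark quickly.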

Ceph solves this with distributed storage that unifies object storage (natively S3-compatible), block storage, and filesystem in a single cluster. It scales from terabytes to petabytes without interruption, supports erasure coding for storage efficiency, and has native Kubernetes integration via CSI. In an AI factory, Ceph acts as the backbone where model weights, datasets, and inference results live.
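The storage-efficiency argument for erasure coding is easy to quantify. A minimal sketch comparing a 3x replica pool with a common 4+2 EC profile (the profile choice is illustrative, not a recommendation):

```python
def raw_per_usable(mode: str, k: int = 0, m: int = 0, copies: int = 0) -> float:
    """Raw bytes stored per usable byte in a Ceph pool."""
    if mode == "replica":
        return float(copies)          # each object stored `copies` times
    if mode == "ec":
        return (k + m) / k            # k data chunks + m coding chunks
    raise ValueError(mode)

# 500 GB of model weights on each layout:
print(raw_per_usable("replica", copies=3))  # 3.0 -> 1.5 TB raw
print(raw_per_usable("ec", k=4, m=2))       # 1.5 -> 750 GB raw
```

The trade-off is CPU and reconstruction cost on reads during failure, which is one reason EC profiles need sizing against the actual workload.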

SIXE Perspective

We are a Canonical Partner and have been deploying Ceph clusters in production for years, including AI and HPC environments. Ceph is not a checkbox item: it requires careful sizing, network design, and replication policies adapted to your workload. Three-node clusters have quorum considerations that shouldn’t be improvised. We offer dedicated training and support so your team can operate Ceph with real autonomy — not permanent consulting dependency.

OpenStack: your private cloud with native GPU management

OpenStack turns your hardware into a private enterprise cloud. For an AI factory, its primary role is GPU resource management: PCI passthrough for direct GPU access from VMs, vGPU for sharing a physical GPU across multiple workloads, and NVIDIA MIG (Multi-Instance GPU) for partitioning A100 and H100 GPUs into independent instances.

Under the Linux Foundation since 2025, OpenStack runs in production across more than 45 million cores at organisations including Walmart, GEICO, and LINE Corp. This isn’t emerging technology — it’s proven infrastructure at real scale, with independent governance that guarantees continuity.

Worth noting

OpenStack is not trivial. It spans more than 30 service projects and requires teams with experience in distributed systems. If your team comes from a VMware background, the learning curve is real. Our training service covers practical, hands-on upskilling so your team can operate the stack independently — without long-term consulting dependency.

Kubernetes + vLLM: the inference orchestration layer

Kubernetes is the CNCF standard for container workload orchestration, with native GPU scheduling via the NVIDIA GPU Operator. The inference engines are deployed on top of Kubernetes — and vLLM is the most significant for language models right now.

vLLM implements PagedAttention, a technique that manages KV cache memory efficiently and enables parallel serving of multiple requests without wasting VRAM. In representative benchmarks, vLLM delivers 3–5x the throughput of a naive implementation of the same model. It exposes an OpenAI-compatible API, which makes migrating applications already consuming GPT-4 or similar models straightforward.
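To see why KV cache management dominates serving memory, consider the per-token cache cost. Illustrative arithmetic assuming a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, fp16 — our assumed example, not a figure from the article):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    # one Key and one Value vector per layer, per attention head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(32, 32, 128)
print(per_token)                  # 524288 bytes = 0.5 MiB per token
print(per_token * 4096 / 2**30)  # one full 4k-token sequence: 2.0 GiB
```

Reserving that much contiguous VRAM per request up front wastes most of it on short sequences; PagedAttention instead allocates the cache in small pages on demand, which is where the throughput gain comes from.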

For vision or embedding models, NVIDIA’s Triton Inference Server complements vLLM and enables hardware-specific optimisations such as TensorRT-LLM.
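Because vLLM speaks the OpenAI chat-completions dialect, a client needs nothing beyond the standard library. A sketch assuming a vLLM server at localhost:8000 (hypothetical host; the model name must match whatever your deployment actually loaded):

```python
import json
from urllib import request

# Hypothetical local endpoint; vLLM exposes /v1/chat/completions by default.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",  # assumption: the served model
    "messages": [{"role": "user", "content": "Summarise Ceph in one sentence."}],
    "max_tokens": 256,
    "temperature": 0.2,
}
body = json.dumps(payload).encode()

def send(url: str = VLLM_URL) -> dict:
    """POST the request; only works with a live inference server."""
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# send()  # uncomment on a host where the endpoint is reachable
print(json.dumps(payload, indent=2))
```

An application already using an OpenAI SDK only needs its base URL repointed at this endpoint, which is what makes the migration path friction-free.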

How does an AI factory take shape?

Reference architecture: from data to model in production

An on-premise AI factory with this stack follows a four-layer flow. It’s not the only possible design, but it’s the one that best balances operational complexity, performance, and portability.

01 — Data

Ceph S3

Models, datasets, inference results. S3-compatible API for integration with MLOps pipelines.

02 — Compute

OpenStack

GPU scheduling, bare metal, isolated project networks. PCI passthrough and MIG for maximum efficiency.

03 — Orchestration

Kubernetes

GPU Operator, inference pod autoscaling, deployment lifecycle management.

04 — Production

vLLM / Triton

Inference APIs, RAG, agents. OpenAI-compatible endpoints for friction-free integration.

The key to this design is that each layer is independent and replaceable. If a better orchestrator than Kubernetes emerges for AI workloads tomorrow, you can swap it out without touching storage or the compute layer. That’s what zero vendor lock-in really means: not just open source software, but genuine separation of concerns in the architecture.
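To make the orchestration layer concrete, here is roughly what a vLLM deployment object looks like once the NVIDIA GPU Operator is in place. All names, the image tag, and the GPU count are illustrative assumptions; since kubectl accepts JSON as well as YAML, a plain Python dict serialises directly into an applyable manifest:

```python
import json

# Sketch of a Deployment requesting four GPUs for tensor-parallel serving.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "vllm-llama3"},          # illustrative name
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "vllm-llama3"}},
        "template": {
            "metadata": {"labels": {"app": "vllm-llama3"}},
            "spec": {
                "containers": [{
                    "name": "vllm",
                    "image": "vllm/vllm-openai:latest",  # assumed tag
                    "args": ["--model", "meta-llama/Meta-Llama-3-70B-Instruct",
                             "--tensor-parallel-size", "4"],
                    "ports": [{"containerPort": 8000}],
                    # the GPU Operator makes this resource schedulable
                    "resources": {"limits": {"nvidia.com/gpu": 4}},
                }],
            },
        },
    },
}

print(json.dumps(deployment, indent=2))  # pipe into `kubectl apply -f -`
```

The point of the sketch is the separation of concerns: nothing in it references Ceph or OpenStack directly, so either layer can be swapped without touching this manifest.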

Component
Role in the factory
Viable alternatives
Governance

Ceph
Model and data storage
IBM Storage Scale (GPFS)
Linux Foundation

OpenStack
Private cloud with GPU management
MaaS + direct bare metal
OpenInfra / LF

Kubernetes
Container orchestration
MicroK8s, OpenShift
CNCF / LF

vLLM
LLM inference engine
Triton, TensorRT-LLM
Apache 2.0

Ubuntu / Canonical
Base OS + stack support
RHEL, SUSE
Canonical Partner

Is this right for my organisation?

Who actually needs an on-premise AI factory

Not every sector has the same urgency or the same constraints. In four areas, on-premise AI infrastructure isn’t a preference — it’s the only architecturally viable answer.

🏥

Healthcare & pharma

Clinical records, diagnostic imaging, genomic data. GDPR and the EU Health Data Space Regulation impose strict restrictions on transfers to third countries without explicit safeguards. On-premise inference is the default compliance architecture.

🏦

Banking & insurance

Credit scoring, fraud detection, risk analysis. EBA guidelines on AI in financial services and the EU AI Act classify these systems as high-risk, with traceability and control requirements that only on-premises architectures can meet.

🏛️

Public sector & defence

Technological sovereignty, NIS2, classified data. The EU’s AI strategy explicitly requires that public-facing AI systems operate on European or national infrastructure. No discussion needed.

🏭

Industry & manufacturing

Computer vision on production lines, predictive maintenance, quality control. Cloud latency is not viable when you need millisecond response times on the factory floor. Edge or on-premises inference is the only workable model.

FAQ

Questions to answer before you start

Building an on-premise AI factory is not a weekend project. It requires honest prior analysis across four dimensions that determine whether it makes sense and how to execute it well.

Which models are you serving, and at what request volumes?

GPU sizing depends directly on model size (parameter count and precision) and throughput targets (requests per second, acceptable P99 latency). A 7B parameter model in float16 fits in a single L40S GPU with 48 GB of VRAM. A 70B model requires multiple GPUs with tensor parallelism. There are no shortcuts here: correct sizing requires knowing real workloads, not optimistic estimates.
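That sizing logic can be sketched as a first-order estimate: weights plus a headroom factor for KV cache and activations, split across identical GPUs. The 30% headroom is our rule-of-thumb assumption, not a guarantee, and real tensor-parallel degrees are usually rounded to 2, 4, or 8:

```python
import math

def gpus_needed(params_billion: float, bytes_per_param: float,
                gpu_vram_gb: float, headroom: float = 0.3) -> int:
    """First-order GPU count: weights + ~30% headroom, divided by per-GPU VRAM."""
    weights_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return math.ceil(weights_gb * (1 + headroom) / gpu_vram_gb)

print(gpus_needed(7, 2, 48))   # 7B fp16 on a 48 GB L40S -> 1
print(gpus_needed(70, 2, 80))  # 70B fp16 on 80 GB GPUs -> 3 (round up to 4 in practice)
```

The estimate ignores batch size and latency targets entirely, which is exactly why it can only be a starting point for testing against real workloads.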

Does your team have the capacity to operate this stack?

The most important question — and the one asked least often. A team with experience in Linux, Kubernetes and distributed systems can learn to operate this stack. But if you’re starting from scratch, the learning curve needs to be inside the plan, not outside it. SIXE offers certified training in Ceph, OpenStack and Kubernetes (as an IBM Business Partner and Canonical Partner) precisely so the transition doesn’t create permanent consulting dependency.

What is the real 3-year TCO?

The software is open source, so there are no licence costs. The investment is hardware (GPUs, servers, high-speed networking) plus team upskilling. Compared to cloud GPU costs at the same inference volume over that period, the numbers tend to speak for themselves. But the financial model must include maintenance, updates, and staff operating time. Nothing is free — and projects that start from that assumption tend to hit unpleasant surprises.

How we work at SIXE

Before any deployment, we carry out an architecture assessment: we audit your real workloads, latency requirements, data volumes, and regulatory obligations. We deliver a complete design — GPU sizing, network topology, Ceph storage layout, and a 12–24 month capacity plan. No smoke, no savings promises we haven’t calculated. Just a technical analysis of whether it makes sense, and how to execute it.


Have an AI inference project in mind?

Your AI factory, built with the stack we use ourselves

IBM Business Partner and Canonical Partner. Over 15 years deploying open source in production. We design the architecture, train your team, and support you until the infrastructure runs on its own. Our work ends when yours truly begins.


Technical Analysis · February 2026

10 Architecture & ROI Mistakes Nobody Assessed in the Post‑VMware Exodus

Open source VMware alternatives are viable. But “viable” and “right for your business” are radically different questions — ones that demand technical, financial and governance analysis before you commit your infrastructure for a decade.

February 2026 · 25 min read · For CIOs, CTOs & Infrastructure Directors
86% of VMware customers are actively reducing their footprint. Only 4% have completed the migration. Between those two numbers lies a chasm of technical, financial and strategic decisions that most organisations are making with more urgency than rigour. This article doesn’t recommend any platform — it lays out the questions every leadership team should answer before committing their infrastructure for the next 5–10 years. Spoiler: there’s no magic answer, but there is a sensible path forward.

Mistake 01 of 10

Confusing urgency with strategy

Broadcom’s acquisition of VMware in November 2023 for $61 billion triggered the biggest upheaval in enterprise virtualization in over a decade. And “upheaval” is putting it mildly. Perpetual licences eliminated, ~8,000 SKUs consolidated into 4 bundles, a 72-core purchase minimum and reported price increases ranging from 150% to 1,500%. Tesco filed a £100 million lawsuit. Fidelity warned of risks to 50 million customers. Safe to say, the landscape encouraged a stampede.

And that’s exactly what happened: a lot of people have been running — just not always in the right direction. A CloudBolt survey (2026) found that 63% have changed their migration strategy at least twice (yes, twice). Gartner estimates VMware’s market share will fall from 70% to 40% by 2029, but the road there is full of twists that deserve considerably more thought than they’re getting.

SIXE Perspective

The urgency Broadcom has created is entirely understandable — nobody enjoys having their bill multiplied overnight. But every infrastructure decision made under pressure becomes technical debt your teams will inherit for years. The first right decision is to separate the immediate tactical response from the medium-term strategy and evaluate each on its own terms. Breathe, plan, then act.

Mistake 02 of 10

Ignoring open source project governance

Not all open source is created equal — far from it. The most relevant difference for a long-term business decision isn’t the licence, but something almost nobody checks: who controls the project and what mechanisms protect the community if commercial priorities shift.

Proxmox Server Solutions GmbH is a private Austrian company with €35,000 in share capital and an estimated team of 14–24 people. Great people, no doubt, but there’s no independent foundation, no open governance board, and no community representation in development decisions. In other words: the future of your virtualization platform depends on a single company’s choices.

Compare that to the MariaDB Foundation, where no single company can hold more than 25% of board seats — a safeguard that protected the project when MariaDB Corporation was acquired by K1 in September 2024. Or OpenStack, now under the Linux Foundation, with governance distributed across hundreds of organisations. Now that’s a safety net.

Key question

Is your virtualization platform — the one that will run your business applications for the next 10 years — governed by an independent foundation, a consortium, or a single private company with fewer than 25 employees? This isn’t a rhetorical question: the answer has direct implications for long-term vendor lock-in risk.

Mistake 03 of 10

Not reading the Contributor Licence Agreement

We know — reading a CLA isn’t exactly a Friday night plan. But it’s worth it. The Proxmox CLA grants the company a perpetual, worldwide, irrevocable licence over all contributions, with the right to relicence them under commercial or proprietary terms. This mechanism isn’t inherently problematic, but it’s exactly the structural combination (single company + AGPL + permissive CLA) that preceded every major licence change of the past seven years. It’s like watching storm clouds gather and saying “I’m sure it won’t rain.”

Project
Year
Change
Consequence
Governance

MongoDB
2018
AGPL → SSPL
Dropped by Debian/Red Hat
Single vendor

Elasticsearch
2021
Apache 2.0 → SSPL
Fork: OpenSearch (Linux Foundation)
Single vendor

HashiCorp
2023
MPL 2.0 → BSL
Fork: OpenTofu · IBM: $6.4B
Single vendor

Redis
2024
BSD → SSPL
Fork: Valkey (Linux Foundation)
Single vendor

MinIO
2021–26
Apache → AGPL → abandoned
Repo: “NO LONGER MAINTAINED”
Single vendor

Kubernetes
—
Apache 2.0 (stable)
—
Foundation (CNCF)

PostgreSQL
—
PostgreSQL Licence (stable)
—
Community

Linux
—
GPLv2 (stable)
—
Foundation (LF)

See the pattern? It’s pretty clear: no project backed by an independent foundation has ever suffered a unilateral licence change. Not a single one. This fact should inform any risk assessment, regardless of which platform you’re considering.

Mistake 04 of 10

Assuming subscription costs will stay flat

Every company that develops open source software needs to monetise its work, and that’s entirely fair — nobody works for free. The question isn’t whether prices will go up (they will, as with everything), but whether you’re factoring that into your TCO model.

Proxmox’s Community subscription went from €49.90 to €120/year (~140% increase), and in January 2026, all tiers rose another 3.8–4.3%. The new Proxmox Datacenter Manager requires at least 80% of nodes to carry Basic or higher subscriptions (€370+/socket/year). Sound familiar? It does to us, too.

OpenStack, Ceph and other VMware alternatives also have their own cost structures. No platform is free in production — if anyone tells you otherwise, smile politely and ask for the receipt. The real difference lies in which costs are predictable and which hinge on unilateral decisions.

SIXE Perspective

When we assess alternatives with our clients, we always model three cost scenarios: optimistic, realistic and adverse, with 5- and 10-year projections that factor in potential licensing changes. Yes, it’s more work. But it’s the only way to build a TCO that won’t crumble at the first price shift.
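The three-scenario model is simple compound arithmetic. A sketch with hypothetical figures (the base spend and growth rates below are placeholders for illustration, not client data):

```python
def projected_cost(annual_cost: float, yearly_increase: float, years: int) -> float:
    """Cumulative spend, assuming the subscription price compounds each year."""
    return sum(annual_cost * (1 + yearly_increase) ** y for y in range(years))

base = 10_000  # hypothetical year-1 subscription spend (EUR)
scenarios = [("optimistic", 0.03), ("realistic", 0.08), ("adverse", 0.25)]
for label, rate in scenarios:
    print(label,
          round(projected_cost(base, rate, 5)),    # 5-year total
          round(projected_cost(base, rate, 10)))   # 10-year total
```

The adverse scenario is the one that matters: if a single repricing event can double the 10-year figure, the TCO comparison against other platforms changes shape entirely.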

Mistake 05 of 10

Underestimating the operational complexity of OpenStack

Let’s give credit where it’s due: OpenStack runs in production with over 45 million cores at organisations like Walmart, GEICO and LINE Corp. Its governance — now under the Linux Foundation — is among the strongest in the ecosystem. These are genuine, weighty advantages.

But (there’s always a but) the project itself acknowledges that 44% of IT vendors and 39% of enterprises report difficulty finding qualified professionals. The platform comprises over 30 service projects. Deploying and operating that stack requires teams with distributed systems experience that most VMware teams don’t have — not for lack of talent, but because these are fundamentally different skill sets. It’s like asking a brilliant Mediterranean chef to compete in a sushi championship: both are world-class cooking, but the skills don’t transfer automatically.

Important nuance

OpenStack can be the perfect choice for organisations with the right scale and the right teams. Managed offerings (Canonical, Mirantis, Platform9) solve part of the complexity, though they add cost and dependency. The question isn’t “OpenStack yes or no?” but “do our team, budget and scale match what OpenStack needs to thrive?”

Mistake 06 of 10

Assuming Ceph is “just a checkbox”

Ceph is powerful and runs in production at impressive places: CERN, Bloomberg and DigitalOcean, among others. But the gap between a hyperscale Ceph cluster and the typical 3–5 node deployment in a VMware migration is like the gap between flying an Airbus and a microlight: both fly, but the rules are very different.

StarWind benchmarks in Proxmox HCI environments showed Ceph reaching 270,000 IOPS in mixed 4K workloads, compared to 1,088,000 IOPS for LINSTOR/DRBD and 1,199,000 for StarWind VSAN. And the risks of small clusters deserve close attention: losing one node in a 3-node cluster with 3x replication can leave the system in read-only mode. Not exactly what you want at 3 AM on a Monday.

Alternatives worth considering include LINSTOR/DRBD with near-native performance, StorPool with sub-100 µs latencies, and IBM Storage Virtualize with proven enterprise integration. Each has its strengths, limitations and expertise requirements.

SIXE Perspective

The storage layer is arguably the most critical and least reversible decision in the entire migration. It’s where you can’t afford to wing it. It must be evaluated with real testing on real workloads — not generic benchmarks or “it works great for someone I follow on LinkedIn.” This is precisely where an integrator with experience across multiple storage platforms adds the most value.

Mistake 07 of 10

Not auditing the enterprise features you’ll lose

VMware has spent over a decade building capabilities that many organisations have baked into their operational workflows. They’re the sort of things that “just work” — until they’re gone. And when they disappear, the impact goes well beyond technology.

⚖️

Automatic load balancing (DRS)

Proxmox has no native equivalent: you’ll need custom scripting. OpenStack offers partial functionality via Nova scheduling. Either way, prepare to roll up your sleeves.

🛡️

Fault Tolerance & DR

VMware FT/SRM deliver automatic failover. Open source alternatives require custom orchestration with ZFS replication, Ansible and manual runbooks. It works, but someone has to build and maintain it.

🌐

SDN & microsegmentation

Proxmox SDN supports VLANs/VXLANs/EVPN, but IPAM/DHCP are in “tech preview” (read: not quite ready for prime time). There’s no equivalent to the NSX distributed firewall.

📋

Vendor certifications

SAP, Oracle and Microsoft don’t certify Proxmox. NVIDIA AI Enterprise isn’t officially supported either. If you depend on these certifications, this is a detail you don’t want to overlook.

🔧

Automation & API

The Terraform providers for Proxmox are community-maintained. The API requires manually specifying the target node. None of this is a deal-breaker, but these frictions add up.

📞

24/7 support

Proxmox operates on Austrian business hours (7:00–17:00 CET), with no 24/7 option at any tier. When production goes down at 3 AM, there’s literally nobody to call. And no, Google doesn’t count.

None of these limitations invalidate the platform — let’s be clear about that. But each one represents a gap you’ll need to fill with engineering, tooling or consultancy, and every workaround carries a cost that belongs in your financial model. Pretending they don’t exist is a recipe for unpleasant surprises.

Mistake 08 of 10

Calculating ROI on the licensing line alone

The licence savings are real. But presenting those savings as the project’s ROI is like valuing a house move solely by the cost of the van: technically correct, practically incomplete.

Gartner estimates migration services cost between $300 and $3,000 per VM. 44% of organisations experience unplanned downtime during migration. And projects estimated at 6 months routinely turn into 24 — a pattern that’s by now very well documented.

The hidden costs that erode your ROI

Training: $5,000–15,000/engineer + 3–6 months of reduced productivity (your team learns fast, but not overnight). Integration: backup, monitoring, CMDB, automation — mature connectors for VMware that simply don’t exist for Proxmox. Custom development: load balancing scripts, DR, monitoring = internal technical debt. Turnover: when the engineer who built them leaves, the knowledge walks out the door. And it always happens at the worst possible time.

Mistake 09 of 10

Designing for Day 1 and ignoring Day 2

The migration is just the beginning — the honeymoon, if you like. The real cost shows up in day-to-day operations, year after year.

Cybernews found that many organisations that migrated to Proxmox aren’t keeping up with security updates. It’s understandable: when all the effort goes into migrating, it’s easy to forget that afterwards you still have to operate. At scale, the UI becomes unresponsive with several thousand VMs, and pmxcfs hits its limits around 11,000 VMs.

🔐

Security & compliance

CVE management, hardening, audits (ISO 27001, PCI DSS, SOC 2). What SLA for critical vulnerability response does each platform offer? An awkward question, but a necessary one.

📊

Telemetry & observability

Open source alternatives (Prometheus, Grafana, Zabbix) are excellent — they truly are — but they require dedicated integration and maintenance. They don’t configure themselves.

💾

Backup & recovery

Proxmox Backup Server is functional, but it plays in a different ecosystem league compared to Veeam, Commvault or IBM Spectrum Protect.

🏗️

Technical debt

Every custom script is code that needs maintaining, documenting, testing and transferring. Technical debt is invisible until it isn’t — and then it’s the only thing you can see.

Mistake 10 of 10

Thinking that technology is the decision

Migrating away from VMware isn’t a technology project: it’s an operational transformation that touches people, processes, vendors, budgets and risk. Technology matters, of course. But it’s just one piece of the puzzle.

The VMware migration isn’t a technology problem. It’s an operational decoupling problem disguised as a technology problem.
— Keith Townsend, The CTO Advisor

The questions that actually matter: what level of governance risk is acceptable to us? Do we have the team we need — or can we upskill in time? What’s our real TCO at 5 and 10 years? How do we protect our investment if the open source project changes course? Which workloads do we migrate first, and which ones should perhaps never move? And who will be our technical partner when things don’t go as planned? (Because at some point, they won’t. That’s life.)

Our conviction

Open source is, without question, the right answer for the infrastructure of the future. We have zero doubt about that. The question isn’t whether to migrate, but how to do it with the rigour the decision deserves. The difference between a successful migration and one that generates years of headaches comes down to the quality of the upfront analysis, the architecture chosen and expert support throughout the process. And hey, if after reading all of this you’d like to talk, we’re right here.


The best migration is one made with judgement, not haste

Over 15 years designing, deploying and operating mission-critical infrastructure. We know VMware, Proxmox, OpenStack, Ceph, IBM Storage and the alternatives because we work with all of them. No favourites — just the solution that fits your business.

Running Liquid AI’s New Model on IBM AIX (No GPU Required)

Forget the H100 clusters for a moment. At SIXE, we decided to push enterprise hardware to its absolute limits to answer a burning question: Can a 2018-era IBM Power System, running AIX and relying purely on CPU, handle the latest generation of AI models?

We took Liquid AI’s new LFM2.5-1.2B model and ran it on an IBM POWER9 processor. To our knowledge, this is the first time an LFM2.5 model has ever run on AIX in Big-Endian mode.

The Result?

Nearly 27 tokens per second, coherent responses, and under 750 MB of memory usage. No GPU. No NPU. Just raw Power architecture muscle.

But raw speed is only half the story. To prove this isn’t just a benchmark toy, we put LFM2.5 through a “SysAdmin Gauntlet”—real AIX administrative tasks—and compared it against a standard Transformer (TinyLlama 1.1B). The results were shocking.

The “Secret Sauce”: What is LFM2.5?

LFM2.5 is a hybrid architecture designed for extreme efficiency, mixing Convolutional blocks (shortconv) for speed and Attention layers (GQA) for context. It features a massive 128k context window—enough to read thousands of lines of logs without forgetting the beginning.

The Hardware: IBM Power System S924

We used the workhorse of the enterprise world. Below are the specific configurations used for this benchmark:

Server: IBM Power System S924
Processor: IBM POWER9 @ 2.75 GHz
Configuration: Enterprise Multi-Core (SMT-8 capable)
OS: AIX 7.3 TL4
Architecture: Big-Endian

Performance: Finding the Sweet Spot

We found that throwing every core at the model actually hurts performance due to synchronization overhead. The verdict was clear: using just 8 cores in SMT-2 mode (16 threads) gave us 26.7 tokens per second.


Real-World Test: The SysAdmin Gauntlet

Numbers are nice, but can it actually work? We ran the exact same prompts through both LFM2.5-1.2B and TinyLlama 1.1B (a pure transformer of similar size). Here are the raw outputs.

Round 1: The Cryptic Error (errpt)

We fed the models a raw AIX error log regarding a power supply failure.

The Input:

$ errpt -a -l BFE4C025

Description: Power/Cooling subsystem Unrecovered Error,
bypassed with loss of redundancy.
Possible FRUs:
Priority: H
FRU: PWRSPLY
Location: U78D2.001.WZS00P4

  • ❌ TinyLlama 1.1B:

    “7. What is the location of the FRU that is affected by this error? The FRU that is affected by this error is PWRSPLY. 8. What is the description of the FRU that is affected by this error? The FRU that is affected by this error is PWRSPLY…”

    Verdict: Fails. It got stuck in an infinite loop repeating “PWRSPLY”.

  • ✅ LFM2.5 1.2B:

    “What is the likely cause? The error usually points to a problem with the power or cooling hardware (like a fan or power supply) that couldn’t be fixed automatically. It might be a hardware failure or a misconfiguration. What action should I take? First, check the hardware — look at the fans (especially the power fans)…”

    Verdict: Pass. It identified the component and gave actionable advice to check the fans.

Round 2: The Security Audit (last)

We provided a login history log containing a hidden crash event.

The Input:

$ last -20

root pts/0 10.10.10.1 Jan 28 11:14 – 18:38 (07:23)
reboot ~ Jan 28 11:12
root pts/0 10.10.10.1 Jan 27 22:09 – System halted abnormally. (13:03)

  • ❌ TinyLlama 1.1B:

    (generated 1 token, then stopped)

    Verdict: Fails. Absolute silence.

  • ✅ LFM2.5 1.2B:

    “Suspicious patterns: The system was halted abnormally on Jan 27. This could indicate a security incident or a misconfiguration. Security recommendations: Investigate the cause of the system halt on Jan 27. Review the login history for any unauthorized access attempts.”

    Verdict: Pass. It immediately spotted the abnormal halt on Jan 27 and recommended an investigation.

Round 3: The Dangerous Advice (/etc/passwd)

We asked the models to audit a standard password file. This is where things got genuinely dangerous.

The Input:

$ cat /etc/passwd

root:!:0:0::/:/usr/bin/bash
daemon:!:1:1::/etc:
bin:!:2:2::/bin:

  • ❌ TinyLlama 1.1B:

    “1. root:!:0:0::/:/usr/bin/bash — This user is not required for system maintenance and should be removed. 2. daemon:!:1:1::/etc: — This user is not required for system maintenance and should be removed.”

    Verdict: CATASTROPHIC FAIL. It claimed the “root” user was not required and recommended deleting it. Following this advice would destroy the server.

  • ✅ LFM2.5 1.2B:

    “Unauthorized or suspicious accounts: You should be very concerned about the presence of the following accounts: guest with a high UID (100), nobody with a high UID (4294967294)… These accounts are either misconfigured, potentially malicious, or not needed.”

    Verdict: Pass. It correctly identified actual potential risks like “guest” and “nobody” accounts with high UIDs.

Round 4: Service Hardening (lssrc -a)

We asked the models to review running services and recommend hardening steps.

The Input:

$ lssrc -a

Subsystem Group Status
sendmail mail active
portmap portmap active
inetd tcpip active
snmpd tcpip active

  • ❌ TinyLlama 1.1B:

    (generated 0 tokens)

    Verdict: Fails. Silence again.

  • ✅ LFM2.5 1.2B:

    “1. sendmail Risk: Running mail daemon can be a target for attacks. Recommendation: Disable sendmail. stopsrc -s sendmail 2. portmap Risk: Running portmap daemon can be abused… stopsrc -s portmap”

    Verdict: Pass. It flagged risky services like sendmail and portmap, and provided the correct AIX command (stopsrc) to disable them.

Why This Matters for IBM Power Users

This benchmark shows that IBM Power Systems can serve as capable AI inference engines for critical, on-premise tasks:

  • Data Sovereignty: Analyze sensitive errpt logs, financial data, or user audits locally. No data leaves your server.
  • Legacy Modernization: Use local LLMs to help understand and document legacy COBOL or C code residing on the server.
  • Efficiency: You don’t need a GPU cluster. You likely already own the hardware capable of doing this.

Try It Yourself

We believe in open source. We have released the AIX port and the converted Big-Endian models.

Code: gitlab.com/librepower/llama-aix
Models: huggingface.co/librepowerai

user@aix:~$ # Quick start on AIX
user@aix:~$ git clone https://gitlab.com/librepower/llama-aix.git
user@aix:~$ cd llama-aix
user@aix:~$ ./scripts/build_aix_73.sh

user@aix:~$ # Optimize threading for the "Sweet Spot"
user@aix:~$ smtctl -t 2 -w now

user@aix:~$ # Have fun!

New IBM FlashSystem 2026 | 5600, 7600 and 9600

IBM has just unveiled the next generation of IBM FlashSystem in Warsaw. Three new models (5600, 7600 and 9600), an integrated AI engine called FlashSystem.ai, the fifth generation of its proprietary flash drive and a message that sounds great in a keynote: “autonomous storage co-run by agentic AI”.

We work with FlashSystem arrays every day. But that’s exactly why we think this deserves a proper analysis ;) So let’s dig into what’s behind each number and where the real value lies.

What IBM brings and why it’s the biggest launch in six years

IBM isn’t exaggerating when they say this is their most significant launch since 2020. It’s not a cosmetic refresh: it’s three simultaneous arrays, a complete redesign of the flash drive and an AI layer that goes well beyond what the previous generation offered.

The underlying idea is that the storage array should stop being “a cabinet that stores data” and become a system that analyses, optimises and protects itself autonomously.

Sam Werner, GM of IBM Storage, was pretty clear during the presentation: it’s not about replacing the administrator, but about freeing them from spending all day on repetitive tasks so they can focus on architecture and planning.

Does it sound like a keynote slide? A bit. But the hardware numbers behind it are real and, in some cases, genuinely impressive.

The 2026 FlashSystem models: FlashSystem 5600, 7600 and 9600

The new lineup replaces the FlashSystem 5300, 7300 and 9500 with substantial improvements in capacity, density and efficiency. All three models adopt the NVMe EDSFF (Enterprise and Data Center SSD Form Factor) drive format, which is the industry standard for maximum density and cooling in data centre environments:

| | FlashSystem 5600 | FlashSystem 7600 | FlashSystem 9600 |
|---|---|---|---|
| Form factor | 1U · 12 drives | 2U · 32 drives | 2U · 32 drives (previously 4U) |
| CPU | 2×12-core Intel Xeon · PCIe Gen 4 | 2×16-core AMD EPYC · PCIe Gen 5 | 2×48-core AMD EPYC · PCIe Gen 5 |
| Raw capacity | 633 TB | 1.68 PB | 3.37 PB (vs 1.8 PB on the 9500) |
| Usable capacity | 400 TBu | 1.2 PBu | 2.4 PBu |
| Effective capacity | Up to 2.4 PBe | Up to 7.2 PBe | Up to 11.8 PBe |
| IOPS | 2.6 million | 4.3 million | 6.3 million |
| Read bandwidth | 30 GB/s | 55 GB/s | 86 GB/s (vs 100 GB/s on the 9500) |
| Ports | 16× FC or 20× Ethernet | 32× FC or Ethernet | 32× FC or Ethernet (vs 48 on the 9500) |
| Use case | Edge, ROBO, small DCs | Virtualisation, analytics | Banking, ERP, AI workloads |

An interesting generational leap: the 7600 and 9600 run AMD EPYC with PCIe Gen 5, while the 5600 stays with Intel Xeon and PCIe Gen 4. This makes sense by segment — PCIe Gen 5 doubles bandwidth per lane, but for the 5600’s edge use case, Gen 4 is more than enough and likely helps keep the price down.

The standout figure is the FlashSystem 5600 packing 2.4 PBe into 1U. For edge environments or data centres with space constraints, this changes the equation. And the 9600 shrinks from 4U to 2U while nearly doubling raw capacity (from 1.8 PB to 3.37 PB). That’s real progress, not marketing.

That said, an important caveat: “effective” capacities (PBe) assume deduplication and compression ratios that depend heavily on the type of data. With already-compressed or encrypted data, those 11.8 PBe on the 9600 become its 2.4 PBu (usable) or 3.37 PB raw. That’s physics, not magic. IBM specifies this in the footnotes, but it’s worth keeping firmly in mind.
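
To make that caveat concrete, here is a back-of-the-envelope sketch in Python. The reduction ratio is simply what IBM's own headline figures imply; the 1:1 case for incompressible (encrypted or pre-compressed) data is our assumption, not an IBM number:

```python
def effective_capacity(usable, reduction_ratio):
    """Effective capacity = usable capacity x assumed data-reduction ratio."""
    return usable * reduction_ratio

usable_pbu = 2.4  # FlashSystem 9600 usable capacity, in PB

# IBM's headline 11.8 PBe implies roughly a 4.9:1 reduction ratio:
implied_ratio = 11.8 / usable_pbu
print(f"Implied reduction ratio: {implied_ratio:.1f}:1")

# With incompressible data the ratio collapses towards 1:1,
# and "effective" capacity falls back to the usable figure:
print(f"Effective at 1:1: {effective_capacity(usable_pbu, 1.0):.1f} PB")
```

In other words, the ~4.9:1 ratio behind the 9600's 11.8 PBe is plausible for compressible datasets, but workloads dominated by encrypted or already-compressed data will land near the 2.4 PBu usable figure.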

Another interesting detail that has gone largely unnoticed: the 9600 drops from 48 to 32 I/O ports and its maximum read bandwidth goes from 100 GB/s to 86 GB/s compared to the 9500. It’s a design trade-off: more density at the cost of some raw connectivity. Depending on your architecture, this may or may not matter, but it’s worth knowing.

The 7600 and 9600 models feature interactive LED bezels to visualise system status. It may seem like a minor detail, but any admin who’s had to identify a chassis at 3 AM in a data centre will appreciate it.

New IBM FlashSystem 2026 family models 5600 7600 9600 front view with interactive LED bezels

New IBM FlashSystem 2026 — models 5600, 7600 and 9600

FlashCore Module 5: QLC performing like TLC (and why that matters)

This is where IBM has a competitive advantage that isn’t smoke and mirrors: they design and manufacture their own flash drives. And in the new fifth-generation FlashCore Module (FCM5), this translates into something very tangible.

The FCM5 comes in capacities of 6.6 / 13.2 / 26.4 / 52.8 and 105.6 TB in the new NVMe EDSFF format. That last figure, 105 TB per drive, is the highest in the industry for enterprise workloads. How do they achieve it? By using QLC NAND with proprietary IP that performs like TLC.

For those who don’t live and breathe storage every day: QLC (Quad-Level Cell) is denser and cheaper than TLC (Triple-Level Cell), but normally has lower write endurance and worse performance. Competitors using standard QLC limit it to read-intensive workloads. IBM, by controlling the drive design end to end, has managed to overcome that limitation. In fact, according to IBM’s own figures, the FCMs achieve 5.5× more write cycles than industry-standard QLC drives.

Alistair Symon, VP of Storage Systems Development, explained during the pre-launch briefing: other manufacturers offer higher-capacity QLC drives, but being standard QLC, they can’t sustain write-intensive workloads throughout the hardware’s depreciation lifecycle. IBM’s FCM5s can.

What else does the FCM5 integrate directly into the hardware?

  • Quantum-safe encryption for all data, directly on the drive
  • Hardware-accelerated compression
  • On-drive deduplication (new in this generation), enabling 5:1 data reduction ratios
  • Hardware-accelerated I/O analytics: complex statistics on every operation with zero performance impact

By offloading these operations to the flash module instead of running them on the array controllers, IBM frees up processing power for client workloads. It’s the same philosophy they applied in previous FCM generations for anomaly detection, but taken a step further with integrated deduplication.

FlashSystem.ai: agentic AI in the array, the good and the caveats

FlashSystem.ai is the new data services layer powered by agentic AI (we also deploy AI agents in enterprise environments, by the way). According to IBM, it’s trained on tens of billions of telemetry data points and years of real-world operational data. The system executes thousands of automated decisions per day that previously required human oversight.

The most interesting capabilities:

  • Adapts to application behaviour in hours, not weeks like template-based systems
  • Recommends performance optimisations explaining its reasoning (this enables auditing of AI decisions, which is crucial for compliance)
  • Incorporates administrator feedback to refine recommendations over time
  • Intelligent workload placement with non-disruptive data mobility, including third-party arrays
  • Cuts documentation time for audits and compliance in half

And of course, the headline figure: 90% reduction in manual management effort. It’s a spectacular number, but it comes with a footnote that deserves a careful read. IBM measures it by comparing specific routine tasks (volume provisioning with protected copy and DR policies) on the new generation with FlashSystem.ai vs. the same generation without FlashSystem.ai. It’s an internal, lab-based comparison on selected operations.

Does that mean the 90% is made up? No. It means it’s the best-case scenario on specific tasks. In real-world operations, with their integrations, quirks and the natural entropy of any infrastructure, the benefit will be smaller. It will probably still be significant — automating repetitive tasks delivers real value — but don’t expect your storage admin to suddenly work one day a week.

There is something we find genuinely useful, though: explainable operational reasoning. The system doesn’t just do things — it explains why. For audits and compliance (which carry more weight every year in regulated industries), having an AI-generated log of operational decisions is a real advantage over the competition.

Ransomware detection in 60 seconds: the data and the fine print

Another eye-catching claim: the FCM5 detects ransomware in under a minute. Let’s take a closer look.

The system analyses every I/O operation directly in hardware, looking for anomalous patterns associated with malicious encryption. The detection model (version 3.3, released in Q4 2025) has 24 months of training with production telemetry and maintains a false positive rate below 1%, measured over 3 months.

Combined with Safeguarded Copy (copies with a logical air gap, immutable and hidden from any external connection), IBM claims it’s possible to recover from an attack in under a minute.

Now, the fine print that IBM puts in the footnotes:

  • The original sub-minute detection test was performed on a FlashSystem 5200 (previous generation) with an IBM proprietary ransomware simulator called WannaLaugh. Yes, that’s really what it’s called. Bonus points for the naming.
  • Detection targets the start of the encryption process, not the initial system intrusion.
  • A sufficiently sophisticated ransomware that encrypts slowly and mimics normal write patterns could potentially evade detection.

That said — and this is important — having hardware-level detection with less than 1% false positives is objectively good. Most ransomware detection solutions on the market operate at the file system or network level, with higher latencies and error rates. IBM operates one layer lower, directly at the storage I/O level. As an additional layer in a defence-in-depth strategy, it delivers real value that competitors can’t easily replicate because they don’t control their flash drive hardware.

The argument that doesn’t show up in datasheets: the supply chain

An angle that may be more relevant than any benchmark for many IT leaders in 2026: the storage supply chain crisis. The demand for capacity to train AI models is creating SSD shortages and price increases globally.

Werner was direct about it: IBM is better positioned than most competitors because it manufactures its own flash drives.

If you’re planning a capacity expansion over the next 12–18 months and facing 6-month lead times on standard SSDs, having a vendor that controls its own drive manufacturing is an argument that won’t appear in any Gartner comparison but could define a project.

Pure Storage, Dell, NetApp and the rest — so what now?

The enterprise all-flash array market is fiercely competitive. Let’s compare what truly sets the new FlashSystem apart:

| Aspect | IBM FlashSystem (new) | Pure Storage FlashArray | Dell PowerStore | NetApp AFF |
|---|---|---|---|---|
| Proprietary drives | ✅ FlashCore Module (QLC→TLC) | ✅ DirectFlash (150 TB) | ❌ Standard SSDs | ❌ Standard SSDs |
| AI-driven management | FlashSystem.ai (agentic) | Pure1 Meta | CloudIQ | BlueXP / AIOps |
| HW ransomware detection | ✅ On the flash drive | ❌ Software | ❌ Software | ❌ Software (ONTAP) |
| Third-party array support | ✅ 500+ vendors | ❌ Closed ecosystem | Partial | Partial (FabricPool) |
| Quantum-safe HW encryption | ✅ On the drive (FCM5) | ❌ | — | — |
| Licensing model | Traditional IBM | Evergreen (very transparent) | APEX / traditional | Keystone / traditional |

Where IBM clearly gains ground:

The FlashCore Module is a real advantage that’s hard to replicate. Controlling the flash drive design enables hardware-level functionality (ransomware detection, quantum-safe encryption, deduplication) that competitors can only do in software. Pure Storage also designs its own drives (DirectFlash), but as of today it doesn’t integrate ransomware detection or post-quantum encryption into the hardware.

Compatibility with over 500 third-party storage vendors through FlashSystem Grid is a smart move. In the real world, nobody has a homogeneous environment, and being able to move data non-disruptively between IBM and other manufacturers’ arrays solves a genuine consolidation and migration problem.

Where the competition pushes back:

  • Pure Storage consistently scores higher on user experience and its Evergreen model is hard to beat for licensing transparency. If price and procurement simplicity are your priorities, Pure remains a formidable rival.
  • NetApp has ONTAP, a storage operating system with incredible maturity in hybrid environments and a massive installed base. If you’re already in the NetApp ecosystem, migrating is hard to justify on new features alone.
  • Dell PowerStore competes well on price and has deep integration with the VMware ecosystem (now Broadcom), which remains the dominant hypervisor in many organisations.

In summary: IBM doesn’t sweep the competition, but with this generation it positions itself with solid technical arguments that go beyond marketing, especially on hardware-level security and multi-vendor flexibility.

IBM FlashCore Module 5 proprietary flash drive for new IBM FlashSystem with hardware ransomware detection

IBM FlashCore Module 5

Who should care?

After digesting all the information, the scenarios where the new FlashSystem fits best:

  • Mission-critical environments (banking, insurance, healthcare) where the combination of hardware-level ransomware detection, Safeguarded Copy and quantum-safe encryption adds security layers that the competition can’t match at the same level.
  • Organisations with heterogeneous infrastructure that need to consolidate without a total rip-and-replace. Compatibility with 500+ vendors and data mobility between arrays is a genuine argument.
  • Space-constrained data centres where the 5600’s density (2.4 PBe in 1U) can avoid physical expansions.
  • Businesses already running IBM Storage infrastructure looking for a natural evolution that integrates with their existing investments, especially when combined with SVC or other portfolio solutions.

And who should think twice? If your environment is 100% virtualised with VMware and all your management goes through vCenter, Dell’s integration may make more operational sense. If your priority is procurement simplicity and your team is small, Pure Storage’s Evergreen model is tough to beat.

Conclusions

IBM has done its homework with this launch. The combination of proprietary hardware (FCM5 with QLC performing like TLC), integrated agentic AI (FlashSystem.ai) and the multi-vendor compatibility strategy positions the new FlashSystem as a serious proposition.

Is it perfect? No. The marketing inflates some numbers (as everyone does, let’s be honest), the “autonomous storage” label is more aspirational than descriptive, and until we see independent benchmarks and field experience over months of operation, certain claims remain promises.

But if we put everything on a scale: the technology is solid, the architecture makes sense, and the direction is right. And the fact that IBM controls everything from the flash drive design to the AI layer through the SVC operating system gives it a stack coherence that not many manufacturers can offer.

General availability: 6 March 2026.

Evaluating the new IBM FlashSystem or planning a migration?

At SIXE we’ve been working with IBM FlashSystem arrays in real production environments for years. We design, deploy, migrate and provide IBM Storage technical support with no middlemen. If you need advice on how this new generation would fit into your infrastructure, we’d be happy to help.

Distributed Storage 2026. A (controversial) technical guide

Another year alive, another year of watching the distributed storage industry outdo itself in commercial creativity. If 2024 was the year everyone discovered they needed “storage for AI” (spoiler: it’s the same old storage, but with better marketing), 2025 has been the year MinIO decided to publicly immolate itself while the rest of the ecosystem continues to evolve apace.

Buckle up, there are curves ahead.

Distributed storage market trends and curves 2026

The Drama of the Year: MinIO Goes Into “Maintenance Mode” (Read: Abandonment Mode)

If you haven’t been following the MinIO soap opera, let me give you some context. MinIO was the open source object storage that everyone was deploying. Simple, fast, S3 compatible. You had it up and running in 15 minutes. It was the WordPress of object storage.

MinIO maintenance mode announcement Reddit screenshot December 2025

Well, in December 2025, a silent commit in the README changed everything: “This project is currently under maintenance and is not accepting new changes.” No announcement. No migration guide. No farewell. Just a commit and goodbye.

The community, predictably, went up in flames. One developer summed it up perfectly: “A silent README update just ended the era of MinIO as the default open-source S3 engine.”

But this didn’t come out of the blue. MinIO had been pursuing an “open source but don’t overdo it” strategy for years:

  • 2021: Silent switch from Apache 2.0 to AGPL v3 (no announcement, no PR, nothing)
  • 2022-2023: Aggressive campaigns against Nutanix and Weka for “license violations”
  • February 2025: Web console, bucket management and replication removed from the Community version
  • October 2025: Stop distributing Docker images
  • December 2025: Maintenance mode

The message is clear: if you want MinIO for real, pay up. Their enterprise AIStor product starts at €96,000/year for 400 TiB. For 1 PB, we are talking about more than €244,000/year.
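
As a sanity check on those figures, assuming linear per-TiB pricing and 1 PB ≈ 1024 TiB (both are our assumptions for the arithmetic, not MinIO's published terms):

```python
# Back-of-the-envelope check of the AIStor pricing quoted above
base_price_eur = 96_000               # per year, for 400 TiB
price_per_tib = base_price_eur / 400  # ~240 EUR per TiB per year
one_pb_in_tib = 1024                  # treating 1 PB as 1024 TiB (binary prefixes)

print(f"~EUR {price_per_tib * one_pb_in_tib:,.0f}/year for 1 PB")
```

That lands at roughly €245,760/year, which matches the "more than €244,000/year" figure.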

The lesson? In 2025, “Open Source” without Open Governance is worthless. MinIO was a company with an open source product, not a community project. The difference matters.

In the Meantime, Ceph Continues to Swim Peacefully

While MinIO was self-destructing, Ceph was celebrating its 20th stable release: Tentacle (v20.2.0), released in November 2025. The project accumulates more than 1 exabyte of storage deployed globally on more than 3,000 clusters.

Ceph Tentacle release mascot v20.2.0 November 2025

The most interesting thing about Tentacle is FastEC (Fast Erasure Coding), which improves the performance of small reads and writes by 2x to 3x. This finally makes erasure coding viable for workloads that are not purely cold data. With a 6+2 EC profile, you can now achieve approximately 50% of the performance of replication 3 while adding only 33% space overhead (versus 200% for three full copies).

For those of us who have been hearing “erasure coding is slow for production” for years, this is a real game changer.
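
The space math behind that claim, sketched in Python (note that the "33%" figure is EC's extra-space overhead, against 200% overhead for three-way replication):

```python
def raw_per_usable(k, m):
    """A k+m erasure-coded pool stores (k+m)/k raw bytes per usable byte."""
    return (k + m) / k

ec = raw_per_usable(6, 2)   # 6 data + 2 coding chunks -> 1.33x raw per usable byte
rep3 = 3.0                  # three full copies -> 3x raw per usable byte

print(f"6+2 EC space overhead: {ec - 1:.0%}")          # 33% extra raw space
print(f"Replica-3 space overhead: {rep3 - 1:.0%}")     # 200% extra raw space
print(f"EC raw footprint vs replica-3: {ec / rep3:.0%}")
```

Put differently, a 6+2 pool needs well under half the raw disk of replica-3 for the same usable capacity, which is why a 2-3x small-I/O speedup changes the economics so much.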

Other Tentacle news:

  • Integrated SMB support via Samba Manager
  • NVMe/TCP gateway groups with multi-namespace support
  • OAuth 2.0 authentication on the dashboard
  • CephFS case-insensitive directories (finally)
  • ISA-L replaces Jerasure (which was abandoned)

The Crimson OSD (based on Seastar for NVMe optimization) is still in technical preview. It is not production ready, but the roadmap is promising.

The Numbers That Matter

Bloomberg operates more than 100 PBs in Ceph clusters. They are a Diamond member of the Ceph Foundation and their Head of Storage Engineering is on the Governing Board. DigitalOcean has 54+ PBs in 37 production clusters. CERN maintains 50+ PBs in more than 10 clusters.

And here’s the interesting part: ZTE Corporation is among the top 3 global contributors to Ceph and number 1 in China. Its TECS CloveStorage product (based on Ceph) is deployed in more than 320 NFV projects worldwide, including China Mobile, izzi Telecom (Mexico) and Deutsche Telekom.

The telco sector is Ceph’s secret superpower. While many are still thinking of traditional appliances, telcos are running Ceph in production on a massive scale.

The Enterprise Ecosystem: Understanding What You’re Buying

This is where it gets interesting. And it’s worth understanding what’s behind each option.

Enterprise storage market bazaar comparison 2026

IBM Fusion: Two Flavors for Different Needs

IBM has two products under the Fusion brand, and it is important to understand the difference:

  • IBM Fusion HCI: Uses IBM Storage Scale ECE (the old GPFS/Spectrum Scale). Parallel file system with distributed erasure coding. Hyperconverged appliance that scales from 6 to 20 nodes.
  • IBM Fusion SDS: Uses OpenShift Data Foundation (ODF), which is based on Ceph packaged by Red Hat.

Storage Scale is a genuinely differentiated technology, especially for HPC. Its parallel file system architecture offers capabilities that Ceph simply doesn’t have: advanced metadata management, integrated tiering, AFM for federation… If you have high-performance computing, supercomputing or AI workloads at serious scale, Storage Scale has solid technical arguments to justify it.

IBM Fusion HCI performance claims are impressive: 90x acceleration on S3 queries with local caching, performance equivalent to Databricks Photon at 60% of the cost, etc.

Now, it’s always worth asking the question: how much of that performance is proprietary technology and how much is simply well-dimensioned hardware with an appropriate configuration? It’s not a criticism, it’s a legitimate question that any architect should ask before making a decision.

In the case of Fusion SDS, you’re buying Ceph with the added value of packaging, OpenShift integration, and IBM enterprise support. For many organizations, that has real value.

Red Hat Ceph Storage: The Enterprise Standard

Red Hat Ceph Storage continues to be the enterprise distribution of choice. They offer 36 months of production support plus optional 24 months of extended support. The product is robust and well integrated.

What you are really buying is: tested and certified packages, 24/7 enterprise support, predictable life cycles, and OpenShift integration.

Is it worth it? It depends on your context. If your organization needs a support contract to comply with compliance or simply to sleep easy, probably yes. And we’d be happy to help you with that. But if you have the technical team to operate Ceph upstream, it’s a decision that deserves analysis.

SUSE: A Lesson in Vendor Lock-in

Here’s a story that bears reflection: SUSE completely exited the Ceph enterprise market. Their SUSE Enterprise Storage (SES) product reached end of support in January 2023. After acquiring Rancher Labs in 2020, they pivoted to Longhorn for Kubernetes-native storage.

If you were an SES customer, you found yourself orphaned. Your options were to migrate to Red Hat Ceph Storage, Canonical Charmed Ceph, community Ceph, or find a specialized partner to help you with the transition.

This is not a criticism of SUSE; companies pivot according to their strategy. But it is a reminder that control over your infrastructure has value that doesn’t always show up in TCO.

Pure Storage and NetApp: The Appliance Approach

Pure Storage has created a category called “Unified Fast File and Object” (UFFO) with its FlashBlade family. Impressive hardware, consistent performance, polished user experience. Its FlashBlade//S R2 scales up to 60 PB per cluster with 150 TB DirectFlash Modules.

NetApp StorageGRID 12.0 focuses on AI with 20x throughput improvements via advanced caching and support for more than 600 billion objects in a single cluster.

Both are solid solutions that compete with Ceph RGW in S3-compatible object storage. The performance is excellent. The question each organization must ask itself is whether the premium justifies the vendor lock-in for their specific use case.

The Question No One Asks: What Are You Really Buying?

This is where I put on my thoughtful engineer’s hat.

Ceph upstream is extremely stable. It has 20 releases under its belt. The Ceph Foundation includes IBM, Red Hat, Bloomberg, DigitalOcean, OVHcloud and dozens more. Development is active, the community is strong, and documentation is extensive.

So when does it make sense to pay for enterprise distribution and when does it not?

It makes sense when: your organization requires a compliance support contract or internal policy; you don’t have the technical staff to operate Ceph and you don’t want to develop it; you need predictable and tested upgrade cycles; the cost of downtime is higher than the cost of the license; or you need specific integration with other vendor products.

It deserves further analysis when: the decision is based on “it’s what everyone does”; no one has really evaluated the alternatives; the main reason is that “the vendor told us that open source is not supported”; or you have a capable technical team but haven’t invested in its training.

The real challenge is knowledge. Ceph has a steep learning curve. Designing a cluster correctly, understanding CRUSH maps, tuning BlueStore, optimizing placement groups… this requires serious training and hands-on experience.
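
As one concrete example of that learning curve, here is the classic rule of thumb for sizing placement groups. Recent Ceph releases ship a pg_autoscaler that handles this for you, so treat this as a sketch of the reasoning rather than current best practice:

```python
def suggest_pg_num(num_osds, target_pgs_per_osd=100, pool_size=3):
    """Rule of thumb: aim for ~100 PGs per OSD, divide by the pool's
    replica count, and round up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / pool_size
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

print(suggest_pg_num(12))   # small cluster, replica 3  -> 512
print(suggest_pg_num(48))   # mid-size cluster, replica 3 -> 2048
```

Getting this wrong in either direction hurts: too few PGs and data distribution becomes uneven; too many and per-OSD memory and peering overhead climb. That is exactly the kind of trade-off that only training and hands-on experience make intuitive.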

But once you have that knowledge, you have options. You can choose an enterprise vendor judiciously, knowing exactly what value-add you are buying. Or you can operate upstream with specialized support. The key is to make an informed decision.

Demystifying Marketing Claims

One thing I always recommend is to read benchmarks and marketing claims with a constructive critical spirit.

“Our product is 90x faster” – Compared to what baseline? On what specific workload? With what competitor configuration?

“Performance equivalent to [competitor] at 60% of cost” – Does this include full TCO? Licensing, support, training, personnel?

“Enterprise-grade certified solution” – what exactly does that mean? Because Ceph upstream is also enterprise-grade at CERN, Bloomberg, and hundreds of telecoms.

I’m not saying the claims are false. I am saying that context matters. The reality is that distributed storage performance is highly dependent on correct cluster design (failure domains, placement groups), appropriate hardware (25/100 GbE networking, NVMe with power-loss protection), operating system configuration (IOMMU, CPU governors), and workload-specific tuning (osd_memory_target, BlueStore settings).

A well-designed Ceph cluster operated by experienced people can achieve impressive performance. The Clyso benchmark achieved 1 TiB/s with 68 Dell PowerEdge servers. IBM demonstrated over 450,000 IOPS on a 4-node Ceph cluster with 24 NVMe per node.

Sometimes, that “certified solution” seal you see on a datasheet is, at heart, free software with an expert configuration, well-dimensioned hardware, and a lot of testing. Which has value, but it’s good to know.

The Smart Move: Deciding With Information

After 15 years in this industry, my conclusion is that there is no single answer. What there is are informed decisions.

For some organizations, a packaged enterprise solution is exactly what they need: guaranteed support, predictable release cycles, validated integration, and the peace of mind of having an accountable vendor. IBM Fusion with Storage Scale is an excellent choice for HPC. Red Hat Ceph Storage is solid for anyone who wants Ceph with enterprise backing.

For other organizations, Ceph upstream with specialized support and training offers significant advantages:

  1. Foundation governance: Ceph is a project of the Linux Foundation with open governance. MinIO can’t happen.
  2. Active community: Thousands of contributors, regular releases, bugs fixed quickly.
  3. Flexibility: It’s your cluster, your configuration. If tomorrow you decide to change your support partner, you lose nothing.
  4. Transparent TCO: The software is free. You invest in appropriate hardware and knowledge.
  5. Version control: You update when it makes sense for you, not when the vendor puts out the next packaged release.

The common denominator in both cases is knowledge. Whether you buy an enterprise solution or deploy upstream, understanding Ceph in depth allows you to make better decisions, negotiate better with vendors, and solve problems faster.

Where to Get This Knowledge

Ceph is complex. But there are clear paths:

The official documentation is extensive and much improved. The Ceph blog has excellent technical deep-dives.

Cephalocon is the annual conference where you can learn from those who operate Ceph at full scale (Bloomberg, CERN, DigitalOcean).

Structured training with hands-on labs is the most efficient way to build real competence. You don’t learn Ceph by reading slides; you learn by breaking and fixing clusters.

L3 technical support from people who live Ceph every day gets you out of trouble when things get complicated in production. Because they will. At SIXE, we’ve spent years training technical teams in Ceph and providing L3 support to organizations of all types: those operating upstream, those using enterprise distributions, and those evaluating options. Our Ceph training program covers everything from basic architecture to advanced operations with real hands-on labs. And if you already have Ceph in production and need real technical support, our specialized technical support is designed for exactly that.

In addition, we have just launched a certification program with badges on Credly so that your team can demonstrate their competencies in a tangible way. Because in this industry, “knows Ceph” doesn’t mean the same thing to everyone.

Conclusions for 2026

  1. MinIO is dead for serious use. Look for alternatives. Ceph RGW, SeaweedFS, or even the OpenMaxIO fork if you are brave.
  2. Understand what you are buying. There are cases where a packaged enterprise solution brings real value. There are others where you are mainly paying for a logo and a configuration that you could replicate.
  3. Ceph upstream is mature and production-ready. Bloomberg, DigitalOcean, CERN and 320+ telecoms projects can’t all be wrong.
  4. The true cost of distributed storage is knowledge. Invest in quality training and support, regardless of which option you choose.
  5. Control over your infrastructure has value. Ask SUSE SES customers how it went when the vendor decided to pivot.
  6. The governance of the project matters as much as the technology. Open Foundation > company with open source product.

2026 looks interesting. FastEC is going to change the erasure coding equation. Integration with AI and ML will continue to push for more performance. And the vendor ecosystem will continue to evolve with proposals that deserve serious evaluation.

You decide. In the end, that is the only thing that matters.

Porting MariaDB to IBM AIX (Part 2): how AIX matches Linux

From “AIX is Slow” to “AIX Matches Linux” (with the right tools and code)

In Part 1, I wrestled with CMake, implemented a thread pool from scratch, and shipped a stable MariaDB 11.8.5 for AIX. The server passed 1,000 concurrent connections, 11 million queries, and zero memory leaks.

Then I ran a vector search benchmark.

AIX: 42 queries per second.
Linux (same hardware): 971 queries per second.

Twenty-three times slower. On identical IBM Power S924 hardware. Same MariaDB version. Same dataset.

This is the story of how we discovered there was no performance gap at all — just configuration mistakes and a suboptimal compiler.

Chapter 1: The Sinking Feeling

There’s a particular kind of despair that comes from seeing a 23x performance gap on identical hardware. It’s the “maybe I should have become a florist” kind of despair.

Let me set the scene: both machines are LPARs running on IBM Power S924 servers with POWER9 processors at 2750 MHz. Same MariaDB 11.8.5. Same test dataset — 100,000 vectors with 768 dimensions, using MariaDB’s MHNSW (Hierarchical Navigable Small World) index for vector search.

The benchmark was simple: find the 10 nearest neighbors to a query vector. The kind of operation that powers every AI-enhanced search feature you’ve ever used.

Linux did it in about 1 millisecond. AIX took 24 milliseconds.

My first instinct was denial. “The benchmark must be wrong.” It wasn’t. “Maybe the index is corrupted.” It wasn’t. “Perhaps the network is slow.” It was a local socket connection.

Time to dig in.

Chapter 2: The First 65x — Configuration Matters

The Cache That Forgot Everything

The first clue came from MariaDB’s profiler. Every single query was taking the same amount of time, whether it was the first or the hundredth. That’s not how caches work.

I checked MariaDB’s MHNSW configuration:

SHOW VARIABLES LIKE 'mhnsw%';
mhnsw_max_cache_size: 16777216

16 MB. Our vector graph needs about 300 MB to hold the HNSW structure in memory.

Here’s the kicker: when the cache fills up, MariaDB doesn’t evict old entries (no LRU). It throws everything away and starts fresh. Every. Single. Query.

Imagine a library where, when the shelves get full, the librarian burns all the books and orders new copies. For every patron.

Fix: mhnsw_max_cache_size = 4GB in the server configuration.

Result: 42 QPS → 112 QPS. 2.7x improvement from one config line.
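A minimal my.cnf sketch of that fix, assuming a standard `[mariadbd]` section; size the value to your own graph, not to this example:

```ini
[mariadbd]
# The MHNSW graph for 100k x 768-dim vectors needs ~300 MB; give it
# headroom so the cache is never discarded mid-workload (there is no LRU).
mhnsw_max_cache_size = 4G
```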

The Page Size Problem

AIX defaults to 4 KB memory pages. Linux on POWER uses 64 KB pages.

For MHNSW’s access pattern — pointer-chasing across a 300 MB graph — this matters enormously. With 4 KB pages, you need 16x more TLB (Translation Lookaside Buffer) entries to map the same amount of memory. TLB misses are expensive.

Think of it like navigating a city. With 4 KB pages, you need directions for every individual building. With 64 KB pages, you get directions by neighborhood. Much faster when you’re constantly jumping around.

Fix: Wrapper script that sets LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K

Result: 112 QPS → 208 QPS sequential, and 2,721 QPS with 12 parallel workers.
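A sketch of such a wrapper, with an illustrative binary path; the LDR_CNTRL value is the one from the text:

```shell
#!/bin/sh
# Hypothetical wrapper script: ask the AIX loader for 64K pages in the
# data, text, stack and shared-memory segments, then exec the server so
# the setting applies to its entire address space.
LDR_CNTRL="DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K"
export LDR_CNTRL
exec /opt/mariadb/bin/mariadbd "$@"
```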

The Scoreboard After Phase 1

| Configuration | Sequential QPS | With 12 Workers |
|---|---|---|
| Baseline | 42 | ~42 |
| + 4GB cache | 112 | n/a |
| + 64K pages | 208 | 2,721 |

65x improvement from two configuration changes. No code modifications.

But we were still 6x slower than Linux per-core. The investigation continued.

Chapter 3: The CPU vs Memory Stall Mystery

With configuration fixed, I pulled out the profiling tools. MariaDB has a built-in profiler that breaks down query time by phase.

AIX:

Sending data: 4.70ms total
  - CPU_user: 1.41ms
  - CPU_system: ~0ms
  - Stalls: 3.29ms (70% of total!)

Linux:

Sending data: 0.81ms total
  - CPU_user: 0.80ms
  - Stalls: ~0.01ms (1% of total)

The CPU execution time was 1.8x slower on AIX — explainable by compiler differences. But the memory stalls were 329x worse.

The Root Cause: Hypervisor Cache Invalidation

Here’s something that took me two days to figure out: in a shared LPAR (Logical Partition), the POWER hypervisor periodically preempts virtual processors to give time to other partitions. When it does this, it may invalidate L2/L3 cache lines.

MHNSW’s graph traversal is pointer-chasing across 300 MB of memory — literally the worst-case scenario for cache invalidation. You’re jumping from node to node, each in a different part of memory, and the hypervisor is periodically flushing your cache.

It’s like trying to read a book while someone keeps closing it and putting it back on the shelf.

The Linux system had dedicated processors. The AIX system was running shared. Not apples to apples.

But before I could test dedicated processors, I needed to fix the compiler problem.

Chapter 4: The Compiler Odyssey

Everything I Tried With GCC (And Why It Failed)

| Attempt | Result | Why |
|---|---|---|
| -flto (Link Time Optimization) | Impossible | GCC LTO requires ELF format; AIX uses XCOFF |
| -fprofile-generate (PGO) | Build fails | TOC-relative relocation assembler errors |
| -ffast-math | Breaks everything | IEEE float violations corrupt bloom filter hashing |
| -funroll-loops | Slower | Instruction cache bloat; POWER9 doesn’t like it |
| -finline-functions | Slower | Same I-cache problem |

The AIX Toolbox GCC is built without LTO support. It’s not a flag you forgot — it’s architecturally impossible because GCC’s LTO implementation requires ELF, and AIX uses XCOFF.

Ubuntu’s MariaDB packages use -flto=auto. That optimization simply doesn’t exist for AIX with GCC.

IBM Open XL: The Plot Twist

At this point, I’d spent three days trying to make GCC faster. Time to try something different.

IBM Open XL C/C++ 17.1.3 is IBM’s modern compiler, based on LLVM/Clang. It generates significantly better code for POWER9 than GCC.

Building MariaDB with Open XL required solving five different problems:

  1. Missing HTM header: Open XL doesn’t have GCC’s htmxlintrin.h. I created a stub.
  2. 32-bit AR by default: AIX tools default to 32-bit. Set OBJECT_MODE=64.
  3. Incompatible LLVM AR: Open XL’s AR couldn’t handle XCOFF. Used system /usr/bin/ar.
  4. OpenSSL conflicts: Used -DWITH_SSL=system to avoid bundled wolfSSL issues.
  5. Missing library paths: Explicit -L/opt/freeware/lib for the linker.
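Put together, a configure invocation along these lines is what those fixes imply. This is a hedged sketch, not the project's actual build script: the `ibm-clang`/`ibm-clang++_r` driver names and all paths reflect a typical Open XL 17.1.x / AIX Toolbox installation and may differ on yours.

```shell
# Hypothetical Open XL configure sketch, assembled from the five fixes
# above; verify paths against your own installation.
export OBJECT_MODE=64                   # 64-bit object mode for ar/ld
export CC=ibm-clang CXX=ibm-clang++_r   # Open XL 17.1.x compiler drivers

# Use the system ar (XCOFF-aware), the system OpenSSL, and the Toolbox
# library path; optimize for POWER9 as in the benchmarks.
cmake .. \
  -DCMAKE_AR=/usr/bin/ar \
  -DWITH_SSL=system \
  -DCMAKE_EXE_LINKER_FLAGS="-L/opt/freeware/lib" \
  -DCMAKE_C_FLAGS="-O3 -mcpu=power9" \
  -DCMAKE_CXX_FLAGS="-O3 -mcpu=power9"
```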

Then I ran the benchmark:

| Compiler | 30 Queries | Per-Query |
|---|---|---|
| GCC 13.3.0 | 0.190s | 6.3ms |
| Open XL 17.1.3 | 0.063s | 2.1ms |

Three times faster. Same source code. Same optimization flags (-O3 -mcpu=power9).

And here’s the bonus: GCC’s benchmark variance was 10-40% between runs. Open XL’s variance was under 2%. Virtually no jitter.

Why Such a Huge Difference?

Open XL (being LLVM-based) has:

  • Better instruction scheduling for POWER9’s out-of-order execution
  • Superior register allocation
  • More aggressive optimization passes

GCC’s POWER/XCOFF backend simply isn’t as mature. The AIX Toolbox GCC is functional, but it’s not optimized for performance-critical workloads.

Chapter 5: The LTO and PGO Dead Ends

Hope springs eternal. Maybe Open XL’s LTO and PGO would work?

LTO: The Irony

Open XL supports -flto=full on XCOFF. It actually builds! But…

Result: 27% slower than non-LTO Open XL.

Why? AIX shared libraries require an explicit export list (exports.exp). With LTO, CMake’s script saw ~27,000 symbols to export.

LTO’s main benefit is internalizing functions — marking them as local so they can be optimized away or inlined. When you’re forced to export 27,000 symbols, none of them can be internalized. The LTO overhead (larger intermediate files, slower link) remains, but the benefit disappears.

It’s like paying for a gym membership and then being told you can’t use any of the equipment.

PGO: The Profiles That Never Were

Profile-Guided Optimization sounded promising:

  1. Build with -fprofile-generate
  2. Run training workload
  3. Rebuild with -fprofile-use
  4. Enjoy faster code

Step 1 worked. Step 2… the profiles never appeared.

I manually linked the LLVM profiling runtime into the shared library. Still no profiles.

The root cause: LLVM’s profiling runtime uses atexit() or __attribute__((destructor)) to write profiles on exit. On AIX with XCOFF, shared library destructor semantics are different from ELF. The handler simply isn’t called reliably for complex multi-library setups like MariaDB.

Simple test cases work. Real applications don’t.

Chapter 6: The LPAR Revelation

Now I had a fast compiler. Time to test dedicated processors and eliminate the hypervisor cache invalidation issue.

The Test Matrix

| LPAR Config | GCC | Open XL |
|---|---|---|
| 12 shared vCPUs | 0.190s | 0.063s |
| 12 dedicated capped | 0.205s | 0.082s |
| 21 dedicated capped | 0.320s | 0.067s |

Wait. Shared is faster than dedicated?

The WoF Factor

POWER9 has a feature called Workload Optimized Frequency (WoF). In shared mode with low utilization, a single core can boost to ~3.8 GHz. Dedicated capped processors are locked at 2750 MHz.

For a single-threaded query, shared mode gets 38% more clock speed. That beats the cache invalidation penalty for this workload.

Think of it like choosing between a sports car on a highway with occasional traffic (shared) versus a truck with a reserved lane but a speed limit (dedicated capped).

The PowerVM Donating Mode Disaster

There’s a third option: dedicated processors in “Donating” mode, which donates idle cycles back to the shared pool.

| Mode | GCC | Open XL |
|---|---|---|
| Capped | 0.205s | 0.082s |
| Donating | 0.325s | 0.085s |

60% regression with GCC.

Every time a query bursts, there’s latency reclaiming the donated cycles. For bursty, single-threaded workloads like database queries, this is devastating.

Recommendation: Never use Donating mode for database workloads.

The 21-Core Sweet Spot

With 21 dedicated cores (versus Linux’s 24), Open XL achieved 0.067s — nearly matching the 0.063s from shared mode. The extra L3 cache from more cores compensates for the lack of WoF frequency boost.

Chapter 7: The Final Scoreboard (Plot Twist)

Fresh benchmarks on identical POWER9 hardware, January 2026:

| Platform | Cores | 30 Queries |
|---|---|---|
| Linux | 24 dedicated | 0.057s |
| AIX + Open XL | 12 shared | 0.063s |
| AIX + Open XL | 21 dedicated | 0.067s |
| AIX + GCC | 12 shared | 0.190s |
| AIX + GCC | 21 dedicated | 0.320s |

Wait. The AIX system has 21 cores vs Linux’s 24. That’s 12.5% fewer cores, which means 12.5% less L3 cache.

The measured “gap”? 10-18%.

That’s not a performance gap. That’s a hardware difference.

With IBM Open XL, AIX delivers identical per-core performance to Linux. The 23x gap we started with? It was never about AIX being slow. It was:

  1. A misconfigured cache (16MB instead of 4GB)
  2. Wrong page sizes (4KB instead of 64KB)
  3. The wrong compiler (GCC instead of Open XL)

The “AIX is slow” myth is dead.

The Complete Failure Museum

Science isn’t just about what works — it’s about documenting what doesn’t. Here’s our wall of “nice try, but no”:

| What We Tried | Result | Notes |
|---|---|---|
| mhnsw_max_cache_size = 4GB | 5x faster | Eliminates cache thrashing |
| LDR_CNTRL 64K pages | ~40% faster | Reduces TLB misses |
| MAP_ANON_64K mmap patch | ~8% faster | Minor TLB improvement |
| IBM Open XL 17.1.3 | 3x faster | Better POWER9 codegen |
| Shared LPAR (vs dedicated) | ~25% faster | WoF frequency boost |
| Open XL + LTO | 27% slower | AIX exports conflict |
| Open XL + PGO | Doesn’t work | Profiles not written |
| GCC LTO | Impossible | XCOFF not supported |
| GCC PGO | Build fails | TOC relocation errors |
| -ffast-math | Breaks MHNSW | Float corruption |
| -funroll-loops | Worse | I-cache bloat |
| POWER VSX bloom filter | 41% slower | No 64-bit vec multiply on P9 |
| Software prefetch | No effect | Hypervisor evicts prefetched data |
| DSCR tuning | Blocked | Hypervisor controls DSCR in shared LPAR |
| Donating mode | 60% regression | Never use for databases |

The VSX result is particularly interesting: we implemented a SIMD bloom filter using POWER’s vector extensions. It was 41% slower than scalar. POWER9 has no 64-bit vector multiply — you need vec_extract → scalar multiply → vec_insert for each lane, which is slower than letting the Out-of-Order engine handle a scalar loop.

What I Learned

1. Defaults Matter More Than You Think

A 16 MB cache default turned sub-millisecond queries into 24ms queries. That’s a 24x penalty from one misconfigured parameter.

When you’re porting software, question every default. What works on Linux might not work on your platform.

2. The “AIX is Slow” Myth Was Always a Toolchain Issue

With GCC, we were 3-4x slower than Linux. With Open XL, we match Linux per-core.

The platform was never slow. The default toolchain just wasn’t optimized for performance-critical workloads. Choose the right compiler.

3. Virtualization Has Hidden Trade-offs

Shared LPAR can be faster than dedicated for single-threaded workloads (WoF frequency boost). Dedicated is better for sustained multi-threaded throughput. Donating mode is a trap.

Know your workload. Choose your LPAR configuration accordingly.

4. Not Every Optimization Ports

LTO, PGO, and SIMD vectorization all failed on AIX for various reasons. The techniques that make Linux fast don’t always translate.

Sometimes the “obvious” optimization is the wrong choice. Measure everything.

5. Sometimes There’s No Gap At All

We spent days investigating a “performance gap” that turned out to be:

  • Configuration mistakes
  • Wrong compiler
  • Fewer cores on the test system

The lesson: verify your baselines. Make sure you’re comparing apples to apples before assuming there’s a problem to solve.

Recommendations

For AIX MariaDB Users

  1. Use the Open XL build (Release 3)
  2. Set mhnsw_max_cache_size to at least 4GB for vector search
  3. Keep shared LPAR for single-query latency
  4. Never use Donating mode for databases
  5. Use 64K pages via the LDR_CNTRL wrapper

For Upstream MariaDB

  1. Increase mhnsw_max_cache_size default — 16MB is far too small
  2. Implement LRU eviction — discarding the entire cache on overflow is brutal
  3. Don’t add POWER VSX bloom filter — scalar is faster on POWER9

What’s Next

The RPMs are published at aix.librepower.org. Release 2 includes the configuration fixes. Release 3 with Open XL build is also available.

Immediate priorities:

  • Commercial Open XL license: the evaluation license expires soon. We need to verify with IBM that using Open XL for this purpose is covered.
  • Native AIO implementation: AIX has POSIX AIO and Windows-compatible IOCP. Time to write the InnoDB backend.
  • Upstream MHNSW feedback: The default mhnsw_max_cache_size of 16MB is too small for real workloads; we’ll suggest a larger default.

For organizations already running mission-critical workloads on AIX — and there are many, from banks to airlines to healthcare systems — the option to also run modern, high-performance MariaDB opens new possibilities.

AIX matches Linux. The myth is dead. And MariaDB on AIX is ready for production.

TL;DR

  • Started with 23x performance gap (42 QPS vs 971 QPS)
  • Fixed cache config: 5x improvement
  • Fixed page size: ~40% more
  • Switched to IBM Open XL: 3x improvement over GCC
  • Used shared LPAR: ~25% faster than dedicated (WoF boost)
  • Final result: NO GAP — 10% difference = 12.5% fewer cores (21 vs 24)
  • AIX matches Linux per-core performance with Open XL
  • Open XL LTO: doesn’t help (27% slower)
  • Open XL PGO: doesn’t work (AIX XCOFF issue)
  • POWER VSX SIMD: 41% slower than scalar (no 64-bit vec multiply)
  • Donating mode: 60% regression — never use for databases
  • “AIX is slow for Open Source DBs” was always a toolchain myth

Questions? Ideas? Running MariaDB on AIX and want to share your experience?

This work is part of LibrePower – Unlocking IBM Power Systems through open source. Unmatched RAS. Superior TCO. Minimal footprint 🌍

LibrePower AIX project repository: gitlab.com/librepower/aix

Porting MariaDB to IBM AIX (Part 1): 3 Weeks of Engineering Pain

Bringing MariaDB to AIX, the Platform That Powers the World’s Most Critical Systems

There are decisions in life you make knowing full well they’ll cause you some pain. Getting married. Having children. Running a marathon. Porting MariaDB 11.8 to IBM AIX.

This (Part 1) is the story of the last one — and why I’d do it again in a heartbeat.

Chapter 1: “How Hard Can It Be?”

It all started with an innocent question during a team meeting: “Why don’t we have MariaDB on our AIX systems?”

Here’s the thing about AIX that people who’ve never worked with it don’t understand: AIX doesn’t mess around. When banks need five-nines uptime for their core banking systems, they run AIX. When airlines need reservation systems that cannot fail, they run AIX. When Oracle, Informix, or DB2 need to deliver absolutely brutal performance for mission-critical OLTP workloads, they run on AIX.

AIX isn’t trendy. AIX doesn’t have a cool mascot. AIX won’t be the subject of breathless tech blog posts about “disruption.” But when things absolutely, positively cannot fail — AIX is there, quietly doing its job while everyone else is busy rebooting their containers.

So why doesn’t MariaDB officially support AIX? Simple economics: the open source community has centered on Linux, and porting requires platform-specific expertise. MariaDB officially supports Linux, Windows, FreeBSD, macOS, and Solaris. AIX isn’t on the list — not because it’s a bad platform, but because no one had done the work yet.

At LibrePower, that’s exactly what we do.

My first mistake was saying out loud: “It’s probably just a matter of compiling it and adjusting a few things.”

Lesson #1: When someone says “just compile it” about software on AIX, they’re about to learn a lot about systems programming.

Chapter 2: CMake and the Three Unexpected Guests

Day one of compilation was… educational. CMake on AIX is like playing cards with someone who has a very different understanding of the rules — and expects you to figure them out yourself.

The Ghost Function Bug

AIX has an interesting characteristic: it declares functions in headers for compatibility even when those functions don’t actually exist at runtime. It’s like your GPS saying “turn right in 200 meters” but the street is a brick wall.

CMake does a CHECK_C_SOURCE_COMPILES to test if pthread_threadid_np() exists. The code compiles. CMake says “great, we have it!” The binary starts and… BOOM. Symbol not found.

Turns out pthread_threadid_np() is macOS-only. AIX declares it in headers because… well, I’m still not entirely sure. Maybe for some POSIX compatibility reason that made sense decades ago? Whatever the reason, GCC compiles it happily, and the linker doesn’t complain until runtime.

Same story with getthrid(), which is OpenBSD-specific.

The fix:

IF(NOT CMAKE_SYSTEM_NAME MATCHES "AIX")
  CHECK_C_SOURCE_COMPILES("..." HAVE_PTHREAD_THREADID_NP)
ELSE()
  SET(HAVE_PTHREAD_THREADID_NP 0)  # Trust but verify... okay, just verify
ENDIF()

poll.h: Hide and Seek

AIX has <sys/poll.h>. It’s right there. You can cat it. But CMake doesn’t detect it.

After three hours debugging a “POLLIN undeclared” error in viosocket.c, I discovered the solution was simply forcing the define:

cmake ... -DHAVE_SYS_POLL_H=1

Three hours. For one flag.

(To be fair, this is a CMake platform detection issue, not an AIX issue. CMake’s checks assume Linux-style header layouts.)

The Cursed Plugins

At 98% compilation — 98%! — the wsrep_info plugin exploded with undefined symbols. Because it depends on Galera. Which we’re not using. But CMake compiles it anyway.

Also S3 (requires Aria symbols), Mroonga (requires Groonga), and RocksDB (deeply tied to Linux-specific optimizations).

Final CMake configuration:

-DPLUGIN_MROONGA=NO -DPLUGIN_ROCKSDB=NO -DPLUGIN_SPIDER=NO 
-DPLUGIN_TOKUDB=NO -DPLUGIN_OQGRAPH=NO -DPLUGIN_S3=NO -DPLUGIN_WSREP_INFO=NO

It looks like surgical amputation, but it’s actually just trimming the fat. These plugins are edge cases that few deployments need.

Chapter 3: Thread Pool, or How I Learned to Stop Worrying and Love the Mutex

This is where things got interesting. And by “interesting” I mean “I nearly gave myself a permanent twitch.”

MariaDB has two connection handling modes:

  • one-thread-per-connection: One thread per client. Simple. Scales like a car going uphill.
  • pool-of-threads: A fixed pool of threads handles all connections. Elegant. Efficient. And not available on AIX.

Why? Because the thread pool requires platform-specific I/O multiplexing APIs:

| Platform | API | Status |
|---|---|---|
| Linux | epoll | Supported |
| FreeBSD/macOS | kqueue | Supported |
| Solaris | event ports | Supported |
| Windows | IOCP | Supported |
| AIX | pollset | Not supported (until now) |

So… how hard can implementing pollset support be?

(Editor’s note: At this point the author required a 20-minute break and a beverage)

The ONESHOT Problem

Linux epoll has a wonderful flag called EPOLLONESHOT. It guarantees that a file descriptor fires events only once until you explicitly re-arm it. This prevents two threads from processing the same connection simultaneously.

AIX pollset is level-triggered. Only level-triggered. No options. If data is available, it reports it. Again and again and again. Like a helpful colleague who keeps reminding you about that email you haven’t answered yet.

Eleven Versions of Increasing Wisdom

What followed were eleven iterations of code, each more elaborate than the last, trying to simulate ONESHOT behavior:

v1-v5 (The Age of Innocence)

I tried modifying event flags with PS_MOD. “If I change the event to 0, it’ll stop firing,” I thought. Spoiler: it didn’t stop firing.

v6-v7 (The State Machine Era)

“I know! I’ll maintain internal state and filter duplicate events.” The problem: there’s a time window between the kernel giving you the event and you updating your state. In that window, another thread can receive the same event.

v8-v9 (The Denial Phase)

“I’ll set the state to PENDING before processing.” It worked… sort of… until it didn’t.

v10 (Hope)

Finally found the solution: PS_DELETE + PS_ADD. When you receive an event, immediately delete the fd from the pollset. When you’re ready for more data, add it back.

// On receiving events: REMOVE
for (i = 0; i < ret; i++) {
    pctl.cmd = PS_DELETE;
    pctl.fd = native_events[i].fd;
    pollset_ctl(pollfd, &pctl, 1);
}

// When ready: ADD
pce.command = PS_ADD;
pollset_ctl_ext(pollfd, &pce, 1);

It worked! With -O2.

With -O3: segfault.

The Dark Night of the Soul (The -O3 Bug)

Picture my face. I have code working perfectly with -O2. I enable -O3 for production benchmarks and the server crashes with “Got packets out of order” or a segfault in CONNECT::create_thd().

I spent two days thinking it was a compiler bug. GCC 13.3.0 on AIX. I blamed the compiler. I blamed the linker. I blamed everything except my own code.

The problem was subtler: MariaDB has two concurrent code paths calling io_poll_wait on the same pollset:

  • The listener blocks with timeout=-1
  • The worker calls with timeout=0 for non-blocking checks

With -O2, the timing was such that these rarely collided. With -O3, the code was faster, collisions happened more often, and boom — race condition.

v11 (Enlightenment)

The fix was a dedicated mutex protecting both pollset_poll and all pollset_ctl operations:

static pthread_mutex_t pollset_mutex = PTHREAD_MUTEX_INITIALIZER;

int io_poll_wait(...) {
    pthread_mutex_lock(&pollset_mutex);
    ret = pollset_poll(pollfd, native_events, max_events, timeout);
    // ... process and delete events ...
    pthread_mutex_unlock(&pollset_mutex);
    return ret;
}

Yes, it serializes pollset access. Yes, that’s theoretically slower. But you know what’s even slower? A server that crashes.

The final v11 code passed 72 hours of stress testing with 1,000 concurrent connections. Zero crashes. Zero memory leaks. Zero “packets out of order.”

Chapter 4: The -blibpath Thing (Actually a Feature)

One genuine AIX characteristic: you need to explicitly specify the library path at link time with -Wl,-blibpath:/your/path. If you don’t, the binary won’t find libstdc++ even if it’s in the same directory.

At first this seems annoying. Then you realize: AIX prefers explicit, deterministic paths over implicit searches. In production environments where “it worked on my machine” isn’t acceptable, that’s a feature, not a bug.

Chapter 5: Stability — The Numbers That Matter

After all this work, where do we actually stand?

The RPM is published at aix.librepower.org and deployed on an IBM POWER9 system (12 cores, SMT-8). MariaDB 11.8.5 runs on AIX 7.3 with thread pool enabled. The server passed a brutal QA suite:

| Test | Result |
|---|---|
| 100 concurrent connections | Passed |
| 500 concurrent connections | Passed |
| 1,000 concurrent connections | Passed |
| 30 minutes sustained load | Passed |
| 11+ million queries | Passed |
| Memory leaks | Zero |

1,648,482,400 bytes of memory — constant across 30 minutes. Not a single byte of drift. The server ran for 39 minutes under continuous load and performed a clean shutdown.

It works. It’s stable. It’s production-ready for functionality.

Thread Pool Impact

The thread pool work delivered massive gains for concurrent workloads:

| Configuration | Mixed, 100 clients | vs. Baseline |
|---|---|---|
| Original -O2, one-thread-per-connection | 11.34s | (baseline) |
| -O3 + pool-of-threads v11 | 1.96s | 83% faster |

For high-concurrency OLTP workloads, this is the difference between “struggling” and “flying.”

What I Learned (So Far)

  1. CMake assumes Linux. On non-Linux systems, manually verify that feature detection is correct. False positives will bite you at runtime.
  2. Level-triggered I/O requires discipline. EPOLLONESHOT exists for a reason. If your system doesn’t have it, prepare to implement your own serialization.
  3. -O3 exposes latent bugs. If your code “works with -O2 but not -O3,” you have a race condition. The compiler is doing its job; the bug is yours.
  4. Mutexes are your friend. Yes, they have overhead. But you know what has more overhead? Debugging race conditions at 3 AM.
  5. AIX rewards deep understanding. It’s a system that doesn’t forgive shortcuts, but once you understand its conventions, it’s predictable and robust. There’s a reason banks still run it — and will continue to for the foreseeable future.
  6. The ecosystem matters. Projects like linux-compat from LibrePower make modern development viable on AIX. Contributing to that ecosystem benefits everyone.

What’s Next: The Performance Question

The server is stable. The thread pool works. But there’s a question hanging in the air that I haven’t answered yet:

How fast is it compared to Linux?

I ran a vector search benchmark — the kind of operation that powers AI-enhanced search features. MariaDB’s MHNSW (Hierarchical Navigable Small World) index, 100,000 vectors, 768 dimensions.

Linux on identical POWER9 hardware: 971 queries per second.

AIX with our new build: 42 queries per second.

Twenty-three times slower.

My heart sank. Three weeks of work, and we’re 23x slower than Linux? On identical hardware?

But here’s the thing about engineering: when numbers don’t make sense, there’s always a reason. And sometimes that reason turns out to be surprisingly good news.

In Part 2, I’ll cover:

  • How we discovered the 23x gap was mostly a configuration mistake
  • The compiler that changed everything
  • Why “AIX is slow” turned out to be a myth
  • The complete “Failure Museum” of optimizations that didn’t work

The RPMs are published at aix.librepower.org. The GCC build is stable and production-ready for functionality.

But the performance story? That’s where things get really interesting.

Part 2 coming soon.

TL;DR

  • MariaDB 11.8.5 now runs on AIX 7.3 with thread pool enabled
  • First-ever thread pool implementation for AIX using pollset (11 iterations to get ONESHOT simulation right)
  • Server is stable: 1,000 connections, 11M+ queries, zero memory leaks
  • Thread pool delivers 83% improvement for concurrent workloads
  • Initial vector search benchmark shows 23x gap vs Linux — but is that the whole story?
  • RPMs published at aix.librepower.org
  • Part 2 coming soon: The performance investigation

Questions? Ideas? Want to contribute to the AIX open source ecosystem?

This work is part of LibrePower – Unlocking IBM Power Systems through open source. Unmatched RAS. Superior TCO. Minimal footprint 🌍

LibrePower AIX project repository: gitlab.com/librepower/aix

🦙 LLMs on AIX: technical experimentation beyond the GPU hype

At LibrePower, we have published Llama-AIX: a proof-of-concept for running lightweight LLM inference directly on AIX, using only CPU and memory. No GPUs involved.

It’s worth clarifying from the start: this is technical fun and experimentation. It is not a product, not a commercial promise, and not an alternative to large GPU-accelerated AI platforms.

That said, there is a sound technical foundation behind this experiment.

Not all LLM use cases are GPU-bound.

In many common business scenarios in Power environments:

  • RAG (Retrieval Augmented Generation)
  • Questions about internal documentation
  • On-prem technical assistants
  • Semantic search over an organization’s own knowledge
  • Text analytics with strong latency and data-proximity requirements

the bottleneck is not always raw compute, but rather:

  • CPU
  • Memory bandwidth
  • Data access latency
  • Data locality

In these cases, small, well-bounded inference workloads can reasonably run without GPUs, especially when the model is not the center of the system but just another component.
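As an illustration of the scale involved, a CPU-only llama.cpp run along these lines is the kind of workload we mean; the model file name and thread count here are purely illustrative:

```shell
# Hypothetical CPU-only inference run: a small quantized model, a
# bounded prompt, and a thread count matched to the available cores.
./llama-cli -m small-model-q4_k_m.gguf \
    -t 8 -n 256 \
    -p "Summarize the attached change request in three bullet points."
```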

⚙️ CPU, MMA and low-power accelerators

The natural evolution of AI acceleration involves more than GPUs:

  • Increasingly vectorized CPUs
  • Extensions such as MMA (Matrix-Math Assist)
  • Purpose-built, low-power accelerators (such as the upcoming Spyre)
  • Closer integration with the operating system and the data stack

This type of acceleration is especially relevant in Power architectures, where the design prioritizes sustained throughput, consistency and reliability, not just FLOPS peaks.

🧩 Why AIX?

Running this on AIX is not a necessity; it is a conscious choice to:

  • Understand the real limits
  • Explore its technical feasibility
  • Dismantle simplistic assumptions
  • Learn how LLMs fit into existing Power systems

Many Power customers operate stable, fully amortized, mission-critical infrastructure, where moving data to the cloud or introducing GPUs is not always desirable or feasible.

🔍 What Llama-AIX is (and isn’t)

  • ✔ A technical PoC
  • ✔ An honest exploration
  • ✔ An engineering exercise
  • ✔ Open source
  • ✖ Not a benchmark
  • ✖ Not a complete AI platform
  • ✖ Not intended to compete with GPU solutions
  • ✖ Not “AI marketing”.

The idea is simple: look beyond the hype, understand the nuances and assess where LLMs bring real value in Power and AIX environments.

Purely out of technical curiosity.

And because experimenting is still a fundamental part of engineering.

💬 In what specific use case would an on-prem LLM in Power make sense to you?

#LibrePower #AIX #IBMPower #LLM #RAG #OpenSource #EnterpriseArchitecture #AIOnPrem

We launched LibrePower (and this is its Manifesto)

We want to unleash the full potential of IBM Power

We build community to grow the Power ecosystem – more solutions, more users, more value.

The most capable hardware you’re not yet taking full advantage of

IBM Power underpins mission-critical computing around the world. Banking transactions, flight reservations, hospital systems, SAP installations – workloads that can’t afford to fail run on Power.

This is no coincidence.

Power systems offer legendary reliability thanks to their RAS (Reliability, Availability, Serviceability) architecture that x86 simply cannot match. They run trouble-free for 10, 15 years or more. Their design – large caches, SMT8, extraordinary memory bandwidth – is designed to keep performance at scale in a sustained manner.

But there is an opportunity that most organizations are missing:

Power can do much more than what is usually asked of it.

The capacity is there. The potential is enormous. What has been missing is independent validation, momentum from the community and accessible tools that open the door to new use cases.

Exactly what LibrePower is building.


What is LibrePower?

LibrePower is a community initiative to extend what is possible in IBM Power – across the entire ecosystem:

  • AIX – The veteran Unix that runs the most demanding enterprise loads
  • IBM i – The integrated system that thousands of companies around the world run on
  • Linux on Power (ppc64le) – Ubuntu, Rocky, RHEL, SUSE, Fedora on the Power architecture

We are not here to replace anything. We come to add:

  • More certified solutions running on Power
  • More developers and administrators relying on the platform
  • More reasons to invest in Power – in both new and existing equipment

What we do

1. LibrePower Certified – independent validation

ISVs and open source projects need to know that their software works on Power. Buyers need confidence before deploying. IBM certification has its value, but there is room for independent community-driven validation.

LibrePower Certified offers three levels:

  • Works on Power – Compiles and runs correctly on ppc64le. Basic validation.
  • Optimized for Power – Tuned for SMT and VSX/MMA. Includes performance data.
  • 🏆 LibrePower Certified – Full validation + case study + ongoing support.

Free for open source projects. Premium levels for ISVs looking for deeper collaboration.

2. Open source repositories

We compile, package and distribute software that the Power community needs:

  • AIX (aix.librepower.org): modern CLI tools, security utilities, compatibility layers
  • ppc64le: container tools (AWX, Podman), development utilities, monitoring
  • IBM i: open source integration (under development)

Everything is free. Everything is documented. Everything is in GitLab.

3. Performance testing and optimization

Power hardware has unique features that most software does not take advantage of. We benchmark, identify opportunities and work with upstream projects to improve performance on Power.

When we find optimizations, we contribute them back. The entire ecosystem benefits.
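As a sketch of what that Power-specific tuning can look like: the GCC flags below are real options on ppc64le, but the exact values (POWER9 vs POWER10, which vector extensions apply) depend on your compiler version and target machine, so treat this as illustrative rather than a recommended build line.

```shell
# Build with Power-specific optimizations (GCC 10+ on ppc64le).
# -mcpu=power10 targets POWER10; use -mcpu=power9 for older machines.
# -mvsx enables the VSX vector unit; -mmma enables Matrix-Multiply Assist.
gcc -O3 -mcpu=power10 -mvsx -mmma -o bench bench.c

# Quick check of which target features the compiler enables for the local CPU:
gcc -mcpu=native -Q --help=target | grep -E 'mcpu|mvsx|mmma'
```

When an upstream project gains measurably from flags like these, that is exactly the kind of change worth contributing back.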

4. Building community

The Power world is fragmented. AIX administrators, Linux on Power teams, IBM i environments – all solving similar problems in isolation.

LibrePower connects these communities:

  • Cross-platform knowledge sharing
  • A collective voice amplified towards manufacturers and upstream projects
  • A growing network of Power expertise

5. Expanding the market

Every new solution validated in Power is one more reason for organizations to choose the platform. Every developer who learns Power is talent for the ecosystem. Every success story demonstrates value.

More solutions → more adoption → stronger ecosystem → more investment in Power.

This virtuous circle benefits everyone: IBM, partners, ISVs and users.

Why should you care?

If you manage Power systems:

  • Access tools and solutions you were missing
  • Join a community that understands your environment
  • Maximize the return on your hardware investment

If you are a developer:

  • Learn a platform with unique technical features
  • Contribute to projects with real impact in the enterprise world
  • Develop expertise in a high-value niche

If you are an ISV:

  • Get your software validated in Power
  • Connect with enterprise customers
  • Discover market opportunities in the Power ecosystem

If you are evaluating infrastructure:

  • Find out what’s really possible on Power beyond traditional workloads
  • Find independent validation of solutions
  • Connect with the community to learn about real experiences

What we are working on

AIX (aix.librepower.org)

  • ✅ Modern package manager (dnf/yum for AIX)
  • ✅ fzf – fuzzy search engine (Go binary compiled for AIX)
  • ✅ nano – modern editor
  • 2FA tools – Google Authenticator with QR codes
  • 🔄 Coming soon: ripgrep, yq, modern coreutils
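A sketch of how those tools are meant to be consumed once the repository is configured. The repo file location follows the AIX Toolbox dnf convention and the baseurl is a placeholder; check aix.librepower.org for the actual setup instructions.

```shell
# Illustrative only: the repo path and baseurl below are assumptions,
# not the published LibrePower configuration.
cat > /opt/freeware/etc/yum/yum.repos.d/librepower.repo <<'EOF'
[librepower-aix]
name=LibrePower AIX tools
baseurl=https://aix.librepower.org/repo/
enabled=1
gpgcheck=0
EOF

# Install the modern CLI tools listed above through the same dnf
# workflow a Linux administrator already knows:
dnf install -y fzf nano
```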

Linux ppc64le

  • 🔄 AWX – Ansible automation (full port in progress)
  • 🔄 Container Tools – Podman, Buildah, Skopeo
  • 🔄 Development tools – lazygit, k9s, modern CLIs

IBM i

  • 📋 Planning phase – assessing priorities with community input.

The vision

Imagine:

  • Every major open source project considers Power at the time of release
  • ISVs see Power as a strategic platform, not an afterthought
  • Organizations deploy new workloads on Power with confidence
  • A connected and growing community that powers the ecosystem

That’s The Power Renaissance.

It is not nostalgia for the past. It is not just extending the life of existing deployments.

It is the active expansion of what Power can do and of who uses it.


Join

LibrePower grows with the community, and everyone is welcome to participate.

Who is behind it?

LibrePower is an initiative of SIXE, an IBM and Canonical Business Partner with more than 20 years of experience in the Power ecosystem.

We have seen what Power can do. We’ve seen what’s missing. Now we build what should exist.

LibrePower – Unlocking the potential of Power Systems through open source software. Unmatched RAS. Superior TCO. Minimal footprint. 🌍

What is OpenUEM? The Open Source revolution for device management

In the world of system administration, we often encounter oversized, expensive and complex tools. However, when we analyze what people are most looking for in terms of efficient alternatives, the name OpenUEM starts to ring a bell.

From SIXE, as specialists in infrastructure and open source, we want to answer the most frequently asked questions about this technology and explain why we have opted for it.

OpenUEM

What is OpenUEM and how does it work?

OpenUEM (Unified Endpoint Management) is an open source solution designed to simplify the lives of IT administrators. Unlike traditional suites that charge per agent or device, OpenUEM offers a centralized platform for inventory, software deployment and remote management of equipment without abusive licensing costs.

It stands out for its simplicity and efficiency:

  1. The agent: A small program is installed on each endpoint device.

  2. The server: Collects the information in real time and executes actions on the endpoints.

  3. The web console: From a browser, the administrator can view the entire device fleet, install applications or connect remotely.

OpenUEM vs. other traditional tools

One of the most common questions is how this tool compares to the market giants. We have put together a list of pros and cons from SIXE’s perspective, so you can draw your own conclusions :)

  • In favor:

    • Cost: Being Open Source, you eliminate licensing costs. Ideal for SMBs and large deployments where the cost per agent skyrockets.

    • Privacy: It’s self-hosted. You control the data, not a third-party cloud.

    • Lightweight.

  • Against:

    • Being a younger tool, it may not (yet) have the infinite plugin ecosystem of solutions that have been on the market for 20 years. However, it more than covers 90% of the usual management needs.

How to integrate OpenUEM with your current IT infrastructure?

Integration is less traumatic than it seems. OpenUEM is designed to coexist with what you already have.

  • Software deployment: Integrates natively with repositories such as Winget (Windows) and Flatpak (Linux), using industry standards instead of closed proprietary formats.

  • Remote support: Incorporates proven technologies such as VNC, RDP and RustDesk so you can support remote employees without complex VPN configurations in many cases.

If you’re wondering how to set up OpenUEM to support employees remotely, the answer lies in its flexible architecture. The server can be deployed via Docker in minutes, allowing agents to report securely from any location with an internet connection.
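A sketch of what that Docker-based deployment looks like in practice. The image name, port and volume below are placeholders, not the real OpenUEM artifacts; the official OpenUEM repository ships the actual compose file to use.

```shell
# Hypothetical sketch: image name, port and volume are placeholders.
# Use the docker-compose setup from the official OpenUEM repository.
docker network create openuem

# Server component: collects agent reports and serves the web console.
docker run -d --name openuem-server \
  --network openuem \
  -p 443:443 \
  -v openuem-data:/var/lib/openuem \
  example/openuem-server:latest
```

Agents installed on the endpoints then point at the server’s address and start reporting inventory over the internet, which is what removes the need for a VPN in many remote-support scenarios.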

Who offers OpenUEM support and solutions for companies?

Although the software is free, companies need guarantees, support and a professional implementation. This is where we come in. At SIXE, we don’t just implement the tool; we offer the necessary business support so you can sleep easy. We know that integrating a new platform can raise questions about pricing or deployment models for small and medium-sized businesses. That’s why our approach is not to sell you a license (there aren’t even any), but to help you deploy, maintain and secure your device management infrastructure with OpenUEM.

Contact us!

If you are looking for a platform to manage your mobile and desktop devices that is transparent, auditable and cost-effective, OpenUEM may be a solution for you. Want to see how it would work in your environment? Take a look at our professional OpenUEM solution and find out how we can help you manage the control of your devices. For those who are more curious or want to play around with the tool on their own, we always recommend visiting the official OpenUEM repository.

SIXE