Ceph Object Storage vs IBM COS: Migration Guide (2026)

Object Storage · April 2026

Ceph object storage vs IBM COS: when to migrate, and which way.

Three realistic paths for enterprise object storage at petabyte scale — and how we reach the right recommendation in each case. Fifteen years of production deployments and three live client cases on the table.

April 2026 · 11 min read · Infrastructure · Open Source

In 2026, if you're running a multi-petabyte object storage deployment and thinking about the next five years, you have three realistic options: IBM Cloud Object Storage (the Cleversafe successor), upstream Ceph backed by a support partner, or commercially packaged Ceph — typically IBM Storage Ceph.

We prefer open source and say so upfront. But we've also recommended IBM COS to specific clients knowing it was the right call — and talked clients out of migrations that would have padded our invoice but complicated their operations without real gain. This article explains when and why, with real cases.

Comparison: IBM COS vs IBM Storage Ceph vs upstream Ceph — selection criteria for object storage migration
The landscape

The 2026 landscape, plainly

The on-premise object storage market has been reshuffling for three years. IBM has repositioned COS multiple times since acquiring Cleversafe in 2015: first as a standalone product, then pushed toward IBM Storage Ready Nodes, then folded into the "cyber vault" narrative inside the Storage Defender portfolio. Legacy Cleversafe customers — many running decade-old deployments on Cisco UCS hardware now at end of life — are asking what the next five years look like before IBM changes the message again.

Ceph, meanwhile, has done the opposite. It has consolidated. The current release, Tentacle (20.2.1, April 2026), closes a maturity cycle that started with Reef and Squid. Active contributors include CERN, DigitalOcean, Bloomberg, OVH, Clyso, Red Hat/IBM, and SUSE. It is hard to find an infrastructure open source project with more sustained momentum.

Between them sits IBM Storage Ceph: upstream Ceph packaged and commercially supported by IBM, the direct successor to Red Hat Ceph Storage. Technically the same Ceph. Commercially, a per-capacity subscription with a vendor tier-1 SLA. It exists because some clients' procurement policies mandate a named enterprise vendor, and bare upstream Ceph doesn't pass that filter — even if it is technically identical.

Three products, three business models, three distinct client profiles.

The options

The three options at a glance

IBM COS — IBM Cloud Object Storage (proprietary)
Cleversafe successor · ClevOS
Patented IDA (SecureSlice), closed three-tier architecture, certified hardware list. Strongest in advanced regulatory compliance environments.
License: IBM proprietary
Hardware: closed certified list
Protocols: object (S3 / Swift)
5-yr cost: high
Lock-in: high
Ops complexity: low

IBM Storage Ceph — Ceph + IBM
Red Hat Ceph successor · ppc64le
Upstream Ceph with IBM subscription. Same codebase, tier-1 contractual SLA. For clients who need a named vendor in the contract.
License: IBM subscription
Hardware: any x86 / ARM
Protocols: S3 · RBD · CephFS · NVMe-oF
5-yr cost: medium-high
Lock-in: medium
Ops complexity: medium

The right option depends entirely on each client's operational reality.

All three work. The differences that matter are not about what they do, but how they are operated and what they cost over five years. IBM Storage Ceph and IBM COS do not compete — they serve fundamentally different client profiles. For a deeper comparison of Ceph against Storage Scale, GPFS, or NFS, see our dedicated article: IBM Storage Ceph vs Storage Scale, GPFS, GFS2, NFS and SMB.

Our position

Why we prefer open source

It's not ideology. It's the result of seeing, project after project, that a client with a competent in-house team or a capable partner gets the same operational stability on upstream Ceph as on any commercial alternative — with significantly more freedom and at lower cost.

Proprietary lock-in is not just about hardware — it's about roadmap. If IBM repositions COS again — and it has happened multiple times since 2015 — the client watches the change from the sidelines. With Ceph, if your commercial distributor changes strategy or raises prices, you move to upstream or another distributor without migrating data. The portability is real, not marketing.

Community continuity is a guarantee no single vendor can match. A proprietary product depends ultimately on a spreadsheet at the vendor's headquarters. Ceph has enough institutional contributors that when one leaves — which has happened — the project continues. For infrastructure you plan to run for fifteen or twenty years, that matters.

Architectural versatility pays for itself. Object storage today, block tomorrow for virtualisation, file when needed, NVMe-oF when it becomes relevant. All on the same hardware, maintained by the same team. COS only does object well. Separating platforms by protocol doubles teams, procedures, and support contracts. For cases where Ceph runs as an NFS high-availability backend, we've documented the process: NFS high availability with Ceph Ganesha.

Operational transparency is its own kind of security. When something breaks in Ceph, you have the code. When something breaks in COS, you open a ticket and wait. For serious technical teams, the first is worth more than it appears in a feature comparison.

The important nuance

Open source is not free. It is different. What you save in licensing you spend in team hours — in-house or contracted. If you have neither the team nor a partner acting as its extension, the equation can reverse. That's why the operational question matters as much as the philosophical one: who operates this day to day?

Technical honesty

When IBM COS is the right answer

If we were open source absolutists we'd just be selling hype — and there's enough of that in this market already. COS is the correct choice for a fairly specific client profile.

Small operational teams with no deep SDS skills and no budget to hire them or outsource continuously. Ceph's learning curve is real. If the organisation can't absorb it, a packaged product like COS reduces the operational problem surface.

Regulated sectors with very specific compliance requirements — audited WORM, SEC 17a-4 retention, Compliance Enabled Vaults, NENR. IBM's ecosystem is very mature here and audits move faster when the entire stack is from one vendor with existing certifications.

Corporate "single throat to choke" policy with explicit preference for vendor tier-1. Some organisations — conservative banking, public sector, defence — where the CISO won't accept an architecture without a contractual SLA. Arguing with that policy from outside is a waste of time; the right move is helping the client choose the packaged product that fits best.

IBM ecosystem already deployed. If the client already has Spectrum Protect, Storage Defender, Fusion, Power, or Z, consolidating object storage within the same vendor makes operational and commercial sense.

Very large scale (high petabytes or exabytes) with predictable, stable workloads, where the operational simplicity of a mature product offsets the licensing cost. We've seen clients with more than an exabyte under IBM support for whom migration would be a three-year project worth tens of millions; in those cases the answer is to stay and optimise.

What doesn't justify staying on COS

Inertia, uninformed fear of open source, or taking the annual licensing line as a given without questioning it. Those we always question.

Most of the market

When upstream Ceph with a good partner is the answer

This is the scenario where we believe most of the market sits — even if it doesn't always know it.

Profiles where upstream Ceph wins clearly:

  • Client with a competent technical team in Linux and infrastructure, or willing to engage a continuous support partner.
  • Medium to large scale, from hundreds of TB to tens of PB, where commercial subscription starts to hurt the budget.
  • Need or intent to unify object, block, and file storage on the same platform.
  • Hardware refresh underway with no appetite for tying to a single vendor's certified list.
  • Native Kubernetes integration via Rook, if a cloud-native platform is on the roadmap.
  • A preference, simply, for being able to see what's under the hood.

Here we need to address a myth that has circulated for years: that Ceph is hard. It's half true. Ceph is complex — as any serious distributed system is — but it's not chaotic or unstable. The difference between a Ceph cluster that causes problems and one that runs for years without incidents is not in the software. It's in deployment design, placement group and balancer tuning, coherent hardware selection, monitoring, and having someone experienced who knows what to do when something unusual appears in ceph health detail. We have a dedicated article on the most common Ceph error and how to fix it.
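To make "having someone who knows what to do" concrete: the first thing an experienced operator automates is triage over `ceph status --format json`, surfacing the active health check names before anyone reads raw logs. A minimal sketch; the JSON shape below matches recent Ceph releases, but verify it against your own cluster's output before relying on it.

```python
import json

def active_checks(status_json: str) -> list[str]:
    """Return the sorted names of active health checks from the output
    of `ceph status --format json` (structure as in recent Ceph
    releases; verify against your cluster)."""
    doc = json.loads(status_json)
    checks = doc.get("health", {}).get("checks", {})
    return sorted(checks)

# Example payload in the shape Ceph emits when a PG is degraded
sample = json.dumps({
    "health": {
        "status": "HEALTH_WARN",
        "checks": {
            "PG_DEGRADED": {
                "severity": "HEALTH_WARN",
                "summary": {"message": "Degraded data redundancy"},
            }
        },
    }
})
print(active_checks(sample))  # ['PG_DEGRADED']
```

A list like this feeds naturally into alerting: a known check name routes to a runbook, an unknown one pages a human.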

The problem is not Ceph. The problem is deploying Ceph without expertise. That's a problem for any complex infrastructure, not a product defect.

The honest question. Not "can I handle Ceph on my own?" — but "do I have someone, in-house or contracted, who has my back?" If yes, upstream Ceph delivers the best cost-to-result ratio in the market. If no, find that someone before signing anything.

A well-operated Ceph cluster on upstream with a competent partner performs just as well as one under an enterprise subscription. The real difference is who picks up the phone at three in the morning. If you're evaluating Ceph against lighter object storage alternatives, our Ceph vs MinIO 2026 article covers that in detail.

The middle option

IBM Storage Ceph: the middle option

We'll be more direct here, because this product gets written about with surprisingly little clarity.

IBM Storage Ceph is, technically, Ceph. The same Ceph you download from the project website. Packaged, tested, integrated with IBM-specific tooling, commercially supported with an SLA, and certified in several regulated environments. That is what you pay for. Technically you get nothing you couldn't have with upstream.

When it makes sense to pay for it:

  • Public or private procurement contracts that require a tier-1 vendor with contractual support, with no room for negotiation.
  • Organisations where internal purchasing policy mandates enterprise support without exception, and there is no way to qualify an external partner as a substitute.
  • Clients who already have an IBM ELA where adding Storage Ceph to the package is reasonable against list price.
  • Sectors with audits where the manufacturer's name on the invoice shortens the process.

When it's not worth it: in practically every other case. If your compliance doesn't require it and you have a decent partner, paying a subscription for upstream is an avoidable overhead. At tens of petabytes scale, the difference between a commercial subscription and a partner supporting upstream can be hundreds of thousands of euros per year. At exabyte scale, it moves to millions. For most clients, that money is better reinvested in team, hardware, or anything else.

Plain summary. IBM COS = complete product, single vendor, high cost, high lock-in, low operational complexity. IBM Storage Ceph = community Ceph with an IBM invoice, contractual reassurance, medium-high cost. Upstream Ceph with a partner = maximum control, low cost, requires maturity — in-house or borrowed.

If your reality pushes you toward the first or second, we'll be there to help you operate it well. But most clients we work with discover, after an honest assessment, that the third fits them better than they thought.

Real cases

Three real client cases

Anonymised, because NDAs are NDAs. The lesson is always the same: the right question is not "which is better in the abstract" but "which fits this specific operational reality".

Case A · European telco operator · 50 PB · IBM COS → upstream Ceph
Cisco UCS M4 hardware at end of life; a refresh on IBM-certified hardware was prohibitively expensive. COS licensing cost had been questioned internally for years. Strategic intent to consolidate object and block on a single stack for the internal Kubernetes platform. 18-month phased migration with dual-running for critical data. Outcome: significantly reduced total operational cost, a fully autonomous client team, and SIXE as second-level support. Three years on, the cluster remains stable.
Case B · Regulated financial institution · 8 PB · Stayed on IBM COS
Called us to evaluate a potential migration motivated by licensing cost. We ran the full assessment. Our recommendation was not to migrate: a small operational team with no budget or culture to absorb Ceph autonomously, SEC 17a-4 Compliance Enabled Vault requirements deeply embedded in annual audits, and legitimately high aversion to operational risk. We earned less than a migration would have generated — and gained a long-term client. We continued working with them, optimising the existing deployment and planning the next hardware refresh.
Case C · Public sector organisation · 3 PB · Self-managed Ceph → IBM Storage Ceph
Ceph deployed internally without sufficient expertise: an unstable cluster and recurring incidents that had worn out the operational team. A new tender requirement mandated tier-1 vendor contractual support — upstream was off the table. We accompanied them through the migration to IBM Storage Ceph, environment stabilisation, and team training. They ended with a healthy cluster and peace of mind. Not the cheapest path, but the only viable one given the external constraints.
What nobody tells you

What most comparisons don't tell you

Four things that never appear in vendor whitepapers and that we have seen trip up many technical teams.

Migrating at petabyte scale is not copying data

It's migrating configuration: lifecycle policies, retention, legal holds, ACLs, CORS, bucket policies, versioning, event notifications, tagging, replication. You migrate context as much as bytes. A poorly scoped migration project discovers this halfway through and finds its timeline has doubled.
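A sketch of what "migrating context" means in practice: an inventory diff that flags, per bucket, which configuration aspects exist on the source but are missing or different on the target. The aspect names and data shapes below are illustrative, not any vendor's API; a real tool would populate them from each platform's admin interface.

```python
# Aspect names and data shapes are illustrative, not any vendor's API.
ASPECTS = ["lifecycle", "retention", "acl", "cors", "policy",
           "versioning", "notifications", "tagging", "replication"]

def migration_gaps(source: dict, target: dict) -> dict:
    """Per bucket, list the configuration aspects present on the source
    but missing or different on the target."""
    gaps = {}
    for bucket, cfg in source.items():
        tgt = target.get(bucket, {})
        missing = [a for a in ASPECTS
                   if cfg.get(a) is not None and tgt.get(a) != cfg.get(a)]
        if missing:
            gaps[bucket] = missing
    return gaps

src = {"invoices": {"versioning": "Enabled", "retention": {"days": 2555}}}
dst = {"invoices": {"versioning": "Enabled"}}   # retention not carried over
print(migration_gaps(src, dst))  # {'invoices': ['retention']}
```

Running a diff like this per bucket, before the first byte moves, is what keeps the timeline from doubling halfway through.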

The S3 dialect is not uniform

Between AWS S3, Ceph RGW, and IBM COS there are subtle differences in headers, LIST behaviour with large object counts, multipart upload edge cases, and versioning semantics. Client applications sometimes need adjustment. Test — don't assume.
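One concrete example of dialect drift is LIST pagination. A defensive client normalises the continuation token across the ListObjects v1 and v2 response shapes; the field names below mirror the S3 XML fields as most SDKs expose them, but treat this as a sketch and test against each endpoint you actually target.

```python
def next_page_token(resp: dict):
    """Normalise LIST pagination across S3 dialects: ListObjectsV2 uses
    NextContinuationToken; v1 returns NextMarker only when a delimiter
    was set, otherwise the last key acts as the next marker."""
    if not resp.get("IsTruncated"):
        return None
    if "NextContinuationToken" in resp:          # ListObjectsV2
        return resp["NextContinuationToken"]
    if "NextMarker" in resp:                     # v1 with delimiter
        return resp["NextMarker"]
    contents = resp.get("Contents", [])          # v1 without delimiter
    return contents[-1]["Key"] if contents else None
```

Small shims like this are cheap; discovering mid-migration that an application only handles one dialect is not.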

Data protection philosophy changes between products

COS's IDA, Ceph's erasure coding, and traditional triple replication are not interchangeable in terms of durability guarantees or the failure profiles they tolerate. Translating a COS IDA 10/8/7 to a Ceph erasure coding profile requires judgment, not arithmetic.
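The arithmetic part is the easy half, and it can be sketched by treating both schemes generically as "n fragments, k needed to decode" (a COS IDA additionally distinguishes read and write thresholds, which this deliberately ignores). Everything the arithmetic leaves out, such as failure domains, repair traffic, and rebuild behaviour, is where the judgment lives.

```python
def expansion_factor(total_fragments: int, decode_threshold: int) -> float:
    """Raw-to-usable expansion for any threshold scheme: an IDA of
    width n decoding from k slices, or Ceph EC with k data + m coding
    chunks (n = k + m). Fragment losses tolerated = n - k."""
    return total_fragments / decode_threshold

# Width-10 IDA decoding from 7 slices vs a Ceph EC 8+3 profile:
ida = expansion_factor(10, 7)      # ~1.43x raw, tolerates 3 slice losses
ec = expansion_factor(8 + 3, 8)    # ~1.38x raw, tolerates 3 chunk losses
```

Two profiles with the same loss tolerance can still behave very differently under correlated failures, which is exactly why translation is judgment, not a formula.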

Day-to-day operations are radically different

In COS you diagnose with storagectl list and the Manager administration shell. In Ceph with ceph -s, ceph osd tree, ceph health detail, placement groups, OSDs, CRUSH maps. Retraining a team takes six to twelve months of effective transition. Budget for it — it cannot be a project footnote.

How we work

How we work at SIXE

The approach is straightforward and has been working for years. First an assessment: we review the current architecture, actual workloads, the operational team's profile, regulatory constraints, the three-to-five-year budget, and the technically viable options. The output is a reasoned recommendation with alternatives — and sometimes it is "stay where you are". We have said that more than once.

Then a design, if there is migration or substantial change. Target architecture, phased plan, operational windows, risk matrix, runbooks. No two migrations are alike.

Then execution. Phased migration with dual-running where possible, data validation, functional QA with client applications, post-cutover tuning.

And finally handover with mentoring to the client team, plus ongoing Ceph technical support if they want us at the other end of the line going forward. Many clients prefer this model — SIXE as a team extension — over a commercial subscription. It is exactly what makes upstream Ceph viable in serious production environments. For teams that want to build internal capability, we offer a Ceph administration course and a practical IBM Storage Ceph course.

Our team diagnoses a DONT-START-DAEMON on a ClevOS slicestor with the same ease as a placement group inactive+incomplete on Ceph. We are not an "IBM partner" or a "Ceph partner". We are an object storage partner, and we know all three options well enough to recommend whichever one actually fits.


Running object storage that needs a review?

An honest technical conversation. No sales pitch.

Tell us about your current deployment — capacity, workloads, team, regulatory constraints. We'll tell you what makes sense. If the answer is "stay where you are", we'll say that too.

IBM DataStage: What It Is, How It Works and Why It Still Matters in 2026

Data Integration · April 2026

What is IBM DataStage and why it remains the enterprise ETL benchmark in 2026.

While the data integration market fragments between cloud-native tools, notebooks and trendy orchestrators, DataStage has been moving the data that matters for three decades — banking, healthcare, telecoms and government. Here's what it is, how it works and when it makes sense in 2026.

April 2026 · 9 min read

If you work with data in a large organisation, you've probably heard of DataStage — even if you're not entirely sure what it does beyond "that IBM thing for moving data". It's quite a bit more than that.

IBM DataStage is the ETL (Extract, Transform, Load) tool from the IBM InfoSphere suite. It has been in production for over 25 years, survived multiple acquisitions and rebrandings, and in 2026 it remains one of the centrepieces of IBM's data ecosystem — now also available as a service within IBM Cloud Pak for Data.

The fundamentals

What is DataStage and where does it come from

IBM DataStage is a data integration tool that lets you design, deploy and run pipelines that extract information from multiple sources, transform it according to business rules, and load it into target systems. In data engineering, this is known as ETL — Extract, Transform, Load — and DataStage is one of the most established and robust implementations on the market.

The history is worth knowing because it explains a lot about what DataStage is today. It started in the 1990s as a product from Ardent Software, which merged with Informix; when IBM bought Informix's database business in 2001, the data integration tools were spun off into Ascential Software, which IBM in turn acquired in 2005. Since then DataStage has been part of the IBM InfoSphere family — a suite of tools for data integration, quality and governance.

What sets DataStage apart from a Python script or an Apache Airflow flow isn't what it does (moving data from A to B) but how it does it: with a visual job design interface, a distributed parallel processing engine, native connectors for virtually any database or system, and an integrated metadata system that traces where every piece of data came from and what transformations it underwent.

In plain English: DataStage is what organisations use when they move millions of records nightly between dozens of systems, and they need it to work every time, be auditable, and not require a 15-person team to maintain.

The architecture

How it works: the parallel processing engine

The core component of DataStage is its Parallel Framework. Unlike ETL tools that process data sequentially — one record after another — DataStage distributes work across multiple partitions running simultaneously. It's the same idea as MapReduce or Spark, but implemented before those technologies existed.

┌──────────────────────────────────────────────────────┐
│               DataStage Parallel Engine              │
└───────┬──────────────┬────────────────┬──────────────┘
   ┌────▼────┐   ┌─────▼──────┐   ┌─────▼───────┐
   │ Extract │   │ Transform  │   │ Load        │
   │ Db2     │   │ Rules      │   │ DWH         │
   │ Oracle  │   │ Cleansing  │   │ Data Lake   │
   │ SAP     │   │ Enrichment │   │ Cloud APIs  │
   │ CSV     │   │ → parallel │   │ → batch/RT  │
   │         │   │ → N nodes  │   │             │
   └─────────┘   └────────────┘   └─────────────┘

The clever part is that the developer doesn't have to think about parallelism. You design the job as if it were sequential — dragging stages in the Designer — and the engine decides how to partition the data, how many nodes to use and how to redistribute the load. You can force manual partitioning when you need fine control, but most of the time the engine handles it.
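The core idea is easy to illustrate: key-based hash partitioning, which sends records with the same key to the same partition so partitions can be processed independently and in parallel. The sketch below shows the concept only; DataStage's actual partitioners are configurable per stage (hash, round-robin, range, and others) and implemented inside the engine, not in user code.

```python
import hashlib

def hash_partition(records, key, n_partitions):
    """Assign each record to a partition by hashing its key, so equal
    keys always land together and partitions can run in parallel.
    Illustrative only -- not DataStage's implementation."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        digest = hashlib.md5(str(rec[key]).encode()).hexdigest()
        partitions[int(digest, 16) % n_partitions].append(rec)
    return partitions

rows = [{"account": i % 5, "amount": i} for i in range(100)]
parts = hash_partition(rows, "account", 4)
assert sum(len(p) for p in parts) == len(rows)   # nothing lost
```

The property that matters for correctness is the co-location guarantee: an aggregation by account can run per partition without a shuffle, because every record for a given account is in exactly one partition.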

The stack components

  • DataStage Designer. The visual interface where jobs are designed. You drag stages (sources, transformations, targets), connect them with links, define column metadata and compile. Behind the scenes it generates OSH (Orchestrate Shell), which is the language the parallel engine executes.
  • DataStage Director. The monitoring console. You see which jobs are running, which have failed, logs, performance stats, and can relaunch or abort executions.
  • Information Server. The wrapper layer: security, shared metadata with other InfoSphere tools (QualityStage, Information Analyzer, IGC), REST API for automation, and the central job definitions repository.
  • Connectors. DataStage has native connectors for a huge catalogue: Db2, Oracle, SQL Server, PostgreSQL, MySQL, SAP, Teradata, Snowflake, Amazon Redshift, S3, Azure Blob, Kafka, flat files, XML, JSON, REST APIs — the list goes on. These are not generic ODBC wrappers — they are optimised connectors with bulk load support, pushdown optimisation and fine-grained session control.
Use cases

What DataStage is used for in practice

The real question isn't "what can it do" (moving any data between any systems) but "where does it make sense versus cheaper or more modern alternatives". Because DataStage isn't the simplest tool on the market, and the licensing cost is non-trivial. What justifies that cost are very specific scenarios.

Data warehouse loading

This is the classic case and it's still the most common. Organisations with a DWH — whether IBM Db2 Warehouse, Teradata, Snowflake or Redshift — that need clean, transformed, enriched data loaded nightly (or hourly) from dozens of source systems. DataStage shines here because of parallel processing: where a Python script takes hours, a well-designed DataStage job processes the same volume in minutes.

Data migration

When an organisation changes its ERP, core banking system or hospital information system, there's a data migration project that can last months. DataStage maps old schemas to new ones, applies conversion rules, validates referential integrity and executes massive loads with rollback. Metadata lineage is crucial here — you need to prove to audit that every migrated record has a known origin.

Real-time integration with CDC

With IBM CDC (Change Data Capture) integrated, DataStage can replicate database changes with millisecond latencies. This is used where operational data needs to be synchronised between systems in near-real-time — for example between a core banking platform and an anti-money-laundering system.

Data quality and governance

DataStage integrates natively with the rest of the InfoSphere suite: QualityStage for cleansing, Information Analyzer for profiling, and IBM Knowledge Catalog for governance and lineage. This means data governance projects that require end-to-end traceability have everything under one umbrella.

Where DataStage fits best

Banking, insurance, telecoms, healthcare, government and utilities. Industries with massive volumes, strict regulation (NIS2, PCI DSS, GDPR), and IBM Power environments where DataStage runs natively. If your infrastructure is already IBM — Power11, AIX, Db2 — DataStage is a natural fit.

The evolution

DataStage on Cloud Pak for Data: the 2025-2026 evolution

The recent history of DataStage has one clear protagonist: IBM Cloud Pak for Data. It's IBM's unified data platform built on Red Hat OpenShift, grouping all data services (DataStage, Watson Studio, Knowledge Catalog, Db2, etc.) under a common interface.

The most significant change came in June 2025 with Cloud Pak for Data version 5.2: DataStage is now available on OpenShift on IBM Power (ppc64le). This means organisations with Power servers that previously ran DataStage on the classic InfoSphere stack can now containerise it and manage it with the same orchestration as the rest of their cloud-native workloads.

The current version — Cloud Pak for Data 5.3 — brings DataStage with full ETL and ELT support, remote execution, and the new DataStage Flow Designer integrated into the Cloud Pak for Data web UI.

A note on security. In February and March 2026, IBM published several security patches for DataStage on Cloud Pak for Data 5.1.2 to 5.3.0, including command injection and sensitive information leakage vulnerabilities. If you're running DataStage on Cloud Pak for Data, make sure you're on version 5.3.1 or later.
The competitive landscape

DataStage vs the alternatives in 2026

It would be dishonest to discuss DataStage without acknowledging that the data integration market in 2026 looks very different from 2015. There are serious alternatives, and the decision depends heavily on context.

Tool | Model | Strong in | Weak in
IBM DataStage | IBM licence | Parallel processing, IBM environments, regulation | Cost, learning curve, closed ecosystem
Informatica IDMC | SaaS / on-prem | Market share, connector catalogue | Even more expensive than DataStage
Apache Spark / dbt | Open source | Cloud-native, flexibility, community | Not turnkey ETL, requires engineering
Talend (Qlik) | Commercial | Ease of use, open source core | Acquired by Qlik in 2023, uncertain roadmap
Azure Data Factory | SaaS Azure | Native Azure integration | Cloud lock-in, limited outside Azure
AWS Glue | SaaS AWS | Serverless, low cost for small volumes | Cloud lock-in, limited outside AWS

When does DataStage make sense? When you already have investment in the IBM ecosystem (Power, Db2, InfoSphere), when you need on-premise parallel processing at volumes others can't handle, when regulation requires end-to-end metadata lineage, or when your team already knows DataStage and retraining costs exceed the licence.

When doesn't it? When your stack is pure cloud-native (AWS/Azure/GCP with no IBM), volumes are small, you prefer code over visual interfaces, or the budget doesn't stretch to IBM licensing and you'd rather invest in engineering with open source tools.

Getting trained

Official IBM DataStage training in Europe

If DataStage is part of your current stack or it's going to be, proper training is the difference between a team that designs efficient jobs and one that produces pipelines that take hours to run and nobody can debug.

SIXE is an IBM Authorized Training Partner and offers official DataStage courses delivered by IBM-accredited instructors.

The courses include official IBM materials and hands-on labs. Available on-site across Europe, remotely, or as in-company training tailored to your team. Delivered in English, Spanish and French.

Custom training

If you need a tailored course — for example, focused on migrating classic jobs to Cloud Pak for Data, or on performance tuning for a specific environment — we design it using official materials supplemented with content from our own deployments. See the full IBM training catalogue or contact us directly via WhatsApp.



Working with IBM DataStage?

Official training. In Europe. By people who deploy it.

Whether you're starting with DataStage or want to take your team to the next level, official IBM courses delivered by SIXE cover everything from fundamentals to advanced parallel engine administration.

OpenStack Gazpacho 2026.1: VMware Replacement for Private Cloud

Private Cloud · April 2026

OpenStack Gazpacho: the release that makes the VMware exit real and your private cloud ready for what's next.

OpenStack 2026.1 Gazpacho shipped on April 1st. Over 9,000 changes, 500 contributors, parallel live migrations, intelligent bare metal automation with Ironic, and the steady elimination of Eventlet. It's the SLURP release of the year — and the most relevant one for organisations rethinking their virtualisation stack.

April 2026 · 14 min read
OpenStack 2026.1 Gazpacho release logo — the 33rd version of the world's most widely deployed open source cloud infrastructure software, the leading VMware alternative for private cloud

OpenStack just shipped its 33rd release. A few years ago, this would have been inside-baseball for private cloud operators. But 2026 is not a normal year for infrastructure. Broadcom has been reshaping the VMware landscape for over a year. European organisations are measuring the real cost of vendor dependency. And the open source private cloud has gone from an ideological stance to an economic one.

Gazpacho arrives at the exact moment when more organisations than ever are actively looking for serious alternatives to vSphere. And this release, for the first time in a while, has concrete answers to the questions those organisations are asking.

To understand what's actually new, it helps to start with something unglamorous but critical: whether you can even upgrade without disrupting production.

The release model

What SLURP releases are and why Gazpacho is the one that matters

OpenStack publishes two releases per year. Since 2022, alternating releases are marked as SLURP (Skip Level Upgrade Release Process): they allow operators to upgrade once a year, jumping directly from one SLURP to the next without touching the intermediate version.

Here's the current landscape:

Release | Date | Status
2026.1 Gazpacho | 1 April 2026 | Current release. SLURP. Recommended for new deployments and upgrades.
2025.2 Flamingo | 1 October 2025 | Supported. Non-SLURP (intermediate).
2025.1 Epoxy | April 2025 | Previous SLURP. Direct upgrade to Gazpacho supported.
2026.2 Hibiscus | September 2026 (planned) | Next release. Non-SLURP.

The official Gazpacho cycle schedule details all milestones, feature freeze dates, and responsible teams. The official release page collects the highlights selected by the community.

The practical consequence: if your environment runs Epoxy (2025.1), you can jump straight to Gazpacho without touching Flamingo. That halves your mandatory upgrades and simplifies maintenance window planning in production.

And if you're on Flamingo, it's a standard one-step upgrade.
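The skip-level policy is simple enough to encode. A sketch using the releases from the table above; this is a simplified model of the upgrade rules, so check the official documentation for your exact versions before planning a window.

```python
# Release metadata from the table above; True marks a SLURP release.
SLURP = {"2025.1": True, "2025.2": False, "2026.1": True}
ORDER = ["2025.1", "2025.2", "2026.1"]

def upgrade_supported(src: str, dst: str) -> bool:
    """Simplified model of the policy: one-step upgrades to the next
    release are always supported; skipping the intermediate release is
    supported only between consecutive SLURP releases."""
    step = ORDER.index(dst) - ORDER.index(src)
    if step == 1:
        return True
    if step == 2:
        return SLURP[src] and SLURP[dst]
    return False

print(upgrade_supported("2025.1", "2026.1"))  # True: Epoxy -> Gazpacho
```

Encoding the rule in pre-flight tooling is a cheap guard against scheduling an upgrade path the project does not actually support.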

Context

Gazpacho is the 33rd OpenStack release. Around 500 contributors from 100 organisations — Ericsson, Rackspace, Red Hat, Samsung SDS, SAP, NVIDIA, Walmart, among others — contributed over 9,000 changes during a six-month cycle. The CI/CD system (OpenDev Zuul) processed over one million jobs to validate this release. And a relevant data point for Europe: 40% of contributions came from European developers, driven by the continent's digital sovereignty initiatives. Full details are in the official OpenInfra Foundation blog post.

With the upgrade calendar clear, let's look at what actually justifies moving — beyond just closing a support window.

The headline feature

Parallel live migrations in Nova: the game changer

If we had to pick a single feature from Gazpacho to explain why this release matters to enterprises, it would be this: Nova now supports parallel live migrations.

Until now, live migration of instances processed memory transfers sequentially: one connection after another. Gazpacho introduces multiple simultaneous connections, significantly accelerating workload transfers between hosts or availability zones.

Why does this matter so much? Because migration speed is, in practice, the factor that determines how long a maintenance window lasts. For an organisation with hundreds of VMs — or one evaluating a move from VMware — the difference between serial and parallel migration is the difference between a single maintenance night and an entire weekend.
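Back-of-envelope arithmetic shows why. Assuming per-VM migration time is roughly constant and throughput scales with the number of simultaneous migrations, which real networks only approximate, the wall time for draining a host fleet looks like this (all numbers illustrative):

```python
import math

def migration_wall_hours(n_vms: int, hours_per_vm: float,
                         concurrent: int) -> float:
    """Rough wall time to live-migrate a fleet with a fixed number of
    concurrent migrations. Assumes constant per-VM time and bandwidth
    that scales with concurrency -- real networks only approximate
    this."""
    return math.ceil(n_vms / concurrent) * hours_per_vm

serial = migration_wall_hours(200, 0.1, 1)     # 20 h: a whole weekend
parallel = migration_wall_hours(200, 0.1, 8)   # 2.5 h: one night
```

Even granting that real-world scaling is sublinear, the order-of-magnitude gap is what turns a weekend-long evacuation into a single maintenance window.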

Figure: live migration maintenance window, serial vs parallel transfer timeline (t=0 to t=4T).

And that's not all that changes in Nova

Parallel migrations are the headline, but Nova brings three more changes that stack on the same principle: less operational friction.

Live migration with vTPM

Another feature that pairs with parallel migrations: Nova now allows live migration of instances using vTPM (virtual Trusted Platform Module) without shutting them down. Workloads with disk encryption or secure boot requirements — common in regulated environments — can now be moved between nodes without service interruption. Previously, migrating a VM with vTPM required a full shutdown-move-restart cycle. That's over.

Default IOThread per QEMU instance

A subtler change with real impact: Nova now activates a default IOThread per QEMU instance, offloading disk I/O processing from vCPUs. In high-density environments — many VMs per host — this translates to more consistent storage performance under load, without touching any configuration.

Full OpenAPI coverage in Nova

Nova achieves complete OpenAPI schema coverage: every API endpoint now has a machine-readable specification. For teams automating with Terraform, Ansible, or custom tooling, this means validating requests before sending them and reducing errors in infrastructure-as-code deployments.
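A hand-rolled sketch of that idea: checking a request body against a simplified schema fragment before sending anything to the API. The field names mirror Nova's server-create request, but the schema below is deliberately reduced and illustrative, not Nova's actual OpenAPI document:

```python
# Illustrative pre-flight validation against a simplified schema fragment.
# Real tooling would consume Nova's published OpenAPI spec instead.
SERVER_CREATE_SCHEMA = {
    "required": ["name", "flavorRef", "imageRef"],
    "types": {"name": str, "flavorRef": str, "imageRef": str},
}

def validate(body, schema):
    errors = []
    for field in schema["required"]:
        if field not in body:
            errors.append(f"missing required field: {field}")
    for field, expected in schema["types"].items():
        if field in body and not isinstance(body[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors

# Catch the mistake locally, before Terraform or Ansible ever makes the call.
print(validate({"name": "web-01", "flavorRef": "m1.small"}, SERVER_CREATE_SCHEMA))
# ['missing required field: imageRef']
```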

Nova handles the VMs. But VMs need to live somewhere. And if that somewhere is physical hardware that hasn't gone through OpenStack before — which is exactly the situation in most VMware migrations — Ironic enters the picture.

Intelligent bare metal

Ironic: intelligent automation that cuts manual work

Ironic is OpenStack's service for provisioning physical servers — bare metal — as if they were virtual machines. In Gazpacho, Ironic receives a package of improvements that together represent a step change in day-to-day operations:

  • Autodetect deploy interface. Ironic automatically determines the right deployment method for each server, eliminating manual selection. It doesn't sound dramatic, but in a datacentre with heterogeneous hardware (different generations, different vendors) it saves hours of per-node configuration.
  • Automatic protocol detection (NFS/CIFS) for Redfish. Virtual media boot configuration is simplified: Ironic determines the correct protocol without operator intervention.
  • Trait-based port scheduling. Network assignment is automated based on real infrastructure attributes — NIC type, speed, capabilities — instead of depending on manual mappings.
  • Noop deploy interface. Perhaps the most interesting for migration projects: it allows you to register and monitor servers that are already deployed and running, without needing to reprovision them. The typical use case: onboarding existing hardware into OpenStack's inventory during a gradual migration from VMware or another platform.
In practice

The noop interface is particularly useful in the VMware-to-OpenStack migration projects we run at SIXE. It allows you to register existing ESXi hosts in OpenStack, monitor them, and plan workload migration without needing to reinstall the host OS until the time comes. It's the difference between "migrate everything over a weekend" and "migrate in phases with control".
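A hedged sketch of what registering such a node might look like. The attribute set below is illustrative (the BMC address and driver details are invented), and the commented-out openstacksdk call assumes a reachable cloud:

```python
# Illustrative attribute set for onboarding an already-deployed host with
# the noop deploy interface. Values are examples, not a real inventory.
def noop_node_attrs(name, bmc_address):
    return {
        "name": name,
        "driver": "redfish",
        "deploy_interface": "noop",  # register and monitor, never reprovision
        "driver_info": {"redfish_address": bmc_address},
    }

attrs = noop_node_attrs("esxi-host-01", "https://10.0.0.12")
# With a live cloud (assumption): openstack.connect().baremetal.create_node(**attrs)
print(attrs["deploy_interface"])
```

The point of the sketch is the `deploy_interface: noop` line: the host joins OpenStack's inventory, and its OS stays untouched until its migration phase arrives.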

Cyborg driver guide

Cyborg, OpenStack's accelerator service, publishes for the first time a unified driver configuration guide covering all supported types: FPGA, GPU, NIC, QAT, SSD, and PCI passthrough. For organisations incorporating AI accelerators or HPC workloads into their private cloud, having tested, unified documentation significantly lowers the barrier to entry.

VMs migrated with Nova, existing hardware onboarded with Ironic. Now for the part that always ends up more complex than expected: the network.

Networking, storage, security

What changed inside: OVN, Manila, OpenBao

Networking: native BGP and OVN improvements in Neutron

The Neutron OVN controller gains features that have been long requested in enterprise environments:

  • Native BGP support for advertising routes directly from the network controller, without external tooling.
  • North-South routing for external ports, simplifying connectivity with physical networks.
  • Allowed address pairs with virtual MACs, enabling multi-tenant scenarios and hybrid connectivity architectures.

Historically, these were scenarios where OpenStack required workarounds or third-party tools. Gazpacho solves them natively.

Storage: QoS in Manila and asynchronous volume attachments

Manila introduces QoS types and specifications that allow applying per-workload performance policies to shared storage. If you have a storage pool for databases and another for cold files, you can now set limits and priorities at the platform level, not just at the hardware level. Cinder advances asynchronous volume attachments, improving responsiveness in high-density environments.

Security: PKI with OpenBao

Integration with OpenBao — an open fork of HashiCorp Vault — for PKI certificate management in OpenStack-Ansible allows aligning OpenStack's certificate infrastructure with existing enterprise security tooling. Especially relevant in regulated environments where certificate lifecycle management is audited and where administration already uses Vault as a standard.

Watcher: cross-zone migration strategies

Watcher, the resource optimisation service, improves its workload redistribution strategies across zones with enhanced testing. For operators managing multi-site clouds or differentiated availability zones, this improves the reliability of automated redistributions during maintenance or failures.

All of the above — migrations, automation, networking — are visible features you can justify in a budget conversation. But there's work Gazpacho does silently, more important in the long run than any single feature: cleaning house.

Technical debt

Eventlet: the beginning of the end for legacy concurrency

One of the most important background stories in OpenStack over the past three years is the progressive elimination of Eventlet, a cooperative concurrency library adopted in the project's early days when Python 2 lacked a native alternative. Python 3 does have one: asyncio. But migrating a project the size of OpenStack from one concurrency model to another is a monumental effort.

In Flamingo (2025.2), four services completed the migration: Ironic, Mistral, Barbican, and Heat. In Gazpacho, Nova makes significant progress towards native threading, with measurable performance and stability improvements. Neutron also advances. And nine more projects are in active migration.

Why should an organisation that doesn't contribute code to OpenStack care? Because Eventlet is a known source of subtle bugs, debugging difficulties, and incompatibilities with modern Python libraries. Its removal makes OpenStack more stable, more maintainable, and more predictable in production over the long term. It's the kind of invisible improvement that doesn't appear in demos but makes a real difference at 3 AM when something fails and needs diagnosing.
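A minimal illustration of the shift (this is not OpenStack code): the kind of concurrent fan-out that Eventlet handled with monkey-patched green threads, expressed in Python's native asyncio:

```python
# The same cooperative fan-out pattern, written with native asyncio instead
# of Eventlet green threads. Hostnames and the sleep are stand-ins for real
# non-blocking I/O against compute nodes.
import asyncio

async def probe(host):
    await asyncio.sleep(0.01)  # placeholder for an async network call
    return f"{host}: ok"

async def main():
    hosts = ["compute-01", "compute-02", "compute-03"]
    # All probes run concurrently on one thread, no monkey patching needed.
    return await asyncio.gather(*(probe(h) for h in hosts))

results = asyncio.run(main())
print(results)
# ['compute-01: ok', 'compute-02: ok', 'compute-03: ok']
```

The behaviour is the same; what changes is that the concurrency model is explicit, debuggable, and part of the language rather than a patched-in layer.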

Figure: Eventlet migration progress (complete, in progress, pending per service). Overall project goal: native asyncio across all services.
Market context

Gazpacho as a VMware replacement: why the timing matters

We're back to where we started. We said 2026 is not a normal year for infrastructure — that Broadcom has forced organisations to rethink their virtualisation stack. The question those organisations ask is no longer "is OpenStack technically capable?" That was settled a long time ago. The question is: "does it have answers to my specific problems?" With Gazpacho, that answer has improved substantially.

Since Broadcom restructured the VMware licensing model — significant price increases, end of perpetual licences, shift to subscription bundles — the concerns we hear most often from customers are always the same. Gazpacho has a direct answer for each one, and each answer connects back to what we've just covered:

Common concern | Gazpacho's answer
Migration speed | Parallel live migrations rivalling vMotion. vTPM without downtime.
Existing hardware | Ironic noop lets you register hosts without reprovisioning. Gradual migration.
Complex networking | Native BGP in OVN, N-S routing, virtual MACs. No external tooling needed.
Accelerators (GPU, FPGA) | Unified Cyborg driver guide. Improved PCI passthrough.
Predictable upgrades | SLURP model: one major upgrade per year, tested upgrade path.
Sovereignty / lock-in | Open project, 100 contributing organisations, 40% European.

On top of this, OpenStack now exceeds 55 million cores in production worldwide. It's not a niche project or a future promise: it's infrastructure running in production at Walmart, CERN, NTT, Deutsche Telekom, and thousands of smaller organisations that don't publish their numbers. The project has been running for 15 years and the contributor base is growing, not shrinking.

The European angle

40% of Gazpacho contributions came from European developers. This is no accident: the EU's digital sovereignty initiatives are driving adoption of open infrastructure in public administrations and regulated enterprises. For organisations that need to demonstrate technological independence in their procurement or audit processes, OpenStack is one of the few options that combines technical maturity with genuinely open governance.

Next steps

How this fits your infrastructure

Release model, parallel live migrations, intelligent bare metal, networking without workarounds, legacy debt winding down, favourable market context. Everything converges on the same practical question: what do you do now? The answer depends on where you are.

There are three common situations, and the right diagnosis differs for each:

  • You have OpenStack in production and need to plan the upgrade to Gazpacho. If you're on Epoxy, the direct SLURP-to-SLURP upgrade is your path. If you're on Flamingo, it's a standard one-step jump. In both cases, the recommendation is to validate in a test environment, especially if you use features that have changed between versions (migrations, OVN, Ironic).
  • You have VMware and the renewal cost has you looking at alternatives. The question here is one of design: which workloads move first, what OpenStack architecture fits (on what hardware, with what storage — Ceph is the natural choice), and how to migrate data without stopping the service.
  • You're starting a new project — private cloud, data platform, AI infrastructure — and OpenStack is on the table. Gazpacho is the right version to start with, using Ceph for storage and, if you need containers, Kubernetes integrated via Magnum.

In all three cases, the sensible order is the same: a short technical conversation to understand the situation, an architecture proposal with clear options and trade-offs, and a validation lab before touching production. What changes with Gazpacho is not the process — it's that there are more mature pieces to work with.

Evaluating OpenStack or planning an upgrade?

Let's talk about your infrastructure.

Tell us where you are — pending upgrade, VMware exit, new project — and we'll come out of the call with a clear picture of architecture, effort, and next steps. No generic quotes. Directly with engineers who've had their hands inside projects like yours for years.

What is Wazuh? The Open Source SIEM Alternative to Splunk and QRadar

SIEM & XDR · April 2026

What is Wazuh and why it's the real alternative to Splunk and QRadar in 2026.

In two years, Cisco bought Splunk for $28 billion and Palo Alto Networks bought IBM's QRadar SaaS assets. Meanwhile, Wazuh kept publishing releases, launched threat hunting with a local LLM, and crossed 10 million annual downloads. Here's why that changes the SIEM conversation in 2026.

April 2026 · 12 min read

There's a conversation we've been having several times a month at SIXE since late 2024. A head of IT or a CISO calls us, usually with a bit of fatigue in their voice, and says some version of the same thing: "our Splunk contract is up and next year's budget won't cover it", or "we're on QRadar SaaS and IBM just told us we need to migrate to Cortex XSIAM, but we're not sure that's what we want".

It's not a coincidence. The commercial SIEM market has changed more in 24 months than it had in the decade before. And in the middle of all that movement, there's one piece still standing exactly where it was. It isn't listed on any stock exchange, nobody has acquired it, and in the meantime it just keeps growing. It's called Wazuh.

Context 2024-2026

The SIEM map that changed in 24 months

If you've been in defensive security for a while, you know the commercial SIEM market has always been conservative. The big players moved slowly. Customers put up with painful contracts because migrating a SIEM is a serious project, and nobody does it for fun. And then, between the spring of 2024 and the summer of 2025, three things happened that broke that equilibrium.

March 2024 — Cisco buys Splunk for $28 billion

The most expensive move in the history of SIEM. Cisco paid $157 per share, well above what Splunk had been trading at earlier in the year. Before closing the deal, Splunk laid off 7% of its workforce — around 560 people — in a global restructuring. Analyst surveys right before the acquisition already suggested that 22% of customers were considering switching vendors if prices went up after the deal. Those of us who've watched this kind of acquisition play out before know what usually comes next: internal pressure to justify the high purchase price, roadmap shifts, and increasingly tense renewal cycles.

May-September 2024 — Palo Alto Networks buys QRadar SaaS assets

This is the one nobody saw coming. In May 2024, IBM and Palo Alto Networks announced that Palo Alto was acquiring IBM's QRadar SaaS assets for around $500 million, with closing confirmed in September. Forrester summed up the implication in a sentence that QRadar customers are still digesting: when contracts expire, QRadar SaaS customers have to migrate to Cortex XSIAM or move to another vendor. It isn't an opinion, it's the official transition plan.

IBM keeps supporting QRadar on-premise — bug fixes, critical updates, new connectors — so customers with their own installations aren't left stranded overnight. But the underlying message that security committees are reading is clear: the heavy investment is no longer going toward QRadar, it's going toward XSIAM and Precision AI. Many are using that signal to rethink their entire SOC strategy for the medium term.

Meanwhile — Wazuh crosses 10 million downloads a year

This didn't make headlines, because Wazuh doesn't have a PR machine anywhere near the scale of Cisco or Palo Alto. But the numbers are there. According to figures the project publishes itself, Wazuh has crossed 10 million annual downloads, maintains one of the largest open source security communities in the world, and in June 2025 rolled out a feature none of its commercial competitors yet offers without a separate licence fee: threat hunting powered by a large language model running locally. We'll come back to that.

An important note on QRadar. At SIXE we still provide official IBM QRadar support and training, and we'll keep doing it as long as there are customers with active deployments. QRadar on-prem is still a solid tool for teams that already have it running and want to get the most out of it. But if you're kicking off a new SIEM project in 2026, or you have QRadar SaaS and the contract is coming up for renewal, the conversation that makes sense right now is a different one. And it runs through Wazuh a lot more often than it did two years ago.
The product

What is Wazuh (beyond the "it's free" line)

If Wazuh were just a cheap log collector, this would be a different conversation. It isn't. Wazuh is a platform that unifies, inside a single agent and a single stack, a long list of functions that the rest of the market sells as separate boxes: SIEM, XDR, endpoint detection, file integrity monitoring, CVE vulnerability scanning, configuration assessment against CIS benchmarks, regulatory compliance and active response. All from the same agent running on Linux, Windows, macOS, Docker containers or virtual machines.

Here's what actually happens, in practice, when a Wazuh agent is deployed on one of your servers:

  • Collects and correlates logs in real time. Syslog, auditd, Windows Event Log, application logs — all with native decoders. Ships them encrypted to the central manager, where rules are evaluated, events are correlated across hosts, and alerts fire when they should.
  • Monitors file and configuration integrity. Any change in /etc, the Windows registry, system binaries or files you've marked as sensitive triggers an immediate alert. This is tamper detection, and it's one of the things you used to have to buy separately.
  • Scans for vulnerabilities against updated CVE databases. Wazuh cross-references the installed package inventory with vendor feeds and official CVE sources, and tells you which machines need patching and at what priority. No need to pay for Tenable or Qualys on top.
  • Audits configuration against CIS Benchmarks. Each agent runs periodic hardening evaluations against CIS policies or your own internal policies, and produces compliance reports ready to present to an auditor.
  • Responds actively. Automatic IP blocking, process kills, host isolation, custom script execution. No one touches the keyboard at three in the morning.
  • Maps everything to MITRE ATT&CK. Every fired rule is tagged with the corresponding ATT&CK technique and tactic, which makes SOC analyst dashboards far more useful than the generic panels most tools ship with.
┌───────────────────────────────────────────────────┐
│ Wazuh Manager (analysis engine · rules · response)│
└──────┬───────────────┬────────────────┬───────────┘
       │               │                │
 ┌─────▼─────┐   ┌─────▼─────┐   ┌──────▼─────┐
 │  Agents   │   │  Indexer  │   │ Dashboard  │
 │  linux    │   │  cluster  │   │  MITRE     │
 │  windows  │   │ OpenSearch│   │ compliance │
 │  macos    │   │ → shards  │   │  SOC view  │
 │  docker   │   │ → HA      │   │            │
 │  k8s      │   │           │   │            │
 └───────────┘   └───────────┘   └────────────┘
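The file integrity monitoring item from that list is easy to show in miniature. This is the concept behind FIM, sketched in a few lines of Python; it is not Wazuh's implementation:

```python
# Conceptual file integrity monitoring: hash a baseline, re-hash later,
# report anything that drifted. Wazuh's FIM does this (plus registry,
# attributes, and real-time watches) at agent scale.
import hashlib
import os
import tempfile

def digest(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def drift(baseline, current):
    return [p for p, h in current.items() if baseline.get(p) != h]

with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "sshd_config")
    with open(cfg, "w") as f:
        f.write("PermitRootLogin no\n")
    baseline = {cfg: digest(cfg)}

    with open(cfg, "w") as f:          # simulated tampering
        f.write("PermitRootLogin yes\n")
    changed = drift(baseline, {cfg: digest(cfg)})
    print(len(changed))  # 1: the modified file would raise an alert
```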

The stack is solid and battle-tested in production. An academic paper published by Springer in April 2026 evaluated distributed Wazuh architectures with high availability and sustained ingestion rates well above the average EPS baseline, and concluded — with the usual careful wording academic papers use — that well-designed open source SIEM solutions can match and in certain aspects surpass commercial platforms. Put in plain English: when somebody who isn't selling Wazuh evaluates Wazuh methodically, the results hold up.

The 2025 headline feature

The ace up the sleeve: threat hunting with a local LLM

In June 2025, almost without fanfare, Wazuh rolled out a capability that changes the way a SOC analyst can work: threat hunting assisted by a large language model running locally. Not in OpenAI's cloud. Locally. In your own infrastructure.

Why does it matter? Because all of the "SIEM with AI" options the commercial market has launched — Cortex XSIAM with Precision AI, Splunk's own AI suite, QRadar's late innovations before the sale — work by sending your logs to the vendor's models. And in many cases, that's precisely the thing the customer legally can't do. If your logs contain patient records, banking data, or classified information from a public administration, shipping them off to a third-party LLM in somebody else's cloud is not a conversation — you just can't.

Wazuh's approach sidesteps that problem entirely. You choose the model. You deploy it where you want. Your data stays where it is. And the queries look exactly like what an analyst would phrase naturally: "show me all privilege escalation attempts from the last month correlated with service accounts", "summarise the events on this host in the last 24 hours and prioritise anything anomalous", "is there anything in these logs that looks like MITRE T1078?".

Our take at SIXE

This is exactly the line we've been working on from the infrastructure side for a while — LLMs running on-premise, never shipping anything to someone else's cloud, for environments that handle sensitive data. We've applied it on IBM Power, on AIX, and on Ceph-plus-Kubernetes clusters built for private inference. When we saw Wazuh moving in the same direction from the SOC side, it was one of the things that made us double down on the platform. If you want the infrastructure side of that story, we cover it in detail on our on-premise AI inference page.

The comparison

Wazuh vs Splunk vs QRadar vs XSIAM in 2026

Cutting through the marketing noise, here's the current state of the four players that come up in most of the conversations we have. All figures and statuses are verifiable as of the publication date of this post.

Platform | Status 2026 | Commercial model
Wazuh | Independent. No acquisitions, no funding rounds, growing downloads and community. | AGPLv3 open source. No licence cost. Wazuh Cloud optional.
Splunk | Owned by Cisco since March 2024. 7% workforce reduction pre-close. Integration in progress. | Per GB ingested per day. High cost, renewal pressure rising.
QRadar SaaS | Sold to Palo Alto Networks in 2024. Mandatory migration to Cortex XSIAM when contracts expire. | Destination is Cortex XSIAM. Free migration for "qualified customers".
QRadar on-prem | Maintained by IBM. Bug fixes, connectors, no major new features. | IBM licence per EPS. Official support still active.
Cortex XSIAM | Palo Alto Networks' strategic product. Integrated AI (Precision AI). | Per capacity and features. Positioned at the top of the price range.
Pure ELK / OpenSearch | Free, but you build it yourself: rules, decoders, FIM, compliance. | Free stack. The real cost is in your own engineering time.

The interesting thing about this table isn't in any single column — it's in what reading the whole thing implies. Four of the six commercial players are in transition, in maintenance mode, or in mandatory migration. Wazuh and ELK are the only ones sitting exactly where they were three years ago, with communities intact and public roadmaps. And of those two, only one ships with SIEM, XDR, FIM, vulnerability scanning, active response and compliance out of the box: Wazuh.

A note on cost. When we compare Wazuh to Splunk in technical sessions with clients, the discussion almost never ends up being about the licence cost — which, yes, is much cheaper. It usually ends up being about predictability. Splunk grows with you: the more data you ingest, the more you pay. Wazuh doesn't. And in an environment where logs grow 30-40% a year — because you're adding new services, because GDPR or NIS2 is forcing you to retain more, because you're running more containers — that difference translates into a bill CFOs read very clearly.
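The compounding is easy to sketch with invented figures (these are illustrative numbers, not vendor quotes):

```python
# Toy projection of volume-based pricing under sustained log growth.
# The starting bill and growth rate are hypothetical.
def projected_cost(base_cost, growth_rate, years):
    return base_cost * (1 + growth_rate) ** years

year0 = 100_000                            # hypothetical annual ingest bill
year4 = projected_cost(year0, 0.35, 4)     # 35% yearly log growth
print(round(year4))                        # 332151: more than 3x in four years
```

Under a flat or per-node model, that growth curve belongs to your storage budget, not your SIEM licence.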

The regulatory angle

Compliance without the pain: GDPR, HIPAA, PCI DSS, NIS2, ISO 27001

There's a very practical reason Wazuh is growing fast in regulated sectors: compliance pressure. GDPR has been in effect for years and keeps getting enforced harder. The EU's NIS2 directive is now being actively transposed across European member states, widening the perimeter of organisations legally required to demonstrate detection, response and resilience. On the other side of the Atlantic, HIPAA audits are taking file integrity monitoring and continuous configuration assessment much more seriously than they used to. And PCI DSS is still PCI DSS. For a lot of mid-sized organisations — hospitals, universities, essential-service operators, financial services — the question isn't whether they need a SIEM anymore. It's which one they can afford without the finance committee raising an eyebrow.

Wazuh ships with dashboards and reports mapped directly to the major regulatory frameworks:

  • GDPR. Event logging controls, data access tracking, incident detection and response, evidence for the breach notification clock.
  • HIPAA. Security rule controls for audit logging, access tracking, integrity of protected health information, incident reporting.
  • PCI DSS. Logging, file integrity, vulnerability management and retention controls — the standard's checklist, mapped requirement by requirement.
  • NIS2. Detection controls, incident traceability, reporting to competent authorities, evidence of risk management measures for essential and important entities.
  • ISO/IEC 27001. Evidence for Annex A controls around operations, communications, compliance and security incident management.
  • CIS Benchmarks. Continuous hardening audits for operating systems and services, with historical drift reporting.

That said — and we say this with affection because we come from this world — dashboards alone don't pass an audit. What passes an audit is that somebody has designed the architecture properly, that the rules are tuned to the client's context, that exceptions are documented, and that the evidence trail gets to the person who has to sign it off in a shape they can actually use. That part isn't the product. It's the team deploying it. And it's probably 70% of the value of a Wazuh project done well.

What we do at SIXE

We've been deploying Wazuh in organisations subject to GDPR, NIS2, PCI DSS and sector-specific regulations for years, across Europe and internationally. The full service page — architecture, deployment cycle, SLAs and use cases — is here: Wazuh implementation and support. If compliance is what's pressing you the most right now, that's the conversation to start with.

The migration

Migrating from Splunk, QRadar or pure ELK without blinding the SOC

Migrating a SIEM is a scary project, and rightly so. A badly migrated SIEM leaves your detection controls blind at exactly the wrong moment. That's why the way we do it has to be boring and predictable, with three principles we don't negotiate on:

  1. Never turn off the old SIEM before the new one is actually working. The old one keeps swallowing logs and firing alerts while Wazuh starts running in parallel. For a few weeks you have double coverage and zero risk. That period is expensive in resources, sure, but a lot cheaper than a month of SOC running blind.
  2. Convert the critical rules first, not the whole catalog. Big SIEMs tend to have thousands of accumulated rules, and a significant fraction of them are rules nobody looks at or that fire false positives. The first pass identifies the 50-150 critical rules that actually produce useful detections, rewrites them in Wazuh's format, and validates them against real events. The rest comes later — or doesn't, because a lot of the time it isn't worth it.
  3. Validate with events that actually hurt, not with synthetic tests. Before we consider Wazuh operational, we reproduce a set of real scenarios — privilege escalation, exfiltration attempts, early-stage ransomware behaviour, account compromise — and check that alerts fire, correlate and reach the SOC with the right context. If they don't, it isn't considered operational. It's that simple.
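Principle 2 can be illustrated with a toy triage over invented rule statistics. The rule names and counts below are fabricated for illustration; a real first pass would pull these numbers from the legacy SIEM's own reporting:

```python
# Illustrative triage: rank legacy SIEM rules by true-positive history to
# decide conversion order. Rules with real detections convert first;
# noisy or silent rules go last, or not at all.
rules = [
    {"name": "priv-esc-linux", "fired": 42, "true_positive": 31},
    {"name": "legacy-netbios", "fired": 900, "true_positive": 0},
    {"name": "impossible-travel", "fired": 12, "true_positive": 9},
]

def conversion_order(rules):
    return sorted(rules, key=lambda r: r["true_positive"], reverse=True)

print([r["name"] for r in conversion_order(rules)])
# ['priv-esc-linux', 'impossible-travel', 'legacy-netbios']
```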

The part that changes depending on where you're coming from is the conversion work:

  • From Splunk. The most interesting work. SPL (Search Processing Language) doesn't translate automatically into Wazuh rules, but the detection pattern is usually reproducible with custom decoders and rules on top of OpenSearch. We've done several of these migrations and the bulk of the work is in dashboards and rules, not ingestion.
  • From QRadar. The good news is that QRadar and Wazuh share a lot of philosophy around events and offenses. The bad news is that QRadar's DSMs are proprietary and you need to rebuild the parsers. If you're on QRadar SaaS with the XSIAM migration looming, this is a reasonable window to seriously evaluate the third option.
  • From pure ELK. The easiest of the three — Wazuh already uses OpenSearch as its indexer, so you already know a lot of the data stack. The jump is in adding rules, compliance and active response, which in pure ELK you would have had to build by hand.
Next steps

Where to start

If you've read this far and you're thinking "this applies to me more than I'd like", you're probably right. You don't need a huge project to take the first step. The most useful starting point is usually a short conversation around three questions:

  • Where exactly are you right now? On Splunk with renewal coming up? On QRadar SaaS with the XSIAM migration on the horizon? Nothing yet, and GDPR or NIS2 starting to press?
  • What regulation is actually forcing you? GDPR, HIPAA, PCI DSS, NIS2, ISO 27001 — covering one isn't the same as covering all five, and Wazuh's architecture scales differently depending on which ones genuinely apply.
  • How many endpoints, which operating systems, which logs do you already have, which integrations do you need? With those data points we can already sketch a concrete design and a realistic effort estimate.

When you have a clearer idea of what to look at, the full service page — with the architecture, modules, detailed comparison with commercial alternatives, and deployment cycle — is here: Wazuh implementation and support by SIXE. And if you'd rather just talk to somebody who's had their hands inside projects like yours, the 30-minute technical session is free and no-strings. You walk out with a rough architecture, a realistic effort estimate, and the next steps. If Wazuh fits, we'll say so. If it doesn't, we'll say that too.

Rethinking your SIEM?

Book a technical session. No commitment.

Tell us where you are, what regulation applies, and what's keeping you up at night. You'll walk out of the call with a rough architecture, an effort estimate, and clear next steps. No generic quotes, no sales pitch — just someone from the engineering team.

Bacula: immutable backup that survives ransomware, flat pricing

Backup & Resilience · April 2026

Bacula: the backup that actually survives ransomware. Without charging per TB.

Modern ransomware groups don't go for your servers first. They go for your backups. Here's what that means in practice, what a truly immutable backup really is, and why more and more of our clients are walking away from Veeam and Veritas straight into Bacula.

April 2026 · 10 min read

For years, backup was the boring job in the data centre. You set it up on a quiet Friday afternoon, watched a couple of green ticks, and nobody looked at it again until something broke. We've all been there.

That comfortable neglect isn't an option anymore. The backup server is now the first thing modern ransomware groups hunt for, and the reason is brutally simple: if they take it down before encrypting anything, your company has no way out. You pay, or you lose the data. Meanwhile Veeam, Veritas and Commvault keep charging by the terabyte, and Veeam in particular has been raising list prices by 4-8% every single year: quiet cumulative hikes that have pushed some customers' bills up by 25% or more in four years. There's an alternative most people haven't taken seriously enough.

The 2026 reality

Your backup is no longer the last line of defence. It's the first victim.

In late March, Bacula Systems themselves published two articles that capture where this conversation sits today. One explains why backup infrastructure is a more valuable target to attackers than domain admin accounts, and the other lays out why Zero Trust breaks down at the backup layer. None of this is news to incident responders who've been saying the same thing for a couple of years. But seeing it documented in writing changes boardroom conversations that used to be impossible.

When clients call us to help clean up an incident, the pattern they describe is almost always the same, with small variations:

Day 1-3: Initial access (phishing, VPN flaw, 0-day)
Day 4-10: Reconnaissance + privilege escalation
Day 11-14: Locate and compromise the backup server
Day 15: Wipe catalogs, credentials, repositories
Day 16: Launch mass encryption
Day 17: Ransom note on the intranet

Look at that timeline for a second. Attackers spend more days neutralising the backup than actually running the encryption. It's not a mistake on their part. They know that if the backup survives, the victim doesn't pay, and then the whole attack loses its business case. It's cold criminal accounting.

This is why "do we have backups?" stopped being a useful answer to anything. The question you actually need to ask is uncomfortable: would those backups still be there if an attacker already had admin credentials inside your network? If you need to think about it, the honest answer is probably no.

The uncomfortable number. The resilience reports that came out throughout 2025 suggest that over one-third of ransomware victims discover, mid-crisis, that their backups were also hit. And here's the part that haunts us: in most of those cases, the backup system was "working fine" the day before. Green dashboard, nightly success emails, last backup completed at 3am. Everything was perfect. Right up until it wasn't.
The real definition

What "immutable backup" actually means

Quick heads up: "immutable backup" has become marketing wallpaper. Pretty much every vendor claims to have it, and most of them mean wildly different things by it. So let's be precise about what we're talking about.

A backup is genuinely immutable when, once it's been written, nobody can modify or delete it before its retention expires. Nobody. Not the user who created it. Not your DBA with valid credentials. Not your own root account. And crucially, not an attacker who's already compromised the backup server itself. If any of those people can destroy the copy, the copy isn't immutable. It's a regular backup wearing a fancy label.

To get that guarantee in practice, there are three serious approaches, and Bacula supports all three:

Mechanism | What it is | What it's for
WORM on LTO tape | LTO cartridges in Write-Once-Read-Many mode. The drive firmware physically refuses to overwrite them. | Real air-gap. Eject the tape, drop it in a vault, forget about it. No network reaches it.
Object Lock on S3 / Ceph | Objects locked through the API with a retention period bound to the object itself. Not even the bucket root can touch them. | Immutability for object storage. Works with MinIO, Ceph RGW, AWS S3, Azure Blob.
Append-only filesystems | ZFS or Btrfs with snapshots the backup process itself can't modify or delete. | First line of defence on local disk. Useful for smaller environments but no substitute for true air-gap.
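Through the S3 API, per-object retention is a property of the write itself. Here is a minimal Python sketch of what that call looks like, assuming boto3 and a bucket created with Object Lock enabled (the bucket and endpoint names are hypothetical, adjust for your Ceph RGW or MinIO deployment):

```python
from datetime import datetime, timedelta, timezone

def object_lock_kwargs(bucket, key, body, retention_days):
    """Build put_object arguments that seal an object until retention expires.

    In COMPLIANCE mode, not even the bucket owner or the root account can
    delete the object or shorten its retention before RetainUntilDate."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate":
            datetime.now(timezone.utc) + timedelta(days=retention_days),
    }

# Usage with boto3 against a Ceph RGW endpoint (hypothetical names):
# import boto3
# s3 = boto3.client("s3", endpoint_url="https://rgw.example.internal")
# s3.put_object(**object_lock_kwargs("bacula-volumes", "vol-0001", data, 365))
```

The retention date travels with the object, which is exactly what makes it useless to an attacker holding valid credentials.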

The rule a lot of CISOs have been quietly adopting is called 3-2-1-1-0. You've probably come across it if you've been in the conversation recently: three copies, two different media, one offsite, one immutable, and zero errors in verification. That "1-0" is what changed from the classic 3-2-1 we all grew up with. Having copies isn't enough anymore — they need to be unalterable, and they need to be verified automatically without anyone having to remember to check. And here's the thing: Bacula does both as part of its normal operating cycle. It's not a premium add-on you license separately.
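The rule is mechanical enough to express in a few lines of Python. This is an illustration of the five digits, not a compliance tool:

```python
def meets_3_2_1_1_0(copies, media_types, offsite, immutable, verify_errors):
    """True only if the plan satisfies every digit of 3-2-1-1-0:
    3 copies, 2 media, 1 offsite, 1 immutable, 0 verification errors."""
    return (copies >= 3 and media_types >= 2 and offsite >= 1
            and immutable >= 1 and verify_errors == 0)

print(meets_3_2_1_1_0(3, 2, 1, 0, 0))  # False — no immutable copy
print(meets_3_2_1_1_0(3, 2, 1, 1, 0))  # True
```

Note what the first call shows: a textbook 3-2-1 setup with no immutable copy fails the modern rule.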

Why tape is suddenly cool again

Back in January, the LTO consortium announced LTO-10: 30 or 40 TB native per cartridge (up to 100 compressed), 400 MBps, and the same physical air-gap as LTO-1 had 25 years ago. Tape is still the only backup technology where an attacker with root on your Bacula server literally cannot reach the data, because the data isn't connected to anything. At SIXE we've been watching clients who threw out their tape libraries a few years ago come back and buy new hardware. It's not nostalgia — it's risk maths.

A modern LTO library — the one place ransomware genuinely cannot reach.
Why Bacula

Why Bacula is such a good fit for this

Bacula has been protecting data for more than twenty years in banks, government agencies, hospitals, telcos and scientific data centres at petabyte scale. It's probably the most mature open source enterprise backup software out there, and we're not the ones saying it — the installs that have been running since the early 2000s say it, and so do the vendor's customers who've been trusting it with critical data for two decades. The current commercial flagship is Bacula Enterprise 18.0.8, and it brings things that would have sounded like science fiction five years ago: BGuardian for catalog poisoning detection, a security dashboard in BWeb, native support for Microsoft 365 and Azure, a Nutanix AHV plugin, CSI snapshots for Kubernetes.

But the real reason Bacula holds up so well against modern ransomware isn't the new plugins. It's a set of design decisions the project made twenty years ago that, as it turns out, have aged very well:

  • Distributed by design. Director, catalog, storage daemons and clients are all separate processes. They can live on different hosts, with different credentials, on different network segments. Compromising one of them doesn't hand you the others — nowhere close.
  • Catalog in a real database. PostgreSQL or MySQL with their own backups, their own access controls, their own audit trails. You can replicate it, put it behind its own firewall, treat it as the critical asset it actually is. It's not some binary file sitting in /var.
  • Automated content verification. Bacula periodically re-reads volumes and compares checksums. If something's been tampered with — by a human, by a bad disk sector, by some weird hardware bug — you find out long before you need that emergency restore. This sounds boring but it's the difference between a scare and a catastrophe.
  • Genuine open source with no asterisks. The volume format is documented. If SIXE goes away tomorrow, or if Bacula Systems goes away, your backups are still recoverable with community open source tools. You can't say that about almost any other vendor.
  • LTO, Object Lock and air-gap built in since forever. These aren't recent features that got bolted on because ransomware became fashionable. They've been the natural way to run Bacula for decades. When the rest of the market woke up to the fact that tape mattered again, Bacula was already there.
┌────────────────────────────────────────────────┐
│     Bacula Director (orchestration + catalog)  │
└───────┬────────────┬────────────┬──────────────┘
   ┌────▼────┐  ┌────▼──────┐  ┌──▼───────┐
   │ Clients │  │ Storage   │  │ Catalog  │
   │ Linux   │  │ Daemons   │  │ Postgres │
   │ VMware  │  │           │  │          │
   │ Proxmox │  │ → LTO WORM│  │          │
   │ IBM i   │  │ → Ceph S3 │  │          │
   │ DB2     │  │ → Disk ZFS│  │          │
   └─────────┘  └───────────┘  └──────────┘
The money side

Why Bacula gets cheaper every year (and the big vendors don't)

The price gap between Bacula and the big proprietary players isn't some discount trick. It comes down to the licensing model itself. Veeam, Veritas NetBackup and Commvault all bill based on how much data you protect, or how many hosts you have, or some creative combination of the two. The result is always the same: when your company grows, your backup bill grows at exactly the same rate. And companies grow every year.

Bacula Systems, on the other hand, licenses by number of clients and the plugins you activate. The price stays flat even if your data doubles. The first time a CFO realises this in a meeting with us, the silence across the table is very revealing.

It gets worse on the other side. Veeam has announced price increases of 4-8% for both January 2025 and January 2026, and that's just the headline number — customers on the Veeam forums regularly report cumulative increases of 25% or more across 2021-2024, with some individual renewals coming back at +49%. People in the industry have started comparing it to what Broadcom did to VMware after the acquisition. It's not unfounded.

Model | What you pay for | What happens when you grow
Veeam | Per protected workload, per socket, or per instance | Rises with host or VM count — plus 4-8% yearly hikes
Veritas NetBackup | Front-end capacity (TB of data to protect) | Rises with data volume
Commvault | Mix of capacity and active modules | Rises with both data and feature adoption
Bacula Enterprise | Per number of clients and plugins | Flat. Data grows, cost doesn't.
Bacula Community + SIXE | Free software + engineering hours from us | You only pay for what you use of our team.

In the migrations we've run from Veeam or Veritas, first-year savings typically land between 60% and 85%. The sweet spot is mid-sized environments, roughly 20-200 TB protected — that's where proprietary capacity pricing hurts the most and where Bacula shines the brightest. And the gap widens every year, because your data keeps growing and Bacula doesn't punish you for it.

A word on the numbers. The actual savings depend on your current contract, your host count, which modules you use, whether you have tape, what your storage footprint looks like. There's no magic figure. What we do at SIXE is take your current annual backup bill and run it side by side with a realistic Bacula plus support estimate, using your real numbers, before you commit to anything. And if the savings don't justify the move, we'll tell you. We've made a few friends that way.
The implementation

How to design a Bacula that survives an attack

Installing Bacula isn't the hard part. Designing a Bacula that survives someone who already has root inside your network — that's where the real engineering happens. These are the five architectural decisions that make the difference between a backup that saves the company and one that goes down with it:

1. Separate the control plane from the data plane

The Bacula director and its PostgreSQL catalog should not live on the same network as the clients they protect. There needs to be a separate admin network with its own firewall rules, its own MFA, and ideally a zero-trust architecture applied to the backup servers themselves. Sounds obvious — but we see this done wrong most of the time we go in to audit someone else's install.

2. At least two truly independent repositories

One fast repository on disk, or on Ceph with Object Lock, for day-to-day restores. And a second air-gap repository for long retention: LTO tape, or a physically isolated Ceph cluster in another location, or both. The 3-2-1-1-0 rule isn't an optional suggestion, it's the minimum you should reasonably aim for. If your current backup plan fits in a single repository, you have a problem.

3. Automated verification. Actually turn it on.

Bacula lets you schedule jobs that re-read volumes and compare checksums. Use them. Seriously. It's the only way to catch tampering, media degradation, or weird hardware errors before you find them out on the worst possible day — the day you need an emergency restore. Our usual recommendation is weekly at minimum, more often where the infrastructure can handle it.
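Conceptually, a verification pass re-reads what is on the media and compares it against checksums recorded at backup time. Bacula does this natively with its Verify job type; the sketch below is only an illustration of the idea in Python, where `manifest.json` is a hypothetical checksum manifest written when the backup ran:

```python
import hashlib
import json
import pathlib

def verify(manifest_path):
    """Re-read every file and compare its SHA-256 against the recorded one.

    Illustration only: Bacula's Verify jobs do the equivalent against the
    catalog, at volume level, on a schedule."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    tampered = []
    for path, expected in manifest.items():
        digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if digest != expected:
            tampered.append(path)  # human tampering, bad sector, firmware bug
    return tampered
```

The point of the exercise: a green "backup completed" email tells you the write succeeded, only a re-read like this tells you the data is still what you wrote.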

4. Retentions you can't shorten after the fact

Retention periods are defined in the director and "sealed" at write time on immutable volumes. Once a volume has been written to a WORM tape or an Object Lock bucket, there's no Bacula console and no database client that can shrink that retention. The volume is going to exist until it expires, end of story. This is exactly what protects your backup from a disgruntled insider or an attacker with valid credentials.

5. A recovery plan that's written down and actually rehearsed

A backup you never restore doesn't exist. Bacula makes the restore side genuinely manageable, but rehearsing the drill — with timings, ownership, decision points — is the team's job. At SIXE we bake this into the client's Disaster Recovery Plan, with actual drills scheduled at least twice a year. There's no shortcut for this one.

Where most projects stop

These five points are exactly where almost every Bacula project we get called in to review has stopped halfway. None of this is on by default — they're engineering decisions that depend on your specific infrastructure, your regulatory framework, how many people touch the backup, what your budget looks like. When SIXE walks into a Bacula project, this is literally what we do. And it's usually the part that adds the most value, far more than installing the package itself.

The migration

Migrating from Veeam or Veritas without losing a single byte

The objection we hear on almost every first call is word for word the same: "we've got seven years of history in Veeam, we can't afford to lose it." And they're absolutely right. In banking, healthcare, public administration or any environment governed by GDPR, ISO 27001, HIPAA or sector-specific frameworks, long retention isn't a preference — it's a legal requirement. The good news is there's no technical reason to lose anything. You just have to design the migration properly.

At SIXE we do it in three phases:

  1. Coexistence window. We leave the old system running at full tilt, doing its daily backups exactly as before. Bacula starts protecting in parallel. For a few weeks you have double coverage and zero risk. This tends to reassure security teams that were raising eyebrows at the start.
  2. Real restores before we switch anything off. This is where migrations are actually won or lost. We don't do the classic test VM restore — we restore workloads that would genuinely hurt if they failed. A critical database. A regulated file server. When those restores confirm the design works, we move to the next phase. Not before.
  3. Read-only historical archive. Your old system's history doesn't get touched. It stays accessible for as long as your retention policy needs, in read-only mode, as a live archive. If someone asks for a 2023 file next year, it's still there. And Bacula handles everything new in the meantime.

This is honestly one of the services we're being asked for the most right now. It's a perfect storm: proprietary licensing costs going up every year, the new pressure around immutable backups, and a growing realisation that backup has been quietly sitting in the "unaddressed risk" pile for too long. If you want the full breakdown of the service — SLAs, use cases, sector examples — it's all here: Bacula support and migration services.

Next steps

Where to start today

If you've read this far and you're thinking "I should probably take another look at our backup", you're almost certainly right. You don't need to launch a huge project to take the first step. Honestly, the most useful starting point is usually a short audit answering three questions honestly:

  • Are our backups actually immutable? If an attacker had admin credentials on my network right now, could they destroy the backups?
  • Are they being verified automatically? Or will I only find out they were corrupted on the day I actually need a restore?
  • How much am I paying per protected TB per year, and what will that number look like in three years?

You can answer all three with your current team, without calling anyone. If any of the answers make you uncomfortable, let's talk. No sales pitch, no slides, directly with someone on the engineering team who's had their hands inside projects very much like yours.

Would your backup survive today?

Let's look at your backup architecture together

No salespeople, no generic presentations, no commitment. Just a technical call with someone from the SIXE team to see where you are and what actually makes sense to change. If Bacula fits, we'll say so. If it doesn't, we'll say that too.

Run an LLM on IBM i via PASE — No Linux Required

IBM i · March 2026

We Ran an LLM on IBM i. No Linux. No Cloud. No GPU.

llama.cpp compiled for AIX runs natively on IBM i via PASE. Your RPG programs can call a local language model without adding infrastructure or sending data anywhere.

March 2026 · 8 min read

If you manage an IBM i system, you know how this conversation goes. Someone asks about AI, and the answers are always the same: "spin up a Linux LPAR", "use OpenAI", "check out Wallaroo". Every option means leaving the platform, adding layers, and at some point sending business data to a server you don't control.

There are 150,000 IBM i systems processing transactions in banking, insurance, and healthcare. The answer can't always be "add more infrastructure". So we tried something different.

The experiment

What we actually did

We took llama.cpp — the most widely used open-source LLM inference engine — compiled it for AIX, and copied the binary to an IBM i V7R5 partition. We ran it via PASE. It worked on the first try.

$ uname -a
OS400 WWW 5 7 007800001B91

$ /QOpenSys/pkgs/bin/python3 -c "import platform; print(platform.platform())"
OS400-5-007800001B91-powerpc-64bit

$ /QOpenSys/pkgs/bin/python3 -c "import sys; print('Byte order:', sys.byteorder)"
Byte order: big

That's IBM i V7R5 on pub400.com — a public IBM i system. Big-endian, powerpc-64bit, OS400. Not Linux, not AIX. IBM i.

What kind of binary

$ file llama/llama-simple
llama/llama-simple: 64-bit XCOFF executable or object module

A 64-bit XCOFF binary — the native executable format for AIX. Compiled on AIX 7.3 POWER using GCC 13.3 with VSX vector extensions enabled. The same binary from our llama-aix project, which already ships 10 big-endian GGUF models on HuggingFace.

First run

$ LIBPATH=/home/HBSIXE/llama /home/HBSIXE/llama/llama-simple --help

example usage:

    /home/HBSIXE/llama/llama-simple -m model.gguf [-n n_predict] [prompt]

The binary loads, links libggml and libllama, parses arguments, and responds. All inside PASE. To run actual inference, you point it at a big-endian GGUF model:

$ LIBPATH=/home/HBSIXE/llama /home/HBSIXE/llama/llama-simple \
    -m models/tinyllama-1.1b-q4_k_m-be.gguf \
    -p "What is IBM i?" -n 100 -t 4
IBM i PASE terminal running llama.cpp: the XCOFF binary loads, links libraries and responds to a prompt in real time
The context

Why this matters for IBM i shops

In 2026, the AI conversation in the IBM i community is louder than ever. IBM just launched Bob (the successor to WCA for i), a coding assistant for RPG developers. 70% of IBM i customers plan hardware upgrades this year. And yet there's one question that still doesn't have a clean answer:

How do I integrate an LLM into my IBM i applications without depending on an external service?

The usual options, right now:

Option | What it means | The catch
Linux LPAR | Spin up a separate partition, run the LLM there, call it from RPG via API | New hardware to manage, added cost, data crosses partition boundaries
Cloud API | Call OpenAI, Azure, or AWS from RPG | Business data leaves the machine. A serious problem in banking, insurance, and healthcare
Wallaroo | Option 1 packaged as a service | $500/month. Still a Linux LPAR with branding
PASE + llama.cpp | The LLM runs inside IBM i itself, via PASE | No extra hardware. Data never leaves the partition.
What about IBM Bob?
Bob is for the developer: it helps understand, document, and generate RPG code from the IDE. What we describe here is for the production application: an LLM running in the same partition that any RPG program can call like a local API. They solve different problems. Bob for the dev workflow, local inference for the apps themselves.
The technical foundation

PASE: the bridge you already have

PASE (Portable Application Solutions Environment) is a runtime built into IBM i that executes AIX binaries natively. It's not emulation — it's a layer that exposes AIX system calls directly on top of the IBM i kernel. If something runs on AIX, it can run on IBM i via PASE.

┌──────────────────────────────────────────┐
│ IBM i (OS400)                            │
│  ┌──────────────┐    ┌────────────────┐  │
│  │ RPG / CL     │    │ PASE           │  │
│  │ COBOL / Db2  │───→│ (AIX runtime)  │  │
│  │              │    │                │  │
│  │ localhost    │    │ llama-server   │  │
│  │ :8080        │    │ + GGUF model   │  │
│  └──────────────┘    └────────────────┘  │
│           IBM POWER Hardware             │
└──────────────────────────────────────────┘

We've been building and shipping AIX packages through LibrePower's AIX repository for years — over 30 open-source packages installable via DNF. When llama.cpp joined the catalogue, testing the jump to IBM i was the natural next step. PASE handles the rest.

For IBM i administrators

You don't need to install anything special on the operating system. PASE is already active. All you need is the XCOFF binary of llama.cpp and a big-endian GGUF model. The LLM runs as a regular PASE process, without touching the native IBM i environment.

The technical hurdle

The big-endian problem (and how we solved it)

There's a reason nobody had done this cleanly before: byte order. IBM i and AIX are big-endian. Virtually all AI software — x86, ARM, Linux ppc64le — assumes little-endian. A GGUF file downloaded from HuggingFace won't load on IBM i: the bytes are in the wrong order.

We'd already solved this in our AIX work. The solution: convert the models before distributing them. We publish big-endian GGUF models at huggingface.co/librepowerai, validated on real AIX hardware and ready to load directly on IBM i PASE.
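The failure mode is easy to see with Python's struct module: the same 32-bit float serialized both ways shares no usable byte layout, which is why a stock little-endian GGUF from HuggingFace won't load under PASE:

```python
import struct

value = 1.0
le = struct.pack("<f", value)   # little-endian: x86, ARM, Linux ppc64le
be = struct.pack(">f", value)   # big-endian: IBM i and AIX on POWER

print(le.hex())  # 0000803f
print(be.hex())  # 3f800000

# Reading little-endian bytes on a big-endian machine yields nonsense:
print(struct.unpack(">f", le)[0] == value)  # False
```

Every tensor in a GGUF file has this problem, so the fix has to happen once, at conversion time, before the model is distributed.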

Model | Size | Quantization
TinyLlama 1.1B Chat | 668 MB | Q4_K_M
LFM 1.2B Instruct | 695 MB | Q4_K_M
LFM 1.2B Thinking | 731 MB | Q4_K_M
7 more available

These are the same models that reach 10–12 tok/s on AIX POWER. On IBM i POWER10 — with MMA hardware acceleration active via OpenBLAS — performance should be comparable or better. Concrete IBM i benchmarks are in progress.

From PoC to production

From proof of concept to production

Running --help proves the binary loads. The real path to useful AI in your applications has three stages, and the first one is available right now.

Stage 1: Direct inference (available now)

From any SSH or QSH session on the IBM i:

# Direct inference from the command line
LIBPATH=/path/to/llama /path/to/llama/llama-simple \
    -m /path/to/model.gguf \
    -p "Summarize this purchase order" -n 200 -t 8

Useful for CL scripts, batch jobs, or just verifying that the model loads and responds correctly on your specific hardware before going further.

Stage 2: OpenAI-compatible API server (coming soon)

llama.cpp includes llama-server, which exposes an HTTP endpoint compatible with the OpenAI API. Once running in PASE, any RPG program can call it using QSYS2.HTTP_POST — exactly like any other API:

# Start the inference server on IBM i via PASE
LIBPATH=/path/to/llama /path/to/llama/llama-server \
    -m /path/to/model.gguf \
    --host 0.0.0.0 --port 8080 -t 8
// Call it from RPG — the LLM is on localhost
dcl-s url varchar(256) inz('http://localhost:8080/v1/chat/completions');
dcl-s body varchar(65535);
dcl-s response varchar(65535);
body = '{"model":"tinyllama","messages":'
     + '[{"role":"user","content":"Summarize this purchase order"}]}';
// QSYS2.HTTP_POST — no data leaves IBM i
exec sql values QSYS2.HTTP_POST(:url, :body) into :response;

The important part: localhost. The model is on the same machine. Data never leaves the partition.

Stage 3: Business application integration (in development)

  • Document analysis: feed Db2 reports to the LLM for automatic summarization
  • Natural language queries: the user types in plain English, the LLM returns SQL
  • RPG code modernization: the LLM analyzes and documents existing programs without leaving IBM i
  • Intelligent monitoring: analyze QSYSOPR messages and job logs with semantic context
A note on performance: small models (1–2B parameters) running in PASE are more than enough for classification, summarization, structured data extraction, and fixed-format responses. For longer text generation or complex reasoning, 7B+ models scale well with more threads. IBM i POWER10 benchmarks are in progress.
Hands-on

How to try it yourself

If you have access to an IBM i with PASE active, it's three steps.

1. Get the llama.cpp binary for AIX

Available on LibrePower's GitLab. If you have DNF/yum configured:

# From AIX (or via PASE if you have dnf)
dnf install llama-cpp

2. Download a big-endian model

curl -L -o tinyllama-be.gguf \
  "https://huggingface.co/librepowerai/TinyLlama-1.1B-Chat-v1.0-GGUF-big-endian/resolve/main/tinyllama-1.1b-q4_k_m-be.gguf"

TinyLlama is a solid starting point: 668 MB, fast to load, and enough to verify everything works before moving to larger models.

3. Run inference

LIBPATH=/path/to/llama ./llama-simple \
    -m tinyllama-be.gguf \
    -p "What is IBM i?" \
    -n 150 -t 4
IBM i in production?

SIXE has been supporting IBM i environments for years. If you want to understand whether this approach fits your architecture — or what it means for your RPG applications — get in touch. No strings attached.

Roadmap

What's next

This is a solid proof of concept, not a finished product. Here's what we're working on next:

  • llama-server on IBM i — the HTTP API server running in PASE, documented and packaged so you can get it running in minutes
  • RPG integration examples — real code for calling the LLM from RPG programs via QSYS2.HTTP_POST
  • IBM i POWER10/POWER11 benchmarks — real tok/s measurements with PASE on production hardware
  • Larger models — testing 7B+ models on partitions with enough memory
  • vLLM for IBM i — our vLLM package for ppc64le, adapted to run in PASE

More from LibrePower

Project | What it does
llama-aix | llama.cpp for AIX with 10 big-endian GGUF models ready to download
linux.librepower.org | APT repository with vLLM for Linux ppc64le (Ubuntu/Debian)
aix.librepower.org | 30+ open-source packages for AIX, installable via DNF

Got IBM i with PASE?

Try the LLM on your own partition

The binary is on GitLab. The models are on HuggingFace. If you have PASE access and a few minutes, you can replicate exactly what we describe here :)

We just made LLM inference a sudo apt install on IBM POWER


LibrePower · Linux on Power · March 2026

vLLM on IBM POWER: LLM inference without a GPU

The first pre-built vLLM package for Linux ppc64le. Built by the community — and it runs on hardware you already own.

March 2026 · 18 min read


If you run IBM POWER systems, you know the drill. The hardware is extraordinary — POWER9, POWER10 and POWER11 deliver unmatched RAS, memory bandwidth, and per-core throughput. But when it comes to the AI/ML ecosystem, you’ve historically had two options: bring your own GPUs (usually x86), or go through Red Hat OpenShift AI. Today, there’s a third option for vLLM on IBM POWER. One that takes about 30 seconds, works on hardware you already own, and uses MMA hardware acceleration automatically.

The package

What we built: vLLM on IBM POWER as a .deb package

vLLM is the most popular open-source LLM serving engine. It powers inference at companies running millions of requests per day. It supports the full OpenAI API: /v1/chat/completions, /v1/completions, /v1/models — streaming, function calling, tool use, the works.

The problem? There were no pre-built packages for ppc64le. Not on PyPI. Not in Ubuntu’s repos. If you wanted vLLM on IBM POWER, you were on your own — figuring out build dependencies, patching C++ extensions, compiling PyTorch from source. IBM’s own community has documented how complex the manual setup is.


So we compiled it. On real IBM POWER hardware. Optimized for the architecture. And packaged it as a .deb that APT can install with full dependency resolution.

$ apt-cache show python3-vllm

Package: python3-vllm
Version: 0.9.2-1
Architecture: ppc64el
Maintainer: LibrePower <packages@librepower.org>
Depends: python3 (>= 3.10), python3-numpy, python3-requests
Homepage: https://linux.librepower.org
Description: OpenAI-compatible LLM inference server for ppc64le
Ubuntu on IBM POWER
Running Ubuntu on IBM POWER is a key enabler for this workflow. SIXE deploys and supports Ubuntu ppc64le environments with Canonical partnership — the same infrastructure that makes this apt-based installation possible.

Under the hood

The journey: from source to package

Getting vLLM to run on POWER is not a trivial pip install. Here’s what was involved.

PyTorch on POWER

vLLM depends on PyTorch, which is not distributed for ppc64le via PyPI. IBM publishes wheels at wheels.developerfirst.ibm.com — we leverage those as the base. See the full list of IBM-supported developer tools for POWER.

The C extension

vLLM’s performance-critical path is a C++ extension (_C.abi3.so) that handles attention, caching, activation functions, and quantization. This needs to be compiled from source with CMake, linking against PyTorch’s C++ API and oneDNN for optimized GEMM operations.

-- PowerPC detected
-- CPU extension compile flags: -fopenmp -DVLLM_CPU_EXTENSION
   -mvsx -mcpu=power9 -mtune=power9
-- CPU extension source files: csrc/cpu/quant.cpp csrc/cpu/activation.cpp
   csrc/cpu/attention.cpp csrc/cpu/cache.cpp csrc/cpu/utils.cpp
   csrc/cpu/layernorm.cpp csrc/cpu/pos_encoding.cpp
[100%] Linking CXX shared module _C.abi3.so
[100%] Built target _C

The resulting binary includes oneDNN with PPC64 GEMM kernels — the same math library that Intel uses for x86, but targeting POWER’s vector units.

Dependency resolution

The Python ecosystem on ppc64le has gaps. Some packages have pre-built wheels, others need compilation from source, and a few have version conflicts. We resolved all of this so you don’t have to.

In practice

Running LLM inference on IBM POWER: code and output

Here’s what it looks like in practice. First, install the package:

# Add the LibrePower APT repository
curl -fsSL https://linux.librepower.org/install.sh | sudo sh

# Install vLLM for ppc64le
sudo apt update
sudo apt install python3-vllm

# Install PyTorch wheels from IBM
pip3 install torch --extra-index-url \
  https://wheels.developerfirst.ibm.com/ppc64le/linux

Then run inference from Python:

# Python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    dtype="bfloat16",
    device="cpu",
    enforce_eager=True
)

output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(temperature=0, max_tokens=100)
)

print(output[0].outputs[0].text)

But vLLM’s real value is the OpenAI-compatible server mode — this is what makes it useful for production:

# Start the OpenAI-compatible inference server
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --device cpu --dtype bfloat16 --port 8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "What is IBM POWER?"}],
    "max_tokens": 100
  }'

LangChain, LlamaIndex, Open WebUI, Continue.dev — anything that can point to an OpenAI endpoint works out of the box. Change base_url to your POWER server and you’re done. This is what makes CPU inference on IBM POWER a realistic path to private, self-hosted LLM deployment.
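From Python, that switch is nothing more than pointing a standard chat-completions request at the POWER server. A minimal sketch, where the host name is hypothetical and the JSON body follows the OpenAI chat format that llama-server and vLLM both accept:

```python
import json

# Hypothetical host — replace with your ppc64le machine running vLLM.
BASE_URL = "http://power-server:8000/v1"

def chat_request(model, prompt, max_tokens=100):
    """Build the JSON body for POST {BASE_URL}/chat/completions."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

# With the official client (pip install openai) the switch is one argument:
# from openai import OpenAI
# client = OpenAI(base_url=BASE_URL, api_key="not-needed")
# r = client.chat.completions.create(
#     model="Qwen/Qwen2.5-0.5B-Instruct",
#     messages=[{"role": "user", "content": "What is IBM POWER?"}])
```

Nothing in the client code knows or cares that the backend is a CPU-only POWER box instead of a GPU farm.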

The numbers

Performance: real numbers on POWER9, POWER10 and POWER11

We benchmarked on both POWER generations with Qwen2.5-0.5B-Instruct (494M parameters, BF16). These are not theoretical numbers — they come from running the benchmark tool on actual hardware.

POWER9

$ OMP_NUM_THREADS=12 python3 bench_vllm.py
Run 1: 17.8 tok/s (100 tokens in 5.6s)
Run 2: 16.7 tok/s (100 tokens in 6.0s)
Run 3: 18.5 tok/s (100 tokens in 5.4s)
BENCH P9: threads=12 avg=17.6 tok/s

12 threads is optimal — more threads add cache contention on this memory-bandwidth-bound workload.

POWER10

$ OMP_NUM_THREADS=1 python3 bench_vllm.py
Run 1: 13.9 tok/s (100 tokens in 7.2s)
BENCH P10: threads=1 avg=13.9 tok/s

13.9 tok/s from a single POWER10 core. For context, the POWER9 result uses 12 threads across multiple cores to achieve 17.6 tok/s. The per-core efficiency improvement from POWER9 to POWER10 is dramatic, driven by MMA hardware acceleration. POWER11 shares the same MMA architecture with further enhancements.
System | Threads | tok/s | Per-core efficiency
POWER10/11 | 1 | 13.9 | 13.9 tok/s/core
POWER9 | 12 | 17.6 | 1.5 tok/s/core

This isn’t competing with an A100 — it’s filling a completely different gap: running LLM inference on IBM POWER infrastructure you already own. No GPU budget, no PCIe slots, no driver headaches. For organizations with existing POWER9, POWER10 or POWER11 servers, this is a zero-capital path to private AI.

We also tested Qwen2.5-7B-Instruct (7 billion parameters) on a single POWER10 core — it loaded and ran at 1.0 tok/s. Not fast enough for interactive use on one core, but proof that larger models work. With more cores, this scales linearly. Those running IBM POWER training courses through SIXE often ask about AI workloads on existing hardware — these numbers are the answer.

Inside the machine

What actually happens when a POWER10/11 runs an LLM

If you’ve seen IBM’s presentations about AI on POWER, you’ve probably encountered terms like MMA, Spyre, oneDNN, and OpenShift AI. They’re often shown together on the same slide. But what do they actually mean? And which ones are active when you run python3 -m vllm?

We went deep into the software stack to answer this. The findings surprised us.

A quick glossary (no jargon left behind)

  • LLM (large language model) — Software that generates text — ChatGPT, Llama, Qwen. A mathematical model with billions of numbers that predicts the next word.
  • Inference — Running a trained model to get answers. Training teaches the model; inference uses it. This article is entirely about inference.
  • Token — A word or piece of a word. “17.6 tokens per second” works out to roughly a dozen English words per second, since some words split into several tokens.
  • BF16 (bfloat16) — A way to store numbers using 16 bits instead of 32. Half the memory, nearly the same precision. Think: “good enough quality at half the storage cost.”
  • GEMM (general matrix multiply) — The core math operation in neural networks. Most compute time in LLM inference is spent multiplying large matrices.
  • MMA (matrix-multiply accumulate) — Special-purpose circuitry inside POWER10 and POWER11 designed to accelerate matrix math. Like a dedicated calculator for the one specific operation that dominates LLM inference.
  • OpenBLAS — An open-source math library with optimized GEMM routines. The engine that does the actual matrix multiplication on POWER.
  • oneDNN — Intel’s math library, also compiled into vLLM. Another engine for the same purpose.
  • PyTorch — The framework that runs the neural network. It calls OpenBLAS or oneDNN for the heavy math.

How the pieces fit together

When vLLM generates a token, here’s the exact path through the machine:

1. You type a question.
2. vLLM receives it and breaks it into tokens.
3. PyTorch runs the neural network math.
4. For each layer: multiply large matrices (GEMM).
5. PyTorch asks OpenBLAS: “multiply these two BF16 matrices”.
6. OpenBLAS runs sbgemm_kernel_power10 ← THIS USES MMA.
7. POWER10/11 hardware executes MMA instructions.
8. The result flows back up and the next token is chosen.
9. You see the next word appear.

MMA acceleration is already active in our benchmarks. It’s not a future feature or a configuration flag — it works right now, through the path PyTorch → OpenBLAS → MMA hardware. No special setup required.

Proving it: BF16 vs FP32 on POWER10/11

On POWER10 and POWER11, MMA accelerates BF16 math. On POWER9 (no MMA), BF16 is actually slower than FP32 due to software emulation. If MMA is working, BF16 should be faster:

# Raw matrix multiplication benchmark (1024×1024) on POWER10
BF16: 384.4 GFLOPS  (5.6 ms)
FP32: 249.6 GFLOPS  (8.6 ms)
BF16/FP32 ratio: 1.54x

BF16 is 1.54× faster than FP32. MMA is active and delivering measurable hardware acceleration. Our 13.9 tok/s on a single POWER10 core already includes MMA. That’s the real, hardware-accelerated number. The power of POWER10 and POWER11’s AI acceleration capabilities is something we cover in depth in our Linux on IBM POWER Systems training courses.
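
The GFLOPS figures follow directly from the matrix size: an N×N GEMM costs 2·N³ floating-point operations. A quick helper to sanity-check the numbers above (the 5.6 ms shown is rounded, so the result lands a shade under the reported 384.4):

```python
def gemm_gflops(n, seconds):
    """GFLOPS for an n x n matrix multiply: 2*n^3 operations / time."""
    return 2 * n**3 / seconds / 1e9

print(round(gemm_gflops(1024, 0.0056), 1))  # ~383.5 (reported: 384.4 at ~5.6 ms)
print(round(gemm_gflops(1024, 0.0086), 1))  # ~249.7 (reported: 249.6 at ~8.6 ms)
```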

The oneDNN investigation (and what we learned)

We initially thought there might be hidden performance left on the table.

The vLLM build bundles oneDNN (originally from Intel). Inside, there are two POWER-specific math paths:

  • int8 GEMM: A hand-written kernel by IBM engineers using MMA instructions for quantized models.
  • BF16 GEMM: A passthrough to OpenBLAS — but only when compiled with specific flags.

Our initial build didn’t have those flags. We recompiled with -DDNNL_BLAS_VENDOR=OPENBLAS, confirmed the flags were active, benchmarked again — same performance.

Why? PyTorch was already going directly to OpenBLAS, bypassing oneDNN for the main matrix operations. The optimization was already there; we just didn’t know it.

Practical takeaway: You don’t need to configure anything special. PyTorch on POWER10 and POWER11 with OpenBLAS automatically uses MMA for BF16 inference. Install the package and run.

What about IBM Spyre?

IBM Spyre is a dedicated AI accelerator card for POWER — a completely separate piece of hardware with its own silicon for AI math. Think of it this way:

  • MMA = built-in acceleration inside every POWER10 and POWER11 core (active right now in our benchmarks)
  • Spyre = a separate AI accelerator card you add to the system (promising, but requires specific IBM software stacks)

Our work focuses on what’s available today using the CPU already in your machine, with zero additional hardware investment.

The complete picture

| Technology | What it is (plain English) | Active in our build? |
| --- | --- | --- |
| POWER10/11 MMA (BF16) | Built-in matrix accelerator in the CPU | Yes — PyTorch → OpenBLAS |
| POWER10/11 MMA (int8) | Same hardware, for 8-bit quantized models | Built, not end-to-end yet |
| IBM Spyre | Separate AI accelerator card | No — different hardware |
| OpenShift AI | Full ML platform on Kubernetes | No — we’re the lightweight path |
| oneDNN | Math library bundled with vLLM | Compiled in, bypassed by PyTorch |
| OpenBLAS | Math library with hand-tuned POWER10/11 kernels | Yes — the real workhorse |

Context

The bigger picture: LLM inference on IBM POWER without OpenShift

Red Hat OpenShift AI

Until now, the official IBM/Red Hat play for LLM inference on IBM POWER was OpenShift AI. It supports notebooks, pipelines, model training, serving, and monitoring. As of version 3.0, it runs on ppc64le with CPU-only workloads.

OpenShift AI is the right choice for organizations that already have OpenShift clusters. It comes with RBAC, InstructLab for model fine-tuning, and enterprise support.

But it requires OpenShift. A Kubernetes cluster, a Red Hat subscription, operator management. For many POWER shops — especially those running standalone Linux or mixed AIX/Linux — that’s a significant commitment just to serve a model. Organizations managing these environments often rely on SIXE’s IBM POWER maintenance and support services to keep them running.

What LibrePower adds

We’re not replacing OpenShift AI. We’re complementing it with a lighter path for the many POWER sites that don’t need the full platform.

| | OpenShift AI | LibrePower vLLM |
| --- | --- | --- |
| Install | OpenShift cluster + operators | apt install python3-vllm |
| Infrastructure | Kubernetes required | Any Ubuntu/Debian ppc64le |
| Scope | Full ML lifecycle | Inference serving only |
| Support | Red Hat subscription | Community (open source) |
| GPU | Supported (x86) | CPU-only (POWER native) |
| Time to first inference | Hours to days | Minutes |
| Cost | OpenShift licensing | Free |

IBM builds the highway — world-class hardware, PyTorch wheels, OpenShift AI, InstructLab. LibrePower adds an on-ramp for people who don’t need the full platform. Both are needed. IBM’s roadmap for AI on IBM POWER is moving fast, and community tooling like this fills real gaps in the ecosystem today.

The infrastructure

How the LibrePower package repository works

We built linux.librepower.org following the same pattern as our AIX package repository — infrastructure that already serves 30+ open-source packages to AIX systems worldwide.

linux.librepower.org/
  dists/jammy/
    InRelease          (GPG signed)
    Release
    main/binary-ppc64el/
      Packages
  pool/main/
    python3-vllm_0.9.2-1_ppc64el.deb
  install.sh

CI/CD runs on GitLab: every push regenerates APT metadata and deploys automatically. All packages compiled on real IBM POWER hardware — not cross-compiled, not emulated. The full source is on GitLab under Apache 2.0.

Roadmap

What’s next for vLLM on IBM POWER

  • More models tested — Llama, Mistral, Phi, Granite. Systematic benchmarks across model families.
  • llama.cpp for ppc64le — GGUF quantized models for even lower memory footprint. Already shipping for AIX.
  • Ubuntu 24.04 and Debian 12 support — Extending the package to the latest LTS releases.
  • POWER10/11-optimized variants — Going deeper into MMA tuning. Our current 13.9 tok/s per core is a starting point, not a ceiling.
  • int8 GEMM end-to-end — Completing the MMA path for quantized models, which should improve throughput further.
Want to run AI workloads on your IBM POWER infrastructure?
SIXE helps organizations deploy and operate Linux on IBM POWER — from official IBM Linux on Power training to full infrastructure support. If you’re evaluating LLM inference on existing POWER hardware, talk to us.

Got a ppc64le system?

Try vLLM on IBM POWER now

If you have a system running Ubuntu, it’s three commands. Source is on GitLab if you want to dig in or contribute. IBM POWER training and infrastructure support by SIXE.

# Add the LibrePower repository
curl -fsSL https://linux.librepower.org/install.sh | sudo sh

# Install vLLM for ppc64le
sudo apt update && sudo apt install python3-vllm

# Install PyTorch (IBM wheels)
pip3 install torch --extra-index-url \
  https://wheels.developerfirst.ibm.com/ppc64le/linux

# Run the OpenAI-compatible inference server
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --device cpu --dtype bfloat16 --port 8000

What Is an AI Factory and How to Build One with Open Source


AI Infrastructure · March 2026

What is an AI Factory and how to build one with open source in your own data centre

The AI Factory concept has been everywhere for two years — but few organisations actually understand what it takes to build one, or how to do it without being locked into a cloud provider. Here’s a straight-talking breakdown, with the specific stack we use in production.

March 202620 min read

An AI factory is not a server with a GPU and a model downloaded from Hugging Face. It is a distributed compute infrastructure designed to run language and vision models in production continuously, at scale, and under full organisational control. The good news: building one is no longer the exclusive privilege of hyperscalers. The open source technology powering the Barcelona Supercomputing Center’s AI Factory, and the sovereign AI infrastructure programmes across Europe, is available to any organisation with its own data centre. What follows is a practical guide to what you need, what you don’t, and how to decide whether it makes sense for you.

Just a bit of context

What exactly is an AI factory?

The term “AI Factory” was popularised by NVIDIA’s Jensen Huang in 2023 to describe what data centres are becoming: machines that produce intelligence continuously, the way a factory produces goods. The metaphor isn’t poetic — it’s technically precise.

A classic AI factory has four distinct components: a storage system for model weights and datasets (which run to tens or hundreds of gigabytes), a GPU compute layer for running inference, an orchestrator managing which model runs on which hardware, and an API that exposes models to the rest of the organisation. When those four components work efficiently together, you have an AI factory.

What differentiates it from “having an LLM running on a server” is scale, reliability, and management. An AI factory serves multiple models in parallel, handles request queuing, guarantees availability, and monitors resource usage. It’s production infrastructure — not a test environment.

Worth knowing

The European Commission has committed over €1.5 billion to building AI Factories distributed across member states under the EuroHPC programme. The explicit goal is for Europe to have sovereign AI infrastructure that doesn’t depend on US or Asian providers. Spain participates through the BSC. The same technology stack they use can be deployed in your data centre.

Why bring AI inference in-house?

Why organisations are moving their AI workloads on-premises

Three arguments come up in every conversation we have with clients evaluating on-premise AI infrastructure. These aren’t marketing talking points — they’re operational and financial realities.

💸

Predictable costs

GPU bills in public cloud can swing 30–40% between billing cycles depending on demand. With your own infrastructure, the cost is fixed, depreciable, and entirely predictable. At medium inference volumes, the investment typically pays back in 12–18 months compared to cloud.

🔓

Zero vendor lock-in

Proprietary APIs, closed formats, fine-tuned models living in someone else’s infrastructure. With an open source stack, your models and your data are yours — always portable, no exit negotiations, no lock-in contracts.

🏛️

Regulatory compliance

GDPR and the EU AI Act require knowing precisely where data is processed. If your inference touches patient data, citizen records, or financial data, you need full control over the infrastructure. On-premises is the only architecturally sound answer.

The question is no longer whether to build your own AI infrastructure, but when and how to do it without repeating the mistakes of the cloud rush a decade ago: velocity without architecture.
— SIXE Technical Team

That said, on-premise AI infrastructure isn’t right for everyone. If you’re running ten inference requests a day and have no strict regulatory requirements, cloud is probably the right answer right now. On-premises starts making sense when volumes are sustained, when data is sensitive, or when you need to run proprietary fine-tuned models without exposing weights to third parties.

OK so… how do I actually build this?

The open source stack: three technologies, zero proprietary dependencies

A combination of three technologies has emerged as the de facto standard for building on-premise AI factories in European enterprise environments. The same stack the BSC uses. The same technologies driving sovereign AI infrastructure in France, Germany, and Italy. And the same stack we deploy at SIXE.

Ceph: distributed storage built for AI workloads

Language models are heavy. Llama 3 70B occupies roughly 140 GB in float16 precision. Mixtral 8x7B sits around 90 GB. A reasonable model catalogue for a mid-sized organisation can easily exceed 500 GB — before accounting for fine-tuning datasets or inference logs.

Ceph solves this with distributed storage that unifies object storage (natively S3-compatible), block storage, and filesystem in a single cluster. It scales from terabytes to petabytes without interruption, supports erasure coding for storage efficiency, and has native Kubernetes integration via CSI. In an AI factory, Ceph acts as the backbone where model weights, datasets, and inference results live.
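
The efficiency difference between replication and erasure coding is simple arithmetic: 3x replication keeps one third of raw capacity usable, while a k+m erasure-coded pool keeps k/(k+m). A sketch with illustrative numbers (the 600 TB figure and the 4+2 profile are examples, not a recommendation):

```python
def usable_fraction(scheme, k=None, m=None):
    """Fraction of raw capacity available for data.
    3x replication keeps 1/3; erasure coding k+m keeps k/(k+m)."""
    if scheme == "replica3":
        return 1 / 3
    if scheme == "ec":
        return k / (k + m)
    raise ValueError(scheme)

raw_tb = 600
print(round(raw_tb * usable_fraction("replica3")))      # 200 TB usable
print(round(raw_tb * usable_fraction("ec", k=4, m=2)))  # 400 TB usable
```

The trade-off, of course, is that erasure coding buys capacity at the cost of higher CPU and network load on writes and recovery, which is part of why sizing matters.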

SIXE Perspective

We are a Canonical Partner and have been deploying Ceph clusters in production for years, including in AI and HPC environments. Ceph is not a tick-the-checkbox deployment: it requires careful sizing, network design, and replication policies adapted to your workload. Three-node clusters have quorum considerations that shouldn’t be improvised. We offer dedicated training and support so your team can operate Ceph with real autonomy — not permanent consulting dependency.

OpenStack: your private cloud with native GPU management

OpenStack turns your hardware into a private enterprise cloud. For an AI factory, its primary role is GPU resource management: PCI passthrough for direct GPU access from VMs, vGPU for sharing a physical GPU across multiple workloads, and NVIDIA MIG (Multi-Instance GPU) for partitioning A100 and H100 GPUs into independent instances.

Under the Linux Foundation since 2024, OpenStack runs in production across more than 45 million cores at organisations including Walmart, GEICO, and LINE Corp. This isn’t emerging technology — it’s proven infrastructure at real scale, with independent governance that guarantees continuity.

Worth noting

OpenStack is not trivial. It spans more than 30 service projects and requires teams with experience in distributed systems. If your team comes from a VMware background, the learning curve is real. Our training service covers practical, hands-on upskilling so your team can operate the stack independently — without long-term consulting dependency.

Kubernetes + vLLM: the inference orchestration layer

Kubernetes is the CNCF standard for container workload orchestration, with native GPU scheduling via the NVIDIA GPU Operator. The inference engines are deployed on top of Kubernetes — and vLLM is the most significant for language models right now.

vLLM implements PagedAttention, a technique that manages KV cache memory efficiently and enables parallel serving of multiple requests without wasting VRAM. In representative benchmarks, vLLM delivers 3–5x the throughput of a naive implementation of the same model. It exposes an OpenAI-compatible API, which makes migrating applications already consuming GPT-4 or similar models straightforward.
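
The core idea behind PagedAttention can be sketched as bookkeeping: the KV cache is carved into fixed-size blocks, and each sequence claims blocks on demand instead of preallocating its maximum length. This toy model (block size and counts are arbitrary; real vLLM manages GPU tensors, not IDs) shows why VRAM isn’t wasted:

```python
class PagedKVCache:
    """Toy sketch of PagedAttention-style KV cache block allocation."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # current block full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_seq(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(40):                # a 40-token sequence
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))  # 3 blocks: ceil(40/16), not max-length
```

Because unfinished sequences only hold the blocks they actually use, many requests can share the same VRAM pool in parallel, which is where the throughput gain comes from.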

For vision or embedding models, NVIDIA’s Triton Inference Server complements vLLM and enables hardware-specific optimisations such as TensorRT-LLM.

How does an AI factory take shape?

Reference architecture: from data to model in production

An on-premise AI factory with this stack follows a four-layer flow. It’s not the only possible design, but it’s the one that best balances operational complexity, performance, and portability.

01 — Data

Ceph S3

Models, datasets, inference results. S3-compatible API for integration with MLOps pipelines.

02 — Compute

OpenStack

GPU scheduling, bare metal, isolated project networks. PCI passthrough and MIG for maximum efficiency.

03 — Orchestration

Kubernetes

GPU Operator, inference pod autoscaling, deployment lifecycle management.

04 — Production

vLLM / Triton

Inference APIs, RAG, agents. OpenAI-compatible endpoints for friction-free integration.

The key to this design is that each layer is independent and replaceable. If a better orchestrator than Kubernetes emerges for AI workloads tomorrow, you can swap it out without touching storage or the compute layer. That’s what zero vendor lock-in really means: not just open source software, but genuine separation of concerns in the architecture.

| Component | Role in the factory | Viable alternatives | Governance |
| --- | --- | --- | --- |
| Ceph | Model and data storage | IBM Storage Scale (GPFS) | Linux Foundation |
| OpenStack | Private cloud with GPU management | MaaS + direct bare metal | OpenInfra / LF |
| Kubernetes | Container orchestration | MicroK8s, OpenShift | CNCF / LF |
| vLLM | LLM inference engine | Triton, TensorRT-LLM | Apache 2.0 |
| Ubuntu / Canonical | Base OS + stack support | RHEL, SUSE | Canonical Partner |

Is this right for my organisation?

Who actually needs an on-premise AI factory

Not every sector has the same urgency or the same constraints. In four areas, on-premise AI infrastructure isn’t a preference — it’s the only architecturally viable answer.

🏥

Healthcare & pharma

Clinical records, diagnostic imaging, genomic data. GDPR and the EU Health Data Space Regulation impose strict restrictions on transfers to third countries without explicit safeguards. On-premise inference is the default compliance architecture.

🏦

Banking & insurance

Credit scoring, fraud detection, risk analysis. EBA guidelines on AI in financial services and the EU AI Act classify these systems as high-risk, with traceability and control requirements that only on-premises architectures can meet.

🏛️

Public sector & defence

Technological sovereignty, NIS2, classified data. The EU’s AI strategy explicitly requires that public-facing AI systems operate on European or national infrastructure. No discussion needed.

🏭

Industry & manufacturing

Computer vision on production lines, predictive maintenance, quality control. Cloud latency is not viable when you need millisecond response times on the factory floor. Edge or on-premises inference is the only workable model.

FAQ

Questions to answer before you start

Building an on-premise AI factory is not a weekend project. It requires honest prior analysis across four dimensions that determine whether it makes sense and how to execute it well.

Which models are you serving, and at what request volumes?

GPU sizing depends directly on model size (parameter count and precision) and throughput targets (requests per second, acceptable P99 latency). A 7B parameter model in float16 fits in a single L40S GPU with 48 GB of VRAM. A 70B model requires multiple GPUs with tensor parallelism. There are no shortcuts here: correct sizing requires knowing real workloads, not optimistic estimates.
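
The first-order arithmetic is straightforward: weight memory is parameter count times bytes per parameter, plus headroom for KV cache and activations. A rough helper (the 20% headroom factor is our working assumption, not a vendor figure; real sizing still needs measured workloads):

```python
def min_vram_gb(params_billion, bytes_per_param=2, headroom=1.2):
    """Rough VRAM floor for serving: weights (params * bytes) plus headroom
    for KV cache and activations. float16/bfloat16 = 2 bytes per parameter."""
    return params_billion * bytes_per_param * headroom

print(round(min_vram_gb(7), 1))   # ~16.8 GB -> fits one 48 GB L40S
print(round(min_vram_gb(70), 1))  # ~168 GB -> multiple GPUs, tensor parallelism
```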

Does your team have the capacity to operate this stack?

The most important question — and the one asked least often. A team with experience in Linux, Kubernetes and distributed systems can learn to operate this stack. But if you’re starting from scratch, the learning curve needs to be inside the plan, not outside it. SIXE offers certified training in Ceph, OpenStack and Kubernetes (as an IBM Business Partner and Canonical Partner) precisely so the transition doesn’t create permanent consulting dependency.

What is the real 3-year TCO?

The software is open source, so there are no licence costs. The investment is hardware (GPUs, servers, high-speed networking) plus team upskilling. Compared to cloud GPU costs at the same inference volume over that period, the numbers tend to speak for themselves. But the financial model must include maintenance, updates, and staff operating time. Nothing is free — and projects that start from that assumption tend to hit unpleasant surprises.

How we work at SIXE

Before any deployment, we carry out an architecture assessment: we audit your real workloads, latency requirements, data volumes, and regulatory obligations. We deliver a complete design — GPU sizing, network topology, Ceph storage layout, and a 12–24 month capacity plan. No smoke, no savings promises we haven’t calculated. Just a technical analysis of whether it makes sense, and how to execute it.


Have an AI inference project in mind?

Your AI factory, built with the stack we use ourselves

IBM Business Partner and Canonical Partner. Over 15 years deploying open source in production. We design the architecture, train your team, and support you until the infrastructure runs on its own. Our work ends when yours truly begins.

10 Architecture & ROI Mistakes in the Post-VMware Exodus

Technical Analysis · February 2026

10 Architecture & ROI Mistakes Nobody Assessed in the Post‑VMware Exodus

Open source VMware alternatives are viable. But “viable” and “right for your business” are radically different questions — ones that demand technical, financial and governance analysis before you commit your infrastructure for a decade.

February 202625 min readFor CIOs, CTOs & Infrastructure Directors
86% of VMware customers are actively reducing their footprint. Only 4% have completed the migration. Between those two numbers lies a chasm of technical, financial and strategic decisions that most organisations are making with more urgency than rigour. This article doesn’t recommend any platform — it lays out the questions every leadership team should answer before committing their infrastructure for the next 5–10 years. Spoiler: there’s no magic answer, but there is a sensible path forward.

Mistake 01 of 10

Confusing urgency with strategy

Broadcom’s acquisition of VMware in November 2023 for $61 billion triggered the biggest upheaval in enterprise virtualization in over a decade. And “upheaval” is putting it mildly. Perpetual licences eliminated, ~8,000 SKUs consolidated into 4 bundles, a 72-core purchase minimum and reported price increases ranging from 150% to 1,500%. Tesco filed a £100 million lawsuit. Fidelity warned of risks to 50 million customers. Safe to say, the landscape encouraged a stampede.

And that’s exactly what happened: a lot of people have been running — just not always in the right direction. A CloudBolt survey (2026) found that 63% have changed their migration strategy at least twice (yes, twice). Gartner estimates VMware’s market share will fall from 70% to 40% by 2029, but the road there is full of twists that deserve considerably more thought than they’re getting.

SIXE Perspective

The urgency Broadcom has created is entirely understandable — nobody enjoys having their bill multiplied overnight. But every infrastructure decision made under pressure becomes technical debt your teams will inherit for years. The first right decision is to separate the immediate tactical response from the medium-term strategy and evaluate each on its own terms. Breathe, plan, then act.

Mistake 02 of 10

Ignoring open source project governance

Not all open source is created equal — far from it. The most relevant difference for a long-term business decision isn’t the licence, but something almost nobody checks: who controls the project and what mechanisms protect the community if commercial priorities shift.

Proxmox Server Solutions GmbH is a private Austrian company with €35,000 in share capital and an estimated team of 14–24 people. Great people, no doubt, but there’s no independent foundation, no open governance board, and no community representation in development decisions. In other words: the future of your virtualization platform depends on a single company’s choices.

Compare that to the MariaDB Foundation, where no single company can hold more than 25% of board seats — a safeguard that protected the project when MariaDB Corporation was acquired by K1 in September 2024. Or OpenStack, now under the Linux Foundation, with governance distributed across hundreds of organisations. Now that’s a safety net.

Key question

Is your virtualization platform — the one that will run your business applications for the next 10 years — governed by an independent foundation, a consortium, or a single private company with fewer than 25 employees? This isn’t a rhetorical question: the answer has direct implications for long-term vendor lock-in risk.

Mistake 03 of 10

Not reading the Contributor Licence Agreement

We know — reading a CLA isn’t exactly a Friday night plan. But it’s worth it. The Proxmox CLA grants the company a perpetual, worldwide, irrevocable licence over all contributions, with the right to relicence them under commercial or proprietary terms. This mechanism isn’t inherently problematic, but it’s exactly the structural combination (single company + AGPL + permissive CLA) that preceded every major licence change of the past seven years. It’s like watching storm clouds gather and saying “I’m sure it won’t rain.”

| Project | Year | Change | Consequence | Governance |
| --- | --- | --- | --- | --- |
| MongoDB | 2018 | AGPL → SSPL | Dropped by Debian/Red Hat | Single vendor |
| Elasticsearch | 2021 | Apache 2.0 → SSPL | Fork: OpenSearch (Linux Foundation) | Single vendor |
| HashiCorp | 2023 | MPL 2.0 → BSL | Fork: OpenTofu · IBM: $6.4B | Single vendor |
| Redis | 2024 | BSD → SSPL | Fork: Valkey (Linux Foundation) | Single vendor |
| MinIO | 2021–26 | Apache → AGPL → abandoned | Repo: “NO LONGER MAINTAINED” | Single vendor |
| Kubernetes | — | Apache 2.0, stable | — | Foundation (CNCF) |
| PostgreSQL | — | PostgreSQL Licence, stable | — | Community |
| Linux | — | GPLv2, stable | — | Foundation (LF) |

See the pattern? It’s pretty clear: no project backed by an independent foundation has ever suffered a unilateral licence change. Not a single one. This fact should inform any risk assessment, regardless of which platform you’re considering.

Mistake 04 of 10

Assuming subscription costs will stay flat

Every company that develops open source software needs to monetise its work, and that’s entirely fair — nobody works for free. The question isn’t whether prices will go up (they will, as with everything), but whether you’re factoring that into your TCO model.

Proxmox’s Community subscription went from €49.90 to €120/year (~140% increase), and in January 2026, all tiers rose another 3.8–4.3%. The new Proxmox Datacenter Manager requires at least 80% of nodes to carry Basic or higher subscriptions (€370+/socket/year). Sound familiar? It does to us, too.

OpenStack, Ceph and other VMware alternatives also have their own cost structures. No platform is free in production — if anyone tells you otherwise, smile politely and ask for the receipt. The real difference lies in which costs are predictable and which hinge on unilateral decisions.

SIXE Perspective

When we assess alternatives with our clients, we always model three cost scenarios: optimistic, realistic and adverse, with 5- and 10-year projections that factor in potential licensing changes. Yes, it’s more work. But it’s the only way to build a TCO that won’t crumble at the first price shift.
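
A sketch of how such a projection can be structured — the base subscription and the three annual-increase rates below are hypothetical; the point is the shape of the model, not the numbers:

```python
def project_cost(base_annual, yearly_increase, years):
    """Cumulative subscription cost with a compounding annual price increase."""
    total, cost = 0.0, base_annual
    for _ in range(years):
        total += cost
        cost *= 1 + yearly_increase
    return round(total)

# Hypothetical: 50k EUR/year base, three increase scenarios over 5 years
scenarios = {"optimistic": 0.03, "realistic": 0.08, "adverse": 0.25}
for name, rate in scenarios.items():
    print(name, project_cost(base_annual=50_000, yearly_increase=rate, years=5))
# The adverse case (25%/year, in line with recent vendor repricing) ends up
# roughly 55% more expensive than the optimistic one over the same period.
```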

Mistake 05 of 10

Underestimating the operational complexity of OpenStack

Let’s give credit where it’s due: OpenStack runs in production with over 45 million cores at organisations like Walmart, GEICO and LINE Corp. Its governance — now under the Linux Foundation — is among the strongest in the ecosystem. These are genuine, weighty advantages.

But (there’s always a but) the project itself acknowledges that 44% of IT vendors and 39% of enterprises report difficulty finding qualified professionals. The platform comprises over 30 service projects. Deploying and operating that stack requires teams with distributed systems experience that most VMware teams don’t have — not for lack of talent, but because these are fundamentally different skill sets. It’s like asking a brilliant Mediterranean chef to compete in a sushi championship: both are world-class cooking, but the skills don’t transfer automatically.

Important nuance

OpenStack can be the perfect choice for organisations with the right scale and the right teams. Managed offerings (Canonical, Mirantis, Platform9) solve part of the complexity, though they add cost and dependency. The question isn’t “OpenStack yes or no?” but “do our team, budget and scale match what OpenStack needs to thrive?”

Mistake 06 of 10

Assuming Ceph is “just a checkbox to tick”

Ceph is powerful and runs in production at impressive places: CERN, Bloomberg and DigitalOcean, among others. But the gap between a hyperscale Ceph cluster and the typical 3–5 node deployment in a VMware migration is like the gap between flying an Airbus and a microlight: both fly, but the rules are very different.

StarWind benchmarks in Proxmox HCI environments showed Ceph reaching 270,000 IOPS in mixed 4K workloads, compared to 1,088,000 IOPS for LINSTOR/DRBD and 1,199,000 for StarWind VSAN. And the risks of small clusters deserve close attention: losing one node in a 3-node cluster with 3x replication can leave the system in read-only mode. Not exactly what you want at 3 AM on a Monday.

Alternatives worth considering include LINSTOR/DRBD with near-native performance, StorPool with sub-100 µs latencies, and IBM Storage Virtualize with proven enterprise integration. Each has its strengths, limitations and expertise requirements.

SIXE Perspective

The storage layer is arguably the most critical and least reversible decision in the entire migration. It’s where you can’t afford to wing it. It must be evaluated with real testing on real workloads — not generic benchmarks or “it works great for someone I follow on LinkedIn.” This is precisely where an integrator with experience across multiple storage platforms adds the most value.

Mistake 07 of 10

Not auditing the enterprise features you’ll lose

VMware has spent over a decade building capabilities that many organisations have baked into their operational workflows. They’re the sort of things that “just work” — until they’re gone. And when they disappear, the impact goes well beyond technology.

⚖️

Automatic load balancing (DRS)

Proxmox has no native equivalent: you’ll need custom scripting. OpenStack offers partial functionality via Nova scheduling. Either way, prepare to roll up your sleeves.

🛡️

Fault Tolerance & DR

VMware FT/SRM deliver automatic failover. Open source alternatives require custom orchestration with ZFS replication, Ansible and manual runbooks. It works, but someone has to build and maintain it.

🌐

SDN & microsegmentation

Proxmox SDN supports VLANs/VXLANs/EVPN, but IPAM/DHCP are in “tech preview” (read: not quite ready for prime time). There’s no equivalent to the NSX distributed firewall.

📋

Vendor certifications

SAP, Oracle and Microsoft don’t certify Proxmox. NVIDIA AI Enterprise isn’t officially supported either. If you depend on these certifications, this is a detail you don’t want to overlook.

🔧

Automation & API

The Terraform providers for Proxmox are community-maintained. The API requires manually specifying the target node. None of this is a deal-breaker, but these frictions add up.

📞

24/7 support

Proxmox operates on Austrian business hours (7:00–17:00 CET), with no 24/7 option at any tier. When production goes down at 3 AM, there’s literally nobody to call. And no, Google doesn’t count.

None of these limitations invalidate the platform — let’s be clear about that. But each one represents a gap you’ll need to fill with engineering, tooling or consultancy, and every workaround carries a cost that belongs in your financial model. Pretending they don’t exist is a recipe for unpleasant surprises.

Mistake 08 of 10

Calculating ROI on the licensing line alone

The licence savings are real. But presenting those savings as the project’s ROI is like valuing a house move solely by the cost of the van: technically correct, practically incomplete.

Gartner estimates migration services cost between $300 and $3,000 per VM. 44% of organisations experience unplanned downtime during migration. And projects estimated at 6 months routinely turn into 24 — a pattern that’s by now very well documented.

The hidden costs that erode your ROI

  • Training: $5,000–15,000/engineer + 3–6 months of reduced productivity (your team learns fast, but not overnight).
  • Integration: backup, monitoring, CMDB, automation — mature connectors for VMware that simply don’t exist for Proxmox.
  • Custom development: load balancing scripts, DR, monitoring = internal technical debt.
  • Turnover: when the engineer who built them leaves, the knowledge walks out the door. And it always happens at the worst possible time.
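A quick back-of-the-envelope model shows how fast these line items swamp the licence savings. The fleet size and headcount below are illustrative assumptions, not a real quote; the per-VM and per-engineer ranges are the ones cited above.

```python
# Back-of-the-envelope hidden-cost model using the ranges quoted above.
# Fleet size and headcount are illustrative assumptions, not a real quote.

vms = 500
engineers = 6

# Gartner's migration-services range: $300-3,000 per VM
migration_low, migration_high = 300 * vms, 3000 * vms
# Training range: $5,000-15,000 per engineer
training_low, training_high = 5000 * engineers, 15000 * engineers

low = migration_low + training_low
high = migration_high + training_high
print(f"Hidden costs (excl. downtime, integration, custom dev): ${low:,} - ${high:,}")
```

For this hypothetical 500-VM estate the range already spans $180,000 to $1,590,000 — before counting downtime, integration work or the custom development listed above.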

Mistake 09 of 10

Designing for Day 1 and ignoring Day 2

The migration is just the beginning — the honeymoon, if you like. The real cost shows up in day-to-day operations, year after year.

Cybernews found that many organisations that migrated to Proxmox aren’t keeping up with security updates. It’s understandable: when all the effort goes into migrating, it’s easy to forget that afterwards you still have to operate. At scale, the UI becomes unresponsive with several thousand VMs, and pmxcfs hits its limits around 11,000 VMs.

🔐

Security & compliance

CVE management, hardening, audits (ISO 27001, PCI DSS, SOC 2). What SLA for critical vulnerability response does each platform offer? An awkward question, but a necessary one.

📊

Telemetry & observability

Open source alternatives (Prometheus, Grafana, Zabbix) are excellent — they truly are — but they require dedicated integration and maintenance. They don’t configure themselves.

💾

Backup & recovery

Proxmox Backup Server is functional, but it plays in a different ecosystem league compared to Veeam, Commvault or IBM Spectrum Protect.

🏗️

Technical debt

Every custom script is code that needs maintaining, documenting, testing and transferring. Technical debt is invisible until it isn’t — and then it’s the only thing you can see.

Mistake 10 of 10

Thinking that technology is the decision

Migrating away from VMware isn’t a technology project: it’s an operational transformation that touches people, processes, vendors, budgets and risk. Technology matters, of course. But it’s just one piece of the puzzle.

“The VMware migration isn’t a technology problem. It’s an operational decoupling problem disguised as a technology problem.”
— Keith Townsend, The CTO Advisor

The questions that actually matter: what level of governance risk is acceptable to us? Do we have the team we need — or can we upskill in time? What’s our real TCO at 5 and 10 years? How do we protect our investment if the open source project changes course? Which workloads do we migrate first, and which ones should perhaps never move? And who will be our technical partner when things don’t go as planned? (Because at some point, they won’t. That’s life.)

Our conviction

Open source is, without question, the right answer for the infrastructure of the future. We have zero doubt about that. The question isn’t whether to migrate, but how to do it with the rigour the decision deserves. The difference between a successful migration and one that generates years of headaches comes down to the quality of the upfront analysis, the architecture chosen and expert support throughout the process. And hey, if after reading all of this you’d like to talk, we’re right here.


The best migration is one made with judgement, not haste

Over 15 years designing, deploying and operating mission-critical infrastructure. We know VMware, Proxmox, OpenStack, Ceph, IBM Storage and the alternatives because we work with all of them. No favourites — just the solution that fits your business.

Running Liquid AI’s New Model on IBM AIX (No GPU Required)

Forget the H100 clusters for a moment. At SIXE, we decided to push enterprise hardware to its absolute limits to answer a burning question: Can a 2018-era IBM Power System, running AIX and relying purely on CPU, handle the latest generation of AI models?

We took Liquid AI’s new LFM2.5-1.2B model and ran it on an IBM POWER9 processor. To our knowledge, this is the first time an LFM2.5 model has ever run on AIX in Big-Endian mode.

The Result?

Nearly 27 tokens per second, coherent responses, and under 750 MB of memory usage. No GPU. No NPU. Just raw Power architecture muscle.

But raw speed is only half the story. To prove this isn’t just a benchmark toy, we put LFM2.5 through a “SysAdmin Gauntlet”—real AIX administrative tasks—and compared it against a standard Transformer (TinyLlama 1.1B). The results were shocking.

The “Secret Sauce”: What is LFM2.5?

LFM2.5 is a hybrid architecture designed for extreme efficiency, mixing Convolutional blocks (shortconv) for speed and Attention layers (GQA) for context. It features a massive 128k context window—enough to read thousands of lines of logs without forgetting the beginning.
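To illustrate why the convolutional blocks are cheap, here is a toy sketch of a short causal convolution over a token sequence. This is emphatically not Liquid AI's actual implementation — just the underlying idea: each output position mixes only a small window of recent tokens, so cost grows linearly with sequence length instead of quadratically as in full attention.

```python
# Toy illustration of the "shortconv" idea: each output token mixes only a
# small causal window of recent tokens, so cost is linear in sequence length.
# NOT Liquid AI's actual implementation -- just the concept.
import numpy as np

def short_causal_conv(x, kernel):
    """x: (seq_len, dim) activations; kernel: (k,) window weights,
    kernel[0] applied to the oldest token in the window.
    Output at position t depends only on tokens t-k+1 .. t (causal)."""
    k = len(kernel)
    padded = np.vstack([np.zeros((k - 1, x.shape[1])), x])  # left-pad with zeros
    return np.stack([
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(x.shape[0])
    ])

x = np.ones((5, 2))  # 5 tokens, hidden dim 2
y = short_causal_conv(x, np.array([0.5, 0.25, 0.25]))
print(y[:, 0])  # ramps from 0.25 up to 1.0 as the causal window fills
```

The GQA attention layers then handle the long-range dependencies the convolution cannot, which is the trade-off that makes the 128k window affordable on CPU.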

The Hardware: IBM Power System S924

We used the workhorse of the enterprise world. Below are the specific configurations used for this benchmark:

  • Server: IBM Power System S924
  • Processor: IBM POWER9 @ 2.75 GHz
  • Configuration: Enterprise Multi-Core (SMT-8 capable)
  • OS: AIX 7.3 TL4
  • Architecture: Big-Endian

Performance: Finding the Sweet Spot

We found that throwing every core at the model actually hurts performance due to synchronization overhead. The verdict was clear: Using just 8 cores in SMT-2 mode (16 threads) gave us 26.7 tokens per second.


Real-World Test: The SysAdmin Gauntlet

Numbers are nice, but can it actually work? We ran the exact same prompts through both LFM2.5-1.2B and TinyLlama 1.1B (a pure transformer of similar size). Here are the raw outputs.

Round 1: The Cryptic Error (errpt)

We fed the models a raw AIX error log regarding a power supply failure.

The Input:

$ errpt -a -l BFE4C025

Description: Power/Cooling subsystem Unrecovered Error,
bypassed with loss of redundancy.
Possible FRUs:
Priority: H
FRU: PWRSPLY
Location: U78D2.001.WZS00P4

  • ❌ TinyLlama 1.1B:

    “7. What is the location of the FRU that is affected by this error? The FRU that is affected by this error is PWRSPLY. 8. What is the description of the FRU that is affected by this error? The FRU that is affected by this error is PWRSPLY…”

    Verdict: Fails. It got stuck in an infinite loop repeating “PWRSPLY”.

  • ✅ LFM2.5 1.2B:

    “What is the likely cause? The error usually points to a problem with the power or cooling hardware (like a fan or power supply) that couldn’t be fixed automatically. It might be a hardware failure or a misconfiguration. What action should I take? First, check the hardware — look at the fans (especially the power fans)…”

    Verdict: Pass. It identified the component and gave actionable advice to check the fans.

Round 2: The Security Audit (last)

We provided a login history log containing a hidden crash event.

The Input:

$ last -20

root pts/0 10.10.10.1 Jan 28 11:14 – 18:38 (07:23)
reboot ~ Jan 28 11:12
root pts/0 10.10.10.1 Jan 27 22:09 – System halted abnormally. (13:03)

  • ❌ TinyLlama 1.1B:

    (generated 1 token, then stopped)

    Verdict: Fails. Absolute silence.

  • ✅ LFM2.5 1.2B:

    “Suspicious patterns: The system was halted abnormally on Jan 27. This could indicate a security incident or a misconfiguration. Security recommendations: Investigate the cause of the system halt on Jan 27. Review the login history for any unauthorized access attempts.”

    Verdict: Pass. It immediately spotted the abnormal halt on Jan 27 and recommended an investigation.

Round 3: The Dangerous Advice (/etc/passwd)

We asked the models to audit a standard password file. This is where things got genuinely dangerous.

The Input:

$ cat /etc/passwd

root:!:0:0::/:/usr/bin/bash
daemon:!:1:1::/etc:
bin:!:2:2::/bin:

  • ❌ TinyLlama 1.1B:

    “1. root:!:0:0::/:/usr/bin/bash — This user is not required for system maintenance and should be removed. 2. daemon:!:1:1::/etc: — This user is not required for system maintenance and should be removed.”

    Verdict: CATASTROPHIC FAIL. It claimed the “root” user was not required and recommended deleting it. Following this advice would destroy the server.

  • ✅ LFM2.5 1.2B:

    “Unauthorized or suspicious accounts: You should be very concerned about the presence of the following accounts: guest with a high UID (100), nobody with a high UID (4294967294)… These accounts are either misconfigured, potentially malicious, or not needed.”

    Verdict: Pass. It correctly identified actual potential risks like “guest” and “nobody” accounts with high UIDs.
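For comparison, the check LFM2.5 approximated can be expressed deterministically in a few lines. The sample entries and the UID threshold below are illustrative — on a real AIX box you would tune the threshold to your local UID policy.

```python
# A deterministic version of the audit LFM2.5 approximated: flag passwd
# entries with unusually high UIDs. Sample lines and the threshold are
# illustrative assumptions for this sketch.

def flag_high_uids(passwd_text, threshold=99):
    """Return (name, uid) pairs whose UID exceeds the threshold."""
    flagged = []
    for line in passwd_text.strip().splitlines():
        fields = line.split(":")
        name, uid = fields[0], int(fields[2])
        if uid > threshold:
            flagged.append((name, uid))
    return flagged

sample = """root:!:0:0::/:/usr/bin/bash
daemon:!:1:1::/etc:
guest:!:100:100::/home/guest:
nobody:!:4294967294:4294967294::/:"""
print(flag_high_uids(sample))  # -> [('guest', 100), ('nobody', 4294967294)]
```

The point of the gauntlet, of course, is that the LLM produces this kind of triage from a natural-language prompt, without anyone writing the parser — but it is worth knowing the deterministic equivalent when you need an auditable result.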

Round 4: Service Hardening (lssrc -a)

We asked the models to review running services and recommend hardening steps.

The Input:

$ lssrc -a

Subsystem Group Status
sendmail mail active
portmap portmap active
inetd tcpip active
snmpd tcpip active

  • ❌ TinyLlama 1.1B:

    (generated 0 tokens)

    Verdict: Fails. Silence again.

  • ✅ LFM2.5 1.2B:

    “1. sendmail Risk: Running mail daemon can be a target for attacks. Recommendation: Disable sendmail. stopsrc -s sendmail 2. portmap Risk: Running portmap daemon can be abused… stopsrc -s portmap”

    Verdict: Pass. It flagged risky services like sendmail and portmap, and provided the correct AIX command (stopsrc) to disable them.

Why This Matters for IBM Power Users

This benchmark proves that IBM Power Systems are capable AI inference engines for critical, on-premise tasks:

  • Data Sovereignty: Analyze sensitive errpt logs, financial data, or user audits locally. No data leaves your server.
  • Legacy Modernization: Use local LLMs to help understand and document legacy COBOL or C code residing on the server.
  • Efficiency: You don’t need a GPU cluster. You likely already own the hardware capable of doing this.

Try It Yourself

We believe in open source. We have released the AIX port and the converted Big-Endian models.

Code: gitlab.com/librepower/llama-aix
Models: huggingface.co/librepowerai

user@aix:~$ # Quick start on AIX
user@aix:~$ git clone https://gitlab.com/librepower/llama-aix.git
user@aix:~$ ./scripts/build_aix_73.sh

user@aix:~$ # Optimize threading for the "Sweet Spot"
user@aix:~$ smtctl -t 2 -w now

user@aix:~$ # Have fun!