Distributed Storage 2026. A (controversial) technical guide

Another year alive, another year of watching the distributed storage industry outdo itself in commercial creativity. If 2024 was the year everyone discovered they needed “storage for AI” (spoiler: it’s the same old storage, but with better marketing), 2025 has been the year MinIO decided to publicly immolate itself while the rest of the ecosystem continues to evolve apace.

Hold on, the charts are coming.

[Chart: distributed storage market trends and curves, 2026]

The Drama of the Year: MinIO Goes Into “Maintenance Mode” (Read: Abandonment Mode)

If you haven’t been following the MinIO soap opera, let me give you some context. MinIO was the open source object storage that everyone was deploying. Simple, fast, S3 compatible. You had it up and running in 15 minutes. It was the WordPress of object storage.

[Screenshot: MinIO maintenance-mode announcement discussed on Reddit, December 2025]

Well, in December 2025, a silent commit in the README changed everything: “This project is currently under maintenance and is not accepting new changes.” No announcement. No migration guide. No farewell. Just a commit and goodbye.

The community, predictably, went up in flames. One developer summed it up perfectly: “A silent README update just ended the era of MinIO as the default open-source S3 engine.”

But this didn’t come out of the blue. MinIO had been pursuing an “open source but don’t overdo it” strategy for years:

  • 2021: Silent switch from Apache 2.0 to AGPL v3 (no announcement, no PR, no nothing)
  • 2022-2023: Aggressive campaigns against Nutanix and Weka for “license violations”
  • February 2025: Web console, bucket management and replication removed from the Community version
  • October 2025: Stop distributing Docker images
  • December 2025: Maintenance mode

The message is clear: if you want MinIO for real, pay up. Their enterprise AIStor product starts at €96,000/year for 400 TiB. For 1 PB, we are talking about more than €244,000/year.

The lesson? In 2025, “Open Source” without Open Governance is worthless. MinIO was a company with an open source product, not a community project. The difference matters.

In the Meantime, Ceph Continues to Swim Peacefully

While MinIO was self-destructing, Ceph was celebrating its 20th stable release: Tentacle (v20.2.0), released in November 2025. The project now accounts for more than 1 exabyte of storage deployed globally across more than 3,000 clusters.

[Image: Ceph Tentacle (v20.2.0) release mascot, November 2025]

The most interesting thing about Tentacle is FastEC (Fast Erasure Coding), which improves small-read and small-write performance by 2x to 3x. That finally makes erasure coding viable for workloads that aren’t pure cold storage. With a 6+2 EC profile, you can now get roughly 50% of the performance of 3x replication with only a 33% storage overhead (versus 200% for replication).
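If you want to kick the tires, creating an EC pool along those lines only takes a couple of commands. A minimal sketch (pool and profile names are placeholders; size pg_num, k/m and the failure domain for your own topology):

# Placeholder names; adjust to your cluster
ceph osd erasure-code-profile set ec-6-2 k=6 m=2 crush-failure-domain=host
ceph osd pool create mypool-ec 128 128 erasure ec-6-2
# Needed if RBD or CephFS will write to the EC pool
ceph osd pool set mypool-ec allow_ec_overwrites true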

For those of us who have been hearing “erasure coding is slow for production” for years, this is a real game changer.

Other Tentacle news:

  • Integrated SMB support via Samba Manager
  • NVMe/TCP gateway groups with multi-namespace support
  • OAuth 2.0 authentication on the dashboard
  • CephFS case-insensitive directories (finally)
  • ISA-L replaces Jerasure (which was abandoned)

The Crimson OSD (based on Seastar for NVMe optimization) is still in technical preview. It is not production ready, but the roadmap is promising.

The Numbers That Matter

Bloomberg operates more than 100 PB in Ceph clusters. They are a Diamond member of the Ceph Foundation and their Head of Storage Engineering sits on the Governing Board. DigitalOcean runs 54+ PB across 37 production clusters. CERN maintains 50+ PB in more than 10 clusters.

And here’s the interesting part: ZTE Corporation is among the top 3 global contributors to Ceph and number 1 in China. Its TECS CloveStorage product (based on Ceph) is deployed in more than 320 NFV projects worldwide, including China Mobile, izzi Telecom (Mexico) and Deutsche Telekom.

The telco sector is Ceph’s secret superpower. While many are still thinking of traditional appliances, telcos are running Ceph in production on a massive scale.

The Enterprise Ecosystem: Understanding What You’re Buying

This is where it gets interesting. And it’s worth understanding what’s behind each option.

[Image: the enterprise storage market bazaar, 2026]

IBM Fusion: Two Flavors for Different Needs

IBM has two products under the Fusion brand, and it is important to understand the difference:

  • IBM Fusion HCI: Uses IBM Storage Scale ECE (the old GPFS/Spectrum Scale). Parallel file system with distributed erasure coding. Hyperconverged appliance that scales from 6 to 20 nodes.
  • IBM Fusion SDS: Uses OpenShift Data Foundation (ODF), which is based on Ceph packaged by Red Hat.

Storage Scale is a genuinely differentiated technology, especially for HPC. Its parallel file system architecture offers capabilities that Ceph simply doesn’t have: advanced metadata management, integrated tiering, AFM for federation, and more. If you have high-performance computing, supercomputing or AI workloads at serious scale, Storage Scale has solid technical arguments to justify it.

IBM Fusion HCI performance claims are impressive: 90x acceleration on S3 queries with local caching, performance equivalent to Databricks Photon at 60% of the cost, etc.

Now, it’s always worth asking the question: how much of that performance is proprietary technology and how much is simply well-dimensioned hardware with an appropriate configuration? It’s not a criticism, it’s a legitimate question that any architect should ask before making a decision.

In the case of Fusion SDS, you’re buying Ceph with the added value of packaging, OpenShift integration, and IBM enterprise support. For many organizations, that has real value.

Red Hat Ceph Storage: The Enterprise Standard

Red Hat Ceph Storage continues to be the enterprise distribution of choice. They offer 36 months of production support plus optional 24 months of extended support. The product is robust and well integrated.

What you are really buying is: tested and certified packages, 24/7 enterprise support, predictable life cycles, and OpenShift integration.

Is it worth it? It depends on your context. If your organization needs a support contract to meet compliance requirements or simply to sleep easy, probably yes. And we’d be happy to help you with that. But if you have the technical team to operate Ceph upstream, it’s a decision that deserves analysis.

SUSE: A Lesson in Vendor Lock-in

Here’s a story that bears reflection: SUSE completely exited the Ceph enterprise market. Their SUSE Enterprise Storage (SES) product reached end of support in January 2023. After acquiring Rancher Labs in 2020, they pivoted to Longhorn for Kubernetes-native storage.

If you were an SES customer, you found yourself orphaned. Your options were to migrate to Red Hat Ceph Storage, Canonical Charmed Ceph, community Ceph, or find a specialized partner to help you with the transition.

This is not a criticism of SUSE; companies pivot according to their strategy. But it is a reminder that control over your infrastructure has value that doesn’t always show up in TCO.

Pure Storage and NetApp: The Appliance Approach

Pure Storage has created a category called “Unified Fast File and Object” (UFFO) with its FlashBlade family. Impressive hardware, consistent performance, polished user experience. Its FlashBlade//S R2 scales up to 60 PB per cluster with 150 TB DirectFlash Modules.

NetApp StorageGRID 12.0 focuses on AI with 20x throughput improvements via advanced caching and support for more than 600 billion objects in a single cluster.

Both are solid solutions that compete with Ceph RGW in S3-compatible object storage. The performance is excellent. The question each organization must ask itself is whether the premium justifies the vendor lock-in for their specific use case.

The Question No One Asks: What Are You Really Buying?

This is where I put on my thoughtful engineer’s hat.

Ceph upstream is extremely stable. It has 20 releases under its belt. The Ceph Foundation includes IBM, Red Hat, Bloomberg, DigitalOcean, OVHcloud and dozens more. Development is active, the community is strong, and documentation is extensive.

So when does it make sense to pay for enterprise distribution and when does it not?

It makes sense when: your organization requires a support contract for compliance or internal policy reasons; you don’t have the technical staff to operate Ceph and don’t want to build that capability in-house; you need predictable and tested upgrade cycles; the cost of downtime is higher than the cost of the license; or you need specific integration with other vendor products.

It deserves further analysis when: the decision is based on “it’s what everyone does”; no one has really evaluated the alternatives; the main reason is that “the vendor told us that open source is not supported”; or you have a capable technical team but haven’t invested in its training.

The real challenge is knowledge. Ceph has a steep learning curve. Designing a cluster correctly, understanding CRUSH maps, tuning BlueStore, optimizing placement groups… this requires serious training and hands-on experience.

But once you have that knowledge, you have options. You can choose an enterprise vendor judiciously, knowing exactly what value-add you are buying. Or you can operate upstream with specialized support. The key is to make an informed decision.

Demystifying Marketing Claims

One thing I always recommend is to read benchmarks and marketing claims with a constructive critical spirit.

“Our product is 90x faster” – Compared to what baseline? On what specific workload? With what competitor configuration?

“Performance equivalent to [competitor] at 60% of cost” – Does this include full TCO? Licensing, support, training, personnel?

“Enterprise-grade certified solution” – what exactly does that mean? Because Ceph upstream is also enterprise-grade at CERN, Bloomberg, and hundreds of telecoms.

I’m not saying the claims are false. I am saying that context matters. The reality is that distributed storage performance is highly dependent on correct cluster design (failure domains, placement groups), appropriate hardware (25/100GbE network, NVMe with power-loss protection), operating system configuration (IOMMU, CPU governors), and workload specific tuning (osd_memory_target, bluestore settings).
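To make that concrete, the kind of knobs involved look like this. The values below are purely illustrative, not recommendations; size them for your own hardware:

# Illustrative only: raise the per-OSD memory target to 8 GiB
ceph config set osd osd_memory_target 8589934592
# Example BlueStore cache size for HDD-backed OSDs (again, illustrative)
ceph config set osd bluestore_cache_size_hdd 4294967296
# Verify what a given OSD is actually running with
ceph config show osd.0 | grep -E 'osd_memory_target|bluestore_cache'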

A well-designed Ceph cluster operated by experienced people can achieve impressive performance. The Clyso benchmark achieved 1 TiB/s with 68 Dell PowerEdge servers. IBM demonstrated over 450,000 IOPS on a 4-node Ceph cluster with 24 NVMe per node.

Sometimes, that “certified solution” seal you see on a datasheet is, at heart, free software with an expert configuration, well-dimensioned hardware, and a lot of testing. Which has value, but it’s good to know.

The Smart Move: Deciding With Information

After 15 years in this industry, my conclusion is that there is no single answer. What there is are informed decisions.

For some organizations, a packaged enterprise solution is exactly what they need: guaranteed support, predictable release cycles, validated integration, and the peace of mind of having an accountable vendor. IBM Fusion with Storage Scale is an excellent choice for HPC. Red Hat Ceph Storage is solid for anyone who wants Ceph with enterprise backing.

For other organizations, Ceph upstream with specialized support and training offers significant advantages:

  1. Foundation governance: Ceph is a Linux Foundation project with open governance. What happened to MinIO can’t happen here.
  2. Active community: Thousands of contributors, regular releases, bugs fixed quickly.
  3. Flexibility: It’s your cluster, your configuration. If tomorrow you decide to change your support partner, you lose nothing.
  4. Transparent TCO: The software is free. You invest in appropriate hardware and knowledge.
  5. Version control: You update when it makes sense for you, not when the vendor puts out the next packaged release.

The common denominator in both cases is knowledge. Whether you buy an enterprise solution or deploy upstream, understanding Ceph in depth allows you to make better decisions, negotiate better with vendors, and solve problems faster.

Where to Get This Knowledge

Ceph is complex. But there are clear paths:

The official documentation is extensive and much improved. The Ceph blog has excellent technical deep-dives.

Cephalocon is the annual conference where you can learn from those who operate Ceph at full scale (Bloomberg, CERN, DigitalOcean).

Structured training with hands-on labs is the most efficient way to build real competence. You don’t learn Ceph by reading slides; you learn by breaking and fixing clusters.

L3 technical support from people who live Ceph every day gets you out of trouble when things get complicated in production. Because they will. At SIXE, we’ve spent years training technical teams in Ceph and providing L3 support to organizations of all types: those operating upstream, those using enterprise distributions, and those evaluating options. Our Ceph training program covers everything from basic architecture to advanced operations with real hands-on labs. And if you already have Ceph in production and need real technical support, our specialized technical support is designed for exactly that.

In addition, we have just launched a certification program with badges on Credly so that your team can demonstrate their competencies in a tangible way. Because in this industry, “knows Ceph” doesn’t mean the same thing to everyone.

Conclusions for 2026

  1. MinIO is dead for serious use. Look for alternatives. Ceph RGW, SeaweedFS, or even the OpenMaxIO fork if you are brave.
  2. Understand what you are buying. There are cases where a packaged enterprise solution brings real value. There are others where you are mainly paying for a logo and a configuration that you could replicate.
  3. Ceph upstream is mature and production-ready. Bloomberg, DigitalOcean, CERN and 320+ telco projects can’t all be wrong.
  4. The true cost of distributed storage is knowledge. Invest in quality training and support, regardless of which option you choose.
  5. Control over your infrastructure has value. Ask SUSE SES customers how it went when the vendor decided to pivot.
  6. The governance of the project matters as much as the technology. Open Foundation > company with open source product.

2026 looks interesting. FastEC is going to change the erasure coding equation. Integration with AI and ML will continue to push for more performance. And the vendor ecosystem will continue to evolve with proposals that deserve serious evaluation.

You decide. This is the only important thing.

Porting MariaDB to IBM AIX (Part 2): how AIX matches Linux

From “AIX is Slow” to “AIX Matches Linux” (with the right tools and code)

In Part 1, I wrestled with CMake, implemented a thread pool from scratch, and shipped a stable MariaDB 11.8.5 for AIX. The server passed 1,000 concurrent connections, 11 million queries, and zero memory leaks.

Then I ran a vector search benchmark.

AIX: 42 queries per second.
Linux (same hardware): 971 queries per second.

Twenty-three times slower. On identical IBM Power S924 hardware. Same MariaDB version. Same dataset.

This is the story of how we discovered there was no performance gap at all — just configuration mistakes and a suboptimal compiler.

Chapter 1: The Sinking Feeling

There’s a particular kind of despair that comes from seeing a 23x performance gap on identical hardware. It’s the “maybe I should have become a florist” kind of despair.

Let me set the scene: both machines are LPARs running on IBM Power S924 servers with POWER9 processors at 2750 MHz. Same MariaDB 11.8.5. Same test dataset — 100,000 vectors with 768 dimensions, using MariaDB’s MHNSW (Hierarchical Navigable Small World) index for vector search.

The benchmark was simple: find the 10 nearest neighbors to a query vector. The kind of operation that powers every AI-enhanced search feature you’ve ever used.

Linux did it in about 1 millisecond. AIX took 24 milliseconds.

My first instinct was denial. “The benchmark must be wrong.” It wasn’t. “Maybe the index is corrupted.” It wasn’t. “Perhaps the network is slow.” It was a local socket connection.

Time to dig in.

Chapter 2: The First 65x — Configuration Matters

The Cache That Forgot Everything

The first clue came from MariaDB’s profiler. Every single query was taking the same amount of time, whether it was the first or the hundredth. That’s not how caches work.

I checked MariaDB’s MHNSW configuration:

SHOW VARIABLES LIKE 'mhnsw%';
mhnsw_max_cache_size: 16777216

16 MB. Our vector graph needs about 300 MB to hold the HNSW structure in memory.

Here’s the kicker: when the cache fills up, MariaDB doesn’t evict old entries (no LRU). It throws everything away and starts fresh. Every. Single. Query.

Imagine a library where, when the shelves get full, the librarian burns all the books and orders new copies. For every patron.

Fix: set mhnsw_max_cache_size = 4GB in the server configuration.

Result: 42 QPS → 112 QPS. 2.7x improvement from one config line.
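For reference, that one line looks like this in the server configuration (a sketch; the section name and exact file depend on how your MariaDB is packaged):

[mariadbd]
# 4 GiB in bytes; the default is 16 MiB (16777216)
mhnsw_max_cache_size = 4294967296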

The Page Size Problem

AIX defaults to 4 KB memory pages. Linux on POWER uses 64 KB pages.

For MHNSW’s access pattern — pointer-chasing across a 300 MB graph — this matters enormously. With 4 KB pages, you need 16x more TLB (Translation Lookaside Buffer) entries to map the same amount of memory. TLB misses are expensive.

Think of it like navigating a city. With 4 KB pages, you need directions for every individual building. With 64 KB pages, you get directions by neighborhood. Much faster when you’re constantly jumping around.

Fix: Wrapper script that sets LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K

Result: 112 QPS → 208 QPS sequential, and 2,721 QPS with 12 parallel workers.
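The wrapper itself is tiny. Something like this is enough (the binary path is an assumption; adjust it to your install), and you can check which page sizes a running process actually got with svmon -P <pid>:

#!/bin/sh
# Ask the AIX loader for 64K pages for data, text, stack and shared memory,
# then exec the real server binary
LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K
export LDR_CNTRL
exec /opt/mariadb/bin/mariadbd "$@"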

The Scoreboard After Phase 1

Configuration    Sequential QPS    With 12 Workers
Baseline         42                ~42
+ 4GB cache      112
+ 64K pages      208               2,721

65x improvement from two configuration changes. No code modifications.

But we were still 6x slower than Linux per-core. The investigation continued.

Chapter 3: The CPU vs Memory Stall Mystery

With configuration fixed, I pulled out the profiling tools. MariaDB has a built-in profiler that breaks down query time by phase.

AIX:

Sending data: 4.70ms total
  - CPU_user: 1.41ms
  - CPU_system: ~0ms
  - Stalls: 3.29ms (70% of total!)

Linux:

Sending data: 0.81ms total
  - CPU_user: 0.80ms
  - Stalls: ~0.01ms (1% of total)

The CPU execution time was 1.8x slower on AIX — explainable by compiler differences. But the memory stalls were 329x worse.

The Root Cause: Hypervisor Cache Invalidation

Here’s something that took me two days to figure out: in a shared LPAR (Logical Partition), the POWER hypervisor periodically preempts virtual processors to give time to other partitions. When it does this, it may invalidate L2/L3 cache lines.

MHNSW’s graph traversal is pointer-chasing across 300 MB of memory — literally the worst-case scenario for cache invalidation. You’re jumping from node to node, each in a different part of memory, and the hypervisor is periodically flushing your cache.

It’s like trying to read a book while someone keeps closing it and putting it back on the shelf.

The Linux system had dedicated processors. The AIX system was running shared. Not apples to apples.
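If you want to check which situation you are in before comparing numbers, lparstat reports the partition type and capping mode (field names vary slightly between AIX levels):

# Quick sanity check: shared vs dedicated, capped vs uncapped, SMT, entitlement
lparstat -i | egrep -i 'type|mode|smt|entitled'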

But before I could test dedicated processors, I needed to fix the compiler problem.

Chapter 4: The Compiler Odyssey

Everything I Tried With GCC (And Why It Failed)

Attempt                           Result              Why
-flto (Link Time Optimization)    Impossible          GCC LTO requires ELF format; AIX uses XCOFF
-fprofile-generate (PGO)          Build fails         TOC-relative relocation assembler errors
-ffast-math                       Breaks everything   IEEE float violations corrupt bloom filter hashing
-funroll-loops                    Slower              Instruction cache bloat — POWER9 doesn’t like it
-finline-functions                Slower              Same I-cache problem

The AIX Toolbox GCC is built without LTO support. It’s not a flag you forgot — it’s architecturally impossible because GCC’s LTO implementation requires ELF, and AIX uses XCOFF.

Ubuntu’s MariaDB packages use -flto=auto. That optimization simply doesn’t exist for AIX with GCC.

IBM Open XL: The Plot Twist

At this point, I’d spent three days trying to make GCC faster. Time to try something different.

IBM Open XL C/C++ 17.1.3 is IBM’s modern compiler, based on LLVM/Clang. It generates significantly better code for POWER9 than GCC.

Building MariaDB with Open XL required solving five different problems:

  1. Missing HTM header: Open XL doesn’t have GCC’s htmxlintrin.h. I created a stub.
  2. 32-bit AR by default: AIX tools default to 32-bit. Set OBJECT_MODE=64.
  3. Incompatible LLVM AR: Open XL’s AR couldn’t handle XCOFF. Used system /usr/bin/ar.
  4. OpenSSL conflicts: Used -DWITH_SSL=system to avoid bundled wolfSSL issues.
  5. Missing library paths: Explicit -L/opt/freeware/lib for the linker.
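Putting those five fixes together, the configure step ended up looking roughly like the sketch below. The driver names (ibm-clang_r / ibm-clang++_r) and the /opt/freeware paths are assumptions about a typical Open XL plus AIX Toolbox install; adjust them to your system:

# Sketch only; driver names and paths are assumptions
export OBJECT_MODE=64                     # 64-bit object mode for ar/ld
export CC=ibm-clang_r CXX=ibm-clang++_r   # Open XL clang-based drivers
# System ar instead of LLVM ar, system OpenSSL, Toolbox lib path for the linker
cmake .. \
  -DCMAKE_AR=/usr/bin/ar \
  -DWITH_SSL=system \
  -DCMAKE_C_FLAGS="-O3 -mcpu=power9" \
  -DCMAKE_CXX_FLAGS="-O3 -mcpu=power9" \
  -DCMAKE_EXE_LINKER_FLAGS="-L/opt/freeware/lib" \
  -DCMAKE_SHARED_LINKER_FLAGS="-L/opt/freeware/lib"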

Then I ran the benchmark:

Compiler          30 Queries    Per-Query
GCC 13.3.0        0.190s        6.3ms
Open XL 17.1.3    0.063s        2.1ms

Three times faster. Same source code. Same optimization flags (-O3 -mcpu=power9).

And here’s the bonus: GCC’s benchmark variance was 10-40% between runs. Open XL’s variance was under 2%. Virtually no jitter.

Why Such a Huge Difference?

Open XL (being LLVM-based) has:

  • Better instruction scheduling for POWER9’s out-of-order execution
  • Superior register allocation
  • More aggressive optimization passes

GCC’s POWER/XCOFF backend simply isn’t as mature. The AIX Toolbox GCC is functional, but it’s not optimized for performance-critical workloads.

Chapter 5: The LTO and PGO Dead Ends

Hope springs eternal. Maybe Open XL’s LTO and PGO would work?

LTO: The Irony

Open XL supports -flto=full on XCOFF. It actually builds! But…

Result: 27% slower than non-LTO Open XL.

Why? AIX shared libraries require an explicit export list (exports.exp). With LTO, CMake’s script saw ~27,000 symbols to export.

LTO’s main benefit is internalizing functions — marking them as local so they can be optimized away or inlined. When you’re forced to export 27,000 symbols, none of them can be internalized. The LTO overhead (larger intermediate files, slower link) remains, but the benefit disappears.

It’s like paying for a gym membership and then being told you can’t use any of the equipment.

PGO: The Profiles That Never Were

Profile-Guided Optimization sounded promising:

  1. Build with -fprofile-generate
  2. Run training workload
  3. Rebuild with -fprofile-use
  4. Enjoy faster code
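With a clang-based toolchain like Open XL, that flow is normally driven with the standard LLVM flags. A rough sketch (the binary path and profile directory are placeholders):

# 1. Instrumented build
CFLAGS="-fprofile-generate=/tmp/pgo" CXXFLAGS="-fprofile-generate=/tmp/pgo" cmake .. && make
# 2. Training run; .profraw files are supposed to appear in /tmp/pgo on clean shutdown
LLVM_PROFILE_FILE=/tmp/pgo/mariadbd-%p.profraw ./sql/mariadbd --datadir=/path/to/data
# 3. Merge the profiles and rebuild with them
llvm-profdata merge -o /tmp/pgo/default.profdata /tmp/pgo/*.profraw
CFLAGS="-fprofile-use=/tmp/pgo/default.profdata" CXXFLAGS="-fprofile-use=/tmp/pgo/default.profdata" cmake .. && make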

Step 1 worked. Step 2… the profiles never appeared.

I manually linked the LLVM profiling runtime into the shared library. Still no profiles.

The root cause: LLVM’s profiling runtime uses atexit() or __attribute__((destructor)) to write profiles on exit. On AIX with XCOFF, shared library destructor semantics are different from ELF. The handler simply isn’t called reliably for complex multi-library setups like MariaDB.

Simple test cases work. Real applications don’t.

Chapter 6: The LPAR Revelation

Now I had a fast compiler. Time to test dedicated processors and eliminate the hypervisor cache invalidation issue.

The Test Matrix

LPAR Config            GCC       Open XL
12 shared vCPUs        0.190s    0.063s
12 dedicated capped    0.205s    0.082s
21 dedicated capped    0.320s    0.067s

Wait. Shared is faster than dedicated?

The WoF Factor

POWER9 has a feature called Workload Optimized Frequency (WoF). In shared mode with low utilization, a single core can boost to ~3.8 GHz. Dedicated capped processors are locked at 2750 MHz.

For a single-threaded query, shared mode gets 38% more clock speed. That beats the cache invalidation penalty for this workload.

Think of it like choosing between a sports car on a highway with occasional traffic (shared) versus a truck with a reserved lane but a speed limit (dedicated capped).

The PowerVM Donating Mode Disaster

There’s a third option: dedicated processors in “Donating” mode, which donates idle cycles back to the shared pool.

Mode        GCC       Open XL
Capped      0.205s    0.082s
Donating    0.325s    0.085s

60% regression with GCC.

Every time a query bursts, there’s latency reclaiming the donated cycles. For bursty, single-threaded workloads like database queries, this is devastating.

Recommendation: Never use Donating mode for database workloads.

The 21-Core Sweet Spot

With 21 dedicated cores (versus Linux’s 24), Open XL achieved 0.067s — nearly matching the 0.063s from shared mode. The extra L3 cache from more cores compensates for the lack of WoF frequency boost.

Chapter 7: The Final Scoreboard (Plot Twist)

Fresh benchmarks on identical POWER9 hardware, January 2026:

Platform         Cores           30 Queries
Linux            24 dedicated    0.057s
AIX + Open XL    12 shared       0.063s
AIX + Open XL    21 dedicated    0.067s
AIX + GCC        12 shared       0.190s
AIX + GCC        21 dedicated    0.320s

Wait. The AIX system has 21 cores vs Linux’s 24. That’s 12.5% fewer cores, which means 12.5% less L3 cache.

The measured “gap”? 10-18%.

That’s not a performance gap. That’s a hardware difference.

With IBM Open XL, AIX delivers identical per-core performance to Linux. The 23x gap we started with? It was never about AIX being slow. It was:

  1. A misconfigured cache (16MB instead of 4GB)
  2. Wrong page sizes (4KB instead of 64KB)
  3. The wrong compiler (GCC instead of Open XL)

The “AIX is slow” myth is dead.

The Complete Failure Museum

Science isn’t just about what works — it’s about documenting what doesn’t. Here’s our wall of “nice try, but no”:

What We Tried                   Result            Notes
mhnsw_max_cache_size = 4GB      5x faster         Eliminates cache thrashing
LDR_CNTRL 64K pages             ~40% faster       Reduces TLB misses
MAP_ANON_64K mmap patch         ~8% faster        Minor TLB improvement
IBM Open XL 17.1.3              3x faster         Better POWER9 codegen
Shared LPAR (vs dedicated)      ~25% faster       WoF frequency boost
Open XL + LTO                   27% slower        AIX exports conflict
Open XL + PGO                   Doesn’t work      Profiles not written
GCC LTO                         Impossible        XCOFF not supported
GCC PGO                         Build fails       TOC relocation errors
-ffast-math                     Breaks MHNSW      Float corruption
-funroll-loops                  Worse             I-cache bloat
POWER VSX bloom filter          41% slower        No 64-bit vec multiply on P9
Software prefetch               No effect         Hypervisor evicts prefetched data
DSCR tuning                     Blocked           Hypervisor controls DSCR in shared LPAR
Donating mode                   60% regression    Never use for databases

The VSX result is particularly interesting: we implemented a SIMD bloom filter using POWER’s vector extensions. It was 41% slower than scalar. POWER9 has no 64-bit vector multiply — you need vec_extract → scalar multiply → vec_insert for each lane, which is slower than letting the Out-of-Order engine handle a scalar loop.

What I Learned

1. Defaults Matter More Than You Think

A 16 MB cache default turned sub-millisecond queries into 24ms queries. That’s a 24x penalty from one misconfigured parameter.

When you’re porting software, question every default. What works on Linux might not work on your platform.

2. The “AIX is Slow” Myth Was Always a Toolchain Issue

With GCC, we were 3-4x slower than Linux. With Open XL, we match Linux per-core.

The platform was never slow. The default toolchain just wasn’t optimized for performance-critical workloads. Choose the right compiler.

3. Virtualization Has Hidden Trade-offs

Shared LPAR can be faster than dedicated for single-threaded workloads (WoF frequency boost). Dedicated is better for sustained multi-threaded throughput. Donating mode is a trap.

Know your workload. Choose your LPAR configuration accordingly.

4. Not Every Optimization Ports

LTO, PGO, and SIMD vectorization all failed on AIX for various reasons. The techniques that make Linux fast don’t always translate.

Sometimes the “obvious” optimization is the wrong choice. Measure everything.

5. Sometimes There’s No Gap At All

We spent days investigating a “performance gap” that turned out to be:

  • Configuration mistakes
  • Wrong compiler
  • Fewer cores on the test system

The lesson: verify your baselines. Make sure you’re comparing apples to apples before assuming there’s a problem to solve.

Recommendations

For AIX MariaDB Users

  1. Use the Open XL build (Release 3, coming soon)
  2. Set mhnsw_max_cache_size to at least 4GB for vector search
  3. Keep shared LPAR for single-query latency
  4. Never use Donating mode for databases
  5. Use 64K pages via the LDR_CNTRL wrapper

For Upstream MariaDB

  1. Increase mhnsw_max_cache_size default — 16MB is far too small
  2. Implement LRU eviction — discarding the entire cache on overflow is brutal
  3. Don’t add POWER VSX bloom filter — scalar is faster on POWER9

What’s Next

The RPMs are published at aix.librepower.org. Release 2 includes the configuration fixes. Release 3 with Open XL build is also available.

Immediate priorities:

  • Commercial Open XL license: Evaluation expires soon. Need to verify with IBM if we are ok using xLC for this purpose.
  • Native AIO implementation: AIX has POSIX AIO and Windows-compatible IOCP. Time to write the InnoDB backend.
  • Upstream MHNSW feedback: The default mhnsw_max_cache_size of 16MB is too small for real workloads; we’ll suggest a larger default.

For organizations already running mission-critical workloads on AIX — and there are many, from banks to airlines to healthcare systems — the option to also run modern, high-performance MariaDB opens new possibilities.

AIX matches Linux. The myth is dead. And MariaDB on AIX is ready for production.

TL;DR

  • Started with 23x performance gap (42 QPS vs 971 QPS)
  • Fixed cache config: 5x improvement
  • Fixed page size: ~40% more
  • Switched to IBM Open XL: 3x improvement over GCC
  • Used shared LPAR: ~25% faster than dedicated (WoF boost)
  • Final result: NO GAP — 10% difference = 12.5% fewer cores (21 vs 24)
  • AIX matches Linux per-core performance with Open XL
  • Open XL LTO: doesn’t help (27% slower)
  • Open XL PGO: doesn’t work (AIX XCOFF issue)
  • POWER VSX SIMD: 41% slower than scalar (no 64-bit vec multiply)
  • Donating mode: 60% regression — never use for databases
  • “AIX is slow for Open Source DBs” was always a toolchain myth

Questions? Ideas? Running MariaDB on AIX and want to share your experience?

This work is part of LibrePower – Unlocking IBM Power Systems through open source. Unmatched RAS. Superior TCO. Minimal footprint 🌍

LibrePower AIX project repository: gitlab.com/librepower/aix

Porting MariaDB to IBM AIX (Part 1): 3 Weeks of Engineering Pain

Bringing MariaDB to AIX, the Platform That Powers the World’s Most Critical Systems

There are decisions in life you make knowing full well they’ll cause you some pain. Getting married. Having children. Running a marathon. Porting MariaDB 11.8 to IBM AIX.

This (Part 1) is the story of the last one — and why I’d do it again in a heartbeat.

Chapter 1: “How Hard Can It Be?”

It all started with an innocent question during a team meeting: “Why don’t we have MariaDB on our AIX systems?”

Here’s the thing about AIX that people who’ve never worked with it don’t understand: AIX doesn’t mess around. When banks need five-nines uptime for their core banking systems, they run AIX. When airlines need reservation systems that cannot fail, they run AIX. When Oracle, Informix, or DB2 need to deliver absolutely brutal performance for mission-critical OLTP workloads, they run on AIX.

AIX isn’t trendy. AIX doesn’t have a cool mascot. AIX won’t be the subject of breathless tech blog posts about “disruption.” But when things absolutely, positively cannot fail — AIX is there, quietly doing its job while everyone else is busy rebooting their containers.

So why doesn’t MariaDB officially support AIX? Simple economics: the open source community has centered on Linux, and porting requires platform-specific expertise. MariaDB officially supports Linux, Windows, FreeBSD, macOS, and Solaris. AIX isn’t on the list — not because it’s a bad platform, but because no one had done the work yet.

At LibrePower, that’s exactly what we do.

My first mistake was saying out loud: “It’s probably just a matter of compiling it and adjusting a few things.”

Lesson #1: When someone says “just compile it” about software on AIX, they’re about to learn a lot about systems programming.

Chapter 2: CMake and the Three Unexpected Guests

Day one of compilation was… educational. CMake on AIX is like playing cards with someone who has a very different understanding of the rules — and expects you to figure them out yourself.

The Ghost Function Bug

AIX has an interesting characteristic: it declares functions in headers for compatibility even when those functions don’t actually exist at runtime. It’s like your GPS saying “turn right in 200 meters” but the street is a brick wall.

CMake does a CHECK_C_SOURCE_COMPILES to test if pthread_threadid_np() exists. The code compiles. CMake says “great, we have it!” The binary starts and… BOOM. Symbol not found.

Turns out pthread_threadid_np() is macOS-only. AIX declares it in headers because… well, I’m still not entirely sure. Maybe for some POSIX compatibility reason that made sense decades ago? Whatever the reason, GCC compiles it happily, and the linker doesn’t complain until runtime.

Same story with getthrid(), which is OpenBSD-specific.

The fix:

IF(NOT CMAKE_SYSTEM_NAME MATCHES "AIX")
  CHECK_C_SOURCE_COMPILES("..." HAVE_PTHREAD_THREADID_NP)
ELSE()
  SET(HAVE_PTHREAD_THREADID_NP 0)  # Trust but verify... okay, just verify
ENDIF()

poll.h: Hide and Seek

AIX has <sys/poll.h>. It’s right there. You can cat it. But CMake doesn’t detect it.

After three hours debugging a “POLLIN undeclared” error in viosocket.c, I discovered the solution was simply forcing the define:

cmake ... -DHAVE_SYS_POLL_H=1

Three hours. For one flag.

(To be fair, this is a CMake platform detection issue, not an AIX issue. CMake’s checks assume Linux-style header layouts.)

The Cursed Plugins

At 98% compilation — 98%! — the wsrep_info plugin exploded with undefined symbols. Because it depends on Galera. Which we’re not using. But CMake compiles it anyway.

Also S3 (requires Aria symbols), Mroonga (requires Groonga), and RocksDB (deeply tied to Linux-specific optimizations).

Final CMake configuration:

-DPLUGIN_MROONGA=NO -DPLUGIN_ROCKSDB=NO -DPLUGIN_SPIDER=NO 
-DPLUGIN_TOKUDB=NO -DPLUGIN_OQGRAPH=NO -DPLUGIN_S3=NO -DPLUGIN_WSREP_INFO=NO

It looks like surgical amputation, but it’s actually just trimming the fat. These plugins are edge cases that few deployments need.

Chapter 3: Thread Pool, or How I Learned to Stop Worrying and Love the Mutex

This is where things got interesting. And by “interesting” I mean “I nearly gave myself a permanent twitch.”

MariaDB has two connection handling modes:

  • one-thread-per-connection: One thread per client. Simple. Scales like a car going uphill.
  • pool-of-threads: A fixed pool of threads handles all connections. Elegant. Efficient. And not available on AIX.

Why? Because the thread pool requires platform-specific I/O multiplexing APIs:

Platform         API            Status
Linux            epoll          Supported
FreeBSD/macOS    kqueue         Supported
Solaris          event ports    Supported
Windows          IOCP           Supported
AIX              pollset        Not supported (until now)

So… how hard can implementing pollset support be?

(Editor’s note: At this point the author required a 20-minute break and a beverage)

The ONESHOT Problem

Linux epoll has a wonderful flag called EPOLLONESHOT. It guarantees that a file descriptor fires events only once until you explicitly re-arm it. This prevents two threads from processing the same connection simultaneously.

AIX pollset is level-triggered. Only level-triggered. No options. If data is available, it reports it. Again and again and again. Like a helpful colleague who keeps reminding you about that email you haven’t answered yet.

Eleven Versions of Increasing Wisdom

What followed were eleven iterations of code, each more elaborate than the last, trying to simulate ONESHOT behavior:

v1-v5 (The Age of Innocence)

I tried modifying event flags with PS_MOD. “If I change the event to 0, it’ll stop firing,” I thought. Spoiler: it didn’t stop firing.

v6-v7 (The State Machine Era)

“I know! I’ll maintain internal state and filter duplicate events.” The problem: there’s a time window between the kernel giving you the event and you updating your state. In that window, another thread can receive the same event.

v8-v9 (The Denial Phase)

“I’ll set the state to PENDING before processing.” It worked… sort of… until it didn’t.

v10 (Hope)

Finally found the solution: PS_DELETE + PS_ADD. When you receive an event, immediately delete the fd from the pollset. When you’re ready for more data, add it back.

// On receiving events: REMOVE
for (i = 0; i < ret; i++) {
    pctl.cmd = PS_DELETE;
    pctl.fd = native_events[i].fd;
    pollset_ctl(pollfd, &pctl, 1);
}

// When ready: ADD
pce.command = PS_ADD;
pollset_ctl_ext(pollfd, &pce, 1);

It worked! With -O2.

With -O3: segfault.

The Dark Night of the Soul (The -O3 Bug)

Picture my face. I have code working perfectly with -O2. I enable -O3 for production benchmarks and the server crashes with “Got packets out of order” or a segfault in CONNECT::create_thd().

I spent two days thinking it was a compiler bug. GCC 13.3.0 on AIX. I blamed the compiler. I blamed the linker. I blamed everything except my own code.

The problem was subtler: MariaDB has two concurrent code paths calling io_poll_wait on the same pollset:

  • The listener blocks with timeout=-1
  • The worker calls with timeout=0 for non-blocking checks

With -O2, the timing was such that these rarely collided. With -O3, the code was faster, collisions happened more often, and boom — race condition.

v11 (Enlightenment)

The fix was a dedicated mutex protecting both pollset_poll and all pollset_ctl operations:

static pthread_mutex_t pollset_mutex = PTHREAD_MUTEX_INITIALIZER;

int io_poll_wait(...) {
    int ret;
    pthread_mutex_lock(&pollset_mutex);
    ret = pollset_poll(pollfd, native_events, max_events, timeout);
    // ... process and delete the returned events while still holding the lock ...
    pthread_mutex_unlock(&pollset_mutex);
    return ret;
}

Yes, it serializes pollset access. Yes, that’s theoretically slower. But you know what’s even slower? A server that crashes.

The final v11 code passed 72 hours of stress testing with 1,000 concurrent connections. Zero crashes. Zero memory leaks. Zero “packets out of order.”

Chapter 4: The -blibpath Thing (Actually a Feature)

One genuine AIX characteristic: you need to explicitly specify the library path at link time with -Wl,-blibpath:/your/path. If you don’t, the binary won’t find libstdc++ even if it’s in the same directory.
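In practice that means passing the path explicitly at link time, for example through the CMake linker flags. The paths below are the usual Toolbox and system locations; point them at wherever your libstdc++ actually lives:

# Embed an explicit runtime library search path in the binaries
cmake .. \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,-blibpath:/opt/freeware/lib:/usr/lib:/lib" \
  -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-blibpath:/opt/freeware/lib:/usr/lib:/lib"
# Inspect what a 64-bit binary ended up with
dump -X64 -H ./sql/mariadbd | head -30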

At first this seems annoying. Then you realize: AIX prefers explicit, deterministic paths over implicit searches. In production environments where “it worked on my machine” isn’t acceptable, that’s a feature, not a bug.

Chapter 5: Stability — The Numbers That Matter

After all this work, where do we actually stand?

The RPM is published at aix.librepower.org and deployed on an IBM POWER9 system (12 cores, SMT-8). MariaDB 11.8.5 runs on AIX 7.3 with thread pool enabled. The server passed a brutal QA suite:

Test                          Result
100 concurrent connections    Passed
500 concurrent connections    Passed
1,000 connections             Passed
30 minutes sustained load     Passed
11+ million queries           Passed
Memory leaks                  ZERO
1,648,482,400 bytes of memory — constant across 30 minutes. Not a single byte of drift. The server ran for 39 minutes under continuous load and performed a clean shutdown.

It works. It’s stable. It’s production-ready for functionality.

Thread Pool Impact

The thread pool work delivered massive gains for concurrent workloads:

Configuration                             Mixed 100 clients    vs. Baseline
Original -O2 one-thread-per-connection    11.34s
-O3 + pool-of-threads v11                 1.96s                83% faster
For high-concurrency OLTP workloads, this is the difference between “struggling” and “flying.”

What I Learned (So Far)

  1. CMake assumes Linux. On non-Linux systems, manually verify that feature detection is correct. False positives will bite you at runtime.
  2. Level-triggered I/O requires discipline. EPOLLONESHOT exists for a reason. If your system doesn’t have it, prepare to implement your own serialization.
  3. -O3 exposes latent bugs. If your code “works with -O2 but not -O3,” you have a race condition. The compiler is doing its job; the bug is yours.
  4. Mutexes are your friend. Yes, they have overhead. But you know what has more overhead? Debugging race conditions at 3 AM.
  5. AIX rewards deep understanding. It’s a system that doesn’t forgive shortcuts, but once you understand its conventions, it’s predictable and robust. There’s a reason banks still run it — and will continue to for the foreseeable future.
  6. The ecosystem matters. Projects like linux-compat from LibrePower make modern development viable on AIX. Contributing to that ecosystem benefits everyone.

What’s Next: The Performance Question

The server is stable. The thread pool works. But there’s a question hanging in the air that I haven’t answered yet:

How fast is it compared to Linux?

I ran a vector search benchmark — the kind of operation that powers AI-enhanced search features. MariaDB’s MHNSW (Hierarchical Navigable Small World) index, 100,000 vectors, 768 dimensions.

Linux on identical POWER9 hardware: 971 queries per second.

AIX with our new build: 42 queries per second.

Twenty-three times slower.

My heart sank. Three weeks of work, and we’re 23x slower than Linux? On identical hardware?

But here’s the thing about engineering: when numbers don’t make sense, there’s always a reason. And sometimes that reason turns out to be surprisingly good news.

In Part 2, I’ll cover:

  • How we discovered the 23x gap was mostly a configuration mistake
  • The compiler that changed everything
  • Why “AIX is slow” turned out to be a myth
  • The complete “Failure Museum” of optimizations that didn’t work

The RPMs are published at aix.librepower.org. The GCC build is stable and production-ready for functionality.

But the performance story? That’s where things get really interesting.

Part 2 coming soon.

TL;DR

  • MariaDB 11.8.5 now runs on AIX 7.3 with thread pool enabled
  • First-ever thread pool implementation for AIX using pollset (11 iterations to get ONESHOT simulation right)
  • Server is stable: 1,000 connections, 11M+ queries, zero memory leaks
  • Thread pool delivers 83% improvement for concurrent workloads
  • Initial vector search benchmark shows 23x gap vs Linux — but is that the whole story?
  • RPMs published at aix.librepower.org
  • Part 2 coming soon: The performance investigation

Questions? Ideas? Want to contribute to the AIX open source ecosystem?

This work is part of LibrePower – Unlocking IBM Power Systems through open source. Unmatched RAS. Superior TCO. Minimal footprint 🌍

LibrePower AIX project repository: gitlab.com/librepower/aix

🦙 LLMs on AIX: technical experimentation beyond the GPU hype

At LibrePower, we have published Llama-AIX: a proof-of-concept for running lightweight LLM inference directly on AIX, using only CPU and memory—no GPUs involved.

It’s worth clarifying from the start: this is technical fun and experimentation. It is not a product, not a commercial promise, and not an alternative to large GPU-accelerated AI platforms.

That said, there is a sound technical foundation behind this experiment.

Not all LLM use cases are GPU-bound.

In many common business scenarios in Power environments:

  • RAG (Retrieval Augmented Generation)
  • Questions about internal documentation
  • On-prem technical assistants
  • Semantic search on own knowledge
  • Text analytics with strong dependence on latency and proximity to data

the bottleneck is not always raw compute power, but:

  • CPU
  • Memory bandwidth
  • Data access latency
  • Data locality

In these cases, small, well-bounded inference workloads can reasonably run without GPUs, especially when the model is not the center of the system but just another piece.

⚙️ CPU, MMA and low-power accelerators

The natural evolution is not only about GPUs:

  • Increasingly vectorized CPUs
  • Extensions such as MMA (Matrix Math Assist)
  • Specific, low-power accelerators (such as the future Spyre)
  • Closer integration with the operating system and data stack

This type of acceleration is especially relevant on Power architectures, where the design prioritizes sustained throughput, consistency and reliability, not just peak FLOPS.

🧩 Why AIX?

Running this on AIX is not a necessity; it is a conscious choice to:

  • Understand the real limits
  • Explore technical feasibility
  • Dismantle simplistic assumptions
  • Learn how LLMs fit into existing Power systems

Many Power customers operate stable, amortized and critical infrastructures, where moving data to the cloud or introducing GPUs is not always desirable or feasible.

🔍 What Llama-AIX is (and isn’t)

  • ✔ A technical PoC
  • ✔ An honest exploration
  • ✔ An engineering exercise
  • ✔ Open source
  • ✖ Not a benchmark
  • ✖ Not a complete AI platform
  • ✖ Not intended to compete with GPU solutions
  • ✖ Not “AI marketing”.

The idea is simple: look beyond the hype, understand the nuances and assess where LLMs bring real value in Power and AIX environments.

Purely out of technical curiosity.

And because experimenting is still a fundamental part of engineering.

💬 In what specific use case would an on-prem LLM in Power make sense to you?

#LibrePower #AIX #IBMPower #LLM #RAG #OpenSource #EnterpriseArchitecture #AIOnPrem

We launched LibrePower (and this is its Manifesto)

We want to unleash the full potential of IBM Power

We build community to grow the Power ecosystem – more solutions, more users, more value.

The most capable hardware you’re not yet taking full advantage of

IBM Power underpins mission-critical computing around the world. Banking transactions, flight reservations, hospital systems, SAP installations – workloads that can’t afford to fail run on Power.

This is no coincidence.

Power systems offer legendary reliability thanks to a RAS (Reliability, Availability, Serviceability) architecture that x86 simply cannot match. They run trouble-free for 10, 15 years or more. Their design – large caches, SMT8, extraordinary memory bandwidth – is built to sustain performance at scale.

But there is an opportunity that most organizations are missing:

Power can do much more than what is usually asked of it.

The capacity is there. The potential is enormous. What has been missing is independent validation, momentum from the community and accessible tools that open the door to new use cases.

Exactly what LibrePower is building.


What is LibrePower?

LibrePower is a community initiative to extend what is possible in IBM Power – across the entire ecosystem:

  • AIX – The veteran Unix that runs the most demanding enterprise workloads
  • IBM i – The integrated system that thousands of companies around the world run on
  • Linux on Power (ppc64le) – Ubuntu, Rocky, RHEL, SUSE, Fedora on POWER Architecture

We are not here to replace anything. We come to add:

  • More certified solutions running on Power
  • More developers and administrators relying on the platform
  • More reasons to invest in Power – both in new and existing equipment

What we do

1. LibrePower Certified – independent validation

ISVs and open source projects need to know that their software works on Power. Buyers need confidence before deploying. IBM certification has its value, but there is room for independent community-driven validation.

LibrePower Certified offers three levels:

Level                      Meaning
Works on Power             Compiles and runs correctly on ppc64le. Basic validation.
Optimized for Power        Tuned for SMT, VSX/MMA. Includes performance data.
🏆 LibrePower Certified    Full validation + case study + ongoing support.

Free for open source projects. Premium levels for ISVs looking for deeper collaboration.

2. Open source repositories

We compile, package and distribute software that the Power community needs:

  • AIX (aix.librepower.org): modern CLI tools, security utilities, compatibility layers
  • ppc64le: container tools (AWX, Podman), development utilities, monitoring
  • IBM i: open source integration (under development)

Everything is free. Everything is documented. Everything is in GitLab.

3. Performance testing and optimization

Power hardware has unique features that most software does not take advantage of. We benchmark, identify opportunities and work with upstream projects to improve performance on Power.

When we find optimizations, we contribute them back. The entire ecosystem benefits.

4. Building community

The Power world is fragmented. AIX administrators, Linux on Power teams, IBM i environments – all solving similar problems in isolation.

LibrePower connects these communities:

  • Cross-platform knowledge sharing
  • Amplify the collective voice to manufacturers and projects
  • Create a network of expertise in Power

5. Expanding the market

Every new solution validated in Power is one more reason for organizations to choose the platform. Every developer who learns Power is talent for the ecosystem. Every success story demonstrates value.

More solutions → more adoption → stronger ecosystem → more investment in Power.

This virtuous circle benefits everyone: IBM, partners, ISVs and users.

Why should you care?

If you manage Power systems:

  • Access tools and solutions you were missing
  • Join a community that understands your environment
  • Maximize the return on your hardware investment

If you are a developer:

  • Learn a platform with unique technical features
  • Contribute to projects with real impact in the enterprise world.
  • Develop expertise in a high-value niche

If you are an ISV:

  • Get your software validated in Power
  • Connect with enterprise customers
  • Discover market opportunities in the Power ecosystem

If you are evaluating infrastructure:

  • Find out what’s really possible on Power beyond the traditional workloads
  • Find independent validation of solutions
  • Connect with the community to learn about real experiences

What we are working on

AIX (aix.librepower.org)

  • ✅ Modern package manager (dnf/yum for AIX)
  • ✅ fzf – fuzzy search engine (Go binary compiled for AIX)
  • ✅ nano – modern editor
  • 2FA tools – Google Authenticator with QR codes
  • 🔄 Coming soon: ripgrep, yq, modern coreutils

Linux ppc64le

  • 🔄 AWX – Ansible automation (full port in progress)
  • 🔄 Container Tools – Podman, Buildah, Skopeo
  • 🔄 Development tools – lazygit, k9s, modern CLIs

IBM i

  • 📋 Planning phase – assessing priorities with community input.

The vision

Imagine:

  • Every major open source project considers Power at the time of release
  • ISVs see Power as a strategic platform, not an afterthought
  • Organizations deploy new workloads on Power with confidence
  • A connected and growing community that powers the ecosystem

That’s The Power Renaissance.

It is not nostalgia for the past. It is not just extending the life of existing deployments.

Active expansion of what Power can do and who uses it.


Join

LibrePower grows with the community. This is how you can participate:

Who is behind it?

LibrePower is an initiative of SIXE, an IBM and Canonical Business Partner with more than 20 years of experience in the Power ecosystem.

We have seen what Power can do. We’ve seen what’s missing. Now we build what should exist.

LibrePower – Unlocking the potential of Power Systems through open source software. Unmatched RAS. Superior TCO. Minimal footprint. 🌍

What is OpenUEM? The Open Source revolution for device management

In the world of system administration, we often encounter oversized, expensive and complex tools. However, when we analyze what people are most looking for in terms of efficient alternatives, the name OpenUEM starts to ring a bell.

From SIXE, as specialists in infrastructure and open source, we want to answer the most frequently asked questions about this technology and explain why we have opted for it.


What is OpenUEM and how does it work?

OpenUEM (Unified Endpoint Management) is an open source solution designed to simplify the lives of IT administrators. Unlike traditional suites that charge per agent or device, OpenUEM offers a centralized platform for inventory, software deployment and remote management of devices without abusive licensing costs.

Its operation stands out for its simplicity and efficiency:

  1. The agent: A small program installed on each endpoint.

  2. The server: Collects information in real time and lets you execute actions.

  3. The web console: From a browser, the administrator can view the entire device fleet, install applications or connect remotely.

OpenUEM vs. other traditional tools

One of the most common doubts is how this tool compares to the market giants. We have made a list of pros and cons with SIXE’s perspective, so you can draw your own conclusions :)

  • In favor:

    • Cost: Being Open Source, you eliminate licensing costs. Ideal for SMBs and large deployments where the cost per agent skyrockets.

    • Privacy: It’s self-hosted. You control the data, not a third-party cloud.

    • Lightweight.

  • Against:

    • Being a younger tool, it may not (yet) have the infinite plugin ecosystem of solutions that have been on the market for 20 years. However, it more than covers 90% of the usual management needs.

How to integrate OpenUEM with your current IT infrastructure?

Integration is less traumatic than it seems. OpenUEM is designed to coexist with what you already have.

  • Software deployment: Integrates natively with repositories such as Winget (Windows) and Flatpak (Linux), using industry standards instead of closed proprietary formats.

  • Remote support: Incorporates proven technologies such as VNC, RDP and RustDesk so you can support remote employees without complex VPN configurations in many cases.

If you’re wondering how to set up OpenUEM to manage remote employees’ devices, the answer lies in its flexible architecture. The server can be deployed via Docker in minutes, allowing agents to report securely from any location with internet access.

Who offers OpenUEM support and solutions for companies?

Although the software is free, companies need guarantees, support and a professional implementation. This is where we come in. At SIXE, we don’t just implement the tool; we offer the necessary business support so you can sleep easy. We know that integrating a new platform can raise questions about pricing or deployment models for small and medium-sized businesses. That’s why our approach is not to sell you a license (there aren’t even any), but to help you deploy, maintain and secure your device management infrastructure with OpenUEM.

Contact us!

If you are looking for a platform to manage your mobile and desktop devices that is transparent, auditable and cost-effective, OpenUEM may be a solution for you. Want to see how it would work in your environment? Take a look at our professional OpenUEM solution and find out how we can help you manage the control of your devices. For those who are more curious or want to play around with the tool on their own, we always recommend visiting the official OpenUEM repository.

Claroty xDome vs CTD: Cloud or local? Architecture analysis for your OT network

At SIXE we have been dealing with infrastructure, networks and security for more than 15 years. We have seen everything from PLCs recklessly exposed to the internet to hospital networks where a smart coffee maker could take down a critical server. So when we talk about Industrial Cybersecurity (OT) and IoMT (Internet of Medical Things), we don’t commit to just any vendor. But if you’ve gotten this far by Googling, you’re probably tangled up in product names. What is xDome? What is CTD? Which one do I need? Don’t worry: today we’re putting on our engineering overalls to explain it clearly.

[Screenshot: Claroty interface visualizing OT assets]


The big question: What is the difference between Claroty CTD and xDome?

This is the million-dollar question. Both solutions aim for the same thing: total visibility and protection of your industrial environment, but their architecture is radically different.

1. Claroty xDome (The future Cloud-Native)

It is the SaaS evolution. xDome is designed for companies that want to take advantage of the agility of the cloud without sacrificing security.

  • How does it work? It is a cloud-based solution that performs asset discovery, vulnerability management and threat detection.

  • The best: Deploys lightning fast and scales beautifully. Ideal if your organization already has a “Cloud First” mentality and you want to manage multiple sites from a single pane of glass without deploying tons of hardware.

2. Claroty CTD (Continuous Threat Detection – On-Premise)

It is the classic heavyweight for environments that, by regulation or philosophy, can’t (or won’t) touch the cloud.

  • How does it work? Everything stays in-house. It’s deployed on your own local infrastructure.

  • The best: Total data sovereignty. It is the preferred option for highly sensitive critical infrastructures (energy, defense) where data does not leave the physical perimeter under any circumstances.

SIXE’s advice: There is no one “better” than another, there is one that best suits your architecture. At SIXE we perform a thorough analysis before recommending anything.


And what does Claroty Edge have to do with all this?

Sometimes you don’t need to deploy a complete continuous monitoring infrastructure from day one, or you just need a quick audit to find out “what the heck do I have connected in my plant”.

Claroty Edge requires no network changes (no SPAN ports, no complex inbound TAPs). It is an executable that you launch, it scans, gives you a complete “snapshot” of your assets and vulnerabilities in minutes, and closes without a trace.


Who does Claroty compete with and which are the best cybersecurity companies?

If you are evaluating software, names like Nozomi Networks, Dragos or Armis will surely ring a bell. They are the big “rivals” in the magic quadrant.

Which are the best? It depends on who you ask, but the technical reality is this:

  1. Claroty: Undisputed leader in protocol comprehensiveness (speaks the language of your machines, whether Siemens, Rockwell, Schneider, etc.) and its integration with medical environments (Medigate).

  2. Nozomi: Very strong in passive visibility.

  3. Dragos: Very focused on pure threat intelligence.

Why did SIXE choose Claroty? Because we understand what’s underneath the software. IT/OT convergence is complex and Claroty offers the most complete suite (Secure Remote Access + CTD/xDome). It doesn’t just tell you “there’s a virus”, it allows you to manage third-party remote access (goodbye to insecure vendor VPNs) and segment the network correctly.

If you want to learn more about how industrial safety standards compare globally, you can take a look at the IEC 62443 standard, which is the bible we follow for these implementations.


Implementing Claroty is not just about installing, it’s about understanding

This is where we come in. A powerful tool configured by inexperienced hands is just a generator of noise and false positives.

At SIXE we do not limit ourselves to implementation, we think about:

  • Design the architecture: We plan where to put the sensors (SPAN Ports, TAPs) so as not to generate latency. The plant can NOT be stopped.

  • Fine-tune policies: A false positive in a plant can mean an engineer running at 3 am. We adjust the tool to the reality of your protocols (Modbus, Profinet, CIP).

  • Train your team: We train your SOC to understand that an OT alert is not the same as an IT alert.

👉 Discover how we implemented Claroty at SIXE

How to learn Ceph | The reality that NOBODY is telling

 

“I launch commands but I don’t understand anything”. “I read documentation but when something fails I don’t even know where to start.” “I’ve been with Ceph for a year and I feel like I’m barely scratching the surface.” If any of these phrases resonate with you, you’re not alone. And most importantly: it’s not your fault.

After more than 10 years working with Ceph in production, teaching hundreds of administrators, and rescuing “impossible” clusters at 3AM, we have come to a conclusion that no one will tell you in official certifications: Ceph is brutally difficult to master. And not because you’re a bad administrator, but because the technology is inherently complex, constantly evolving, and the documentation assumes knowledge that no one explicitly taught you.

We’re not going to sell you a “learn Ceph in 30 days”. We want to tell you the truth about how you really learn, how long it takes, what misunderstandings will hold you back, and what is the most effective route to go from blindly throwing commands to having real expertise ;)

Why Ceph is so hard to learn (and it’s not your fault)

Complexity is not accidental: it is inherent

Ceph is not “just another storage system”. It is a massively distributed architecture that must solve simultaneously:

  • Write consistency with multi-node replication and distributed consensus
  • Continuous availability in the event of hardware failures (disks, nodes, complete racks)
  • Automatic rebalancing of petabytes of data with no downtime
  • Predictable performance under variable and multi-tenant loads
  • Three completely different interfaces (block, object, filesystem) on the same basis
  • Integration with multiple ecosystems (OpenStack, Kubernetes, traditional virtualization)

Each of these capabilities separately is a complex system. Ceph integrates them all. And here’s the problem: you can’t understand one without understanding the others.

Beginner’s Mistake #1: Trying to learn Ceph as if it were just another stateless service. “I configure, issue commands, and it should work”. No. Ceph is a distributed system with shared state, consensus between nodes, and emergent behaviors that only appear under load or failure. If you don’t understand the underlying architecture, every problem will be an indecipherable mystery.

Documentation assumes knowledge that no one ever taught you

Read any page of the official Ceph documentation and you will find terms like:

  • Placement groups (PGs)
  • CRUSH algorithm
  • BlueStore / FileStore
  • Scrubbing and deep-scrubbing
  • Peering and recovery
  • OSDs up/down vs in/out

The documentation explains what they are, but not why they exist, what problem they solve, or how they interact with each other. It’s like trying to learn to program by starting with the language reference instead of the fundamental concepts.

Real example: A student wrote to us: “I’ve spent 3 months trying to understand PGs. I read that ‘they are a logical abstraction’, but why do they exist? Why not map objects directly to OSDs?”

That question shows deep understanding. The answer (CRUSH map scalability, rebalancing granularity, metadata overhead) requires first understanding distributed systems, consistent hashing theory, and architectural trade-offs. No one teaches you that before turning you loose on ceph osd pool create.
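
If you want to see that mapping with your own eyes, the cluster will show it to you. A minimal sketch using real ceph CLI commands; the pool name (rbd) and object name (my-object) are just examples:

    # Where does an object end up? Ceph prints the full object -> PG -> OSDs mapping.
    ceph osd map rbd my-object
    # The output names a PG (e.g. 2.1f) and the acting set of OSDs, e.g. [4,11,7].

    # The PG layer is what lets CRUSH move data in bulk: list the PGs of a pool
    # and which OSDs currently hold them.
    ceph pg ls-by-pool rbd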

Constant evolution invalidates knowledge

Ceph changes FAST. And I’m not talking about optional features, but fundamental architectural changes:

  • FileStore → BlueStore (2017): Completely changes how you write to disk.
  • ceph-deploy → ceph-ansible → cephadm (2020): three different deployment tools in 5 years
  • Luminous → Nautilus → Octopus → Pacific → Quincy → Reef → Squid: 7 major versions in 7 years, each with breaking changes.
  • Crimson/Seastore (2026+): Complete rewrite of the OSD that will invalidate much of the tuning knowledge.

What you learned 2 years ago about FileStore tuning is completely irrelevant now. The PG counts you used to calculate manually per pool are now managed by the autoscaler. Networking best practices changed with msgr2.

Beginner (and expert) mistake #2: Learning configurations by rote without understanding why they exist. I saw an administrator manually configuring PG counts with Luminous-era formulas… on a Squid cluster with the autoscaler enabled. The autoscaler ignored it, and he didn’t understand why. Historical context matters so you know which knowledge is obsolete.
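
Before touching pg_num by hand on a modern cluster, ask the autoscaler what it is already doing. A quick sanity check; the pool name (mypool) is an example:

    # What does the autoscaler think each pool needs? (see the NEW PG_NUM column)
    ceph osd pool autoscale-status

    # Is autoscaling actually on for this pool?
    ceph osd pool get mypool pg_autoscale_mode

    # If you really want manual control, turn it off explicitly instead of fighting it
    ceph osd pool set mypool pg_autoscale_mode off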

How long it really takes to master Ceph

Let’s talk real numbers, based on our experience training administrators:

  • 40 hours: basic functional deployment
  • 6 months: troubleshooting without panic
  • 2-3 years: real expertise in production

Realistic learning progression


Month 1-2: “I don’t understand anything but it works”.

You follow tutorials. You run commands like ceph osd pool create and ceph osd tree. The cluster works… until it doesn’t. An OSD is marked down and you panic because you don’t know how to diagnose it.

Typical symptom: You copy commands from Stack Overflow without understanding what they do. “I fixed it” but you don’t know how or why.
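
For the record, that panic moment has a boring, repeatable first response. A minimal triage sequence with standard ceph commands (no tuning, just looking):

    # Overall cluster state: health, OSDs up/in, PG states, recovery activity
    ceph -s

    # What exactly is Ceph complaining about?
    ceph health detail

    # Which OSDs are down, and where do they sit in the CRUSH tree?
    ceph osd tree down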

Month 3-6: “I understand commands but not architecture”.

You have memorized the main commands. You know how to create pools, configure RBD, mount CephFS. But when “PG 3.1f is stuck in peering” comes up, you have no idea what “peering” means or how to fix it.

Typical symptom: You solve problems by trial-and-error by restarting services until it works. There is no method, there is luck.
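
The trial-and-error phase gets shorter the moment you ask the cluster what the PG itself is waiting for. A small sketch for the “stuck in peering” case; the PG id 3.1f is just the example from above:

    # List PGs that are stuck (inactive, unclean, stale...)
    ceph pg dump_stuck

    # Ask one specific PG why it is not active: look at 'recovery_state'
    # and 'blocked_by' in the output
    ceph pg 3.1f query

    # Very often the culprit is simply a down or flapping OSD
    ceph osd tree down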

Month 6-12: “I understand architecture but not tuning”.

You finally understand MON/OSD/MGR, the CRUSH algorithm, what PGs are. You can explain the architecture on a whiteboard. But your cluster performs poorly and you don’t know if it’s CPU, network, disks, or configuration.

Typical symptom: You read about BlueStore tuning, change parameters randomly, don’t measure before/after. Performance remains the same (or worse).

Year 1-2: “I can troubleshoot but no method”.

You have already rescued a few clusters. You know how to read the cluster status, interpret PG states, recover a down OSD. But every problem is a new four-hour adventure of trying things.

Typical symptom: You can fix problems… eventually. But you can’t predict them or explain to your boss how long the solution will take.

Year 2-3: “I have method and understand trade-offs”.

You diagnose systematically: collect symptoms, formulate hypotheses, validate with specific tools. You understand trade-offs: when to use replication vs. erasure coding, how to size hardware, when NVMe is worthwhile.

Typical symptom: Your response to problems is “let me check X, Y and Z” with a clear plan. You can estimate realistic recovery times.

Year 3+: real expertise

You design architectures from scratch considering workload, budget and SLAs. You do disaster recovery without a manual. You optimize BlueStore for specific loads. You understand the source code well enough to debug rare behaviors.

Typical symptom: Other admins call you when a cluster is “impossible”. You take 20 minutes to identify the problem that they have been attacking for 3 days.

The good news: You can SIGNIFICANTLY accelerate this progression with structured training. A good 3-day course can condense 6 months of trial-and-error. Not because it “teaches faster”, but because it saves you from dead ends and misunderstandings that consume weeks.

Typical misunderstandings that hold back your learning

Misunderstanding #1: “More hardware = more performance”.

I have seen clusters with 40 OSDs performing worse than clusters with 12. Why? Because they had:

  • Public and cluster network on the same interface (guaranteed saturation)
  • CPU frequency governor in “powersave” (5x degradation in replication)
  • PG count totally unbalanced between pools
  • BlueStore cache set far too low for RGW workloads

The reality: Ceph performance depends on the weakest link. A single-threaded bottleneck in a MON can bring down the whole cluster. More misconfigured hardware only multiplies the chaos.

Misunderstanding #2: “Erasure coding always saves space”.

A student once proudly said: “I moved my entire cluster to erasure coding 8+3 to save space”. We asked him, “What workload do you have?” – “RBD with frequent snapshots.” Whoops.

Erasure coding with workloads that do small overwrites (RBD, CephFS) is TERRIBLE for performance. And the space “savings” is eaten up with partial stripes and metadata overhead.

The reality: EC is great for object storage cold data (RGW archives). It is terrible for block devices with high IOPS. Knowing the workload before deciding architecture is fundamental.
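
To make the trade-off concrete, this is roughly what “EC for cold objects, replication for small overwrites” looks like in practice. A sketch only; pool names, profile name and k/m values are invented for the example:

    # Erasure-coded pool for cold RGW data: 8 data + 3 coding chunks,
    # one chunk per host so a single host failure never breaks a stripe
    ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
    ceph osd pool create rgw.cold 128 128 erasure ec-8-3

    # Keep small-overwrite workloads (RBD, CephFS) on replicated pools
    ceph osd pool create rbd.hot 128 128 replicated
    ceph osd pool set rbd.hot size 3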

Misunderstanding #3: “If ceph health says HEALTH_OK, all is well.”

No. HEALTH_OK means that Ceph did not detect any problem it knows how to look for. It does not detect:

  • Progressive disk degradation (SMART warnings)
  • Intermittent network packet loss
  • Memory leaks in daemons
  • Scrubbing that has not been completed for 2 weeks
  • PGs with suboptimal placement causing hotspots

The reality: You need external monitoring (Prometheus + Grafana at a minimum) and you need to review metrics that ceph health does not surface. HEALTH_OK is necessary but not sufficient.
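
The built-in Prometheus exporter in the manager is the usual starting point for that external monitoring; a minimal sketch (wiring up Prometheus and Grafana themselves is out of scope here):

    # Expose cluster metrics for Prometheus to scrape (default port 9283)
    ceph mgr module enable prometheus

    # Verify the module is active and see the endpoint it listens on
    ceph mgr services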

Misunderstanding #4: “I read the official doc and that’s enough”.

The official documentation is reference material, not teaching material. It assumes you already understand:

  • Distributed systems (Paxos, quorum, consensus)
  • Storage fundamentals (IOPS vs throughput, latency percentiles)
  • Networking (MTU, jumbo frames, TCP tuning)
  • Linux internals (cgroups, systemd, kernel parameters)

If you don’t bring that foundation, the doc will confuse you more than help.

The reality: You need additional resources: academic papers on distributed systems, blogs of real experiences, training that connects the dots that the doc omits.

Typical mistakes (that we all make)

Beginner’s mistakes

Not configuring a cluster network: the public network gets saturated with internal replication traffic and performance plummets. Solution: --cluster-network from day 1.
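
For reference, this is roughly how the separation looks with the config database; the subnets are invented for the example, and OSDs need a restart to pick up the change:

    # Client traffic on one subnet, replication/recovery traffic on another
    ceph config set global public_network  10.0.0.0/24
    ceph config set global cluster_network 10.0.1.0/24

    # Or declare it at bootstrap time (recent cephadm releases accept this flag)
    # cephadm bootstrap --mon-ip 10.0.0.10 --cluster-network 10.0.1.0/24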

Using defaults for PG count: in pre-Pacific versions you would create a pool with 8 PGs… for a pool that would grow to 50 TB. Fixing that later means a long, painful PG split and rebalance. Solution: the autoscaler, or calculate properly from the start.

Not understanding the difference between OSD up/down and in/out: you take an OSD out for maintenance with ceph osd out and immediately trigger a massive rebalance that takes 8 hours. What you wanted was noout. Oops.
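
The maintenance routine that avoids that 8-hour surprise is short enough to memorize. A sketch; osd.7 stands in for whichever OSD you are servicing:

    # Tell the cluster not to start rebalancing while an OSD is briefly down
    ceph osd set noout

    # ...stop the daemon, swap the disk, reboot the node, etc...

    # When the OSD is back, remove the flag
    ceph osd unset noout

    # 'ceph osd out 7' is for permanent removal: it triggers rebalancing on purpose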

Intermediate errors

Oversized erasure coding: configuring 17+3 EC in a 25-node cluster. One node goes down and the pool drops to read-only because there are not enough OSDs left to write full stripes. Trade-off not understood.

Ignoring the I/O scheduler: using the deadline scheduler with NVMe (absurd), or the none scheduler with HDDs (disastrous). The right scheduler can be worth 20-30% of your performance.
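
Checking it takes seconds, which is why it hurts so much when nobody did. A sketch using the standard sysfs interface; device names are examples:

    # NVMe: the multi-queue 'none' scheduler is usually what you want
    cat /sys/block/nvme0n1/queue/scheduler
    echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

    # HDD: mq-deadline (or bfq) makes far more sense than none
    cat /sys/block/sda/queue/scheduler
    echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler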

Not planning disaster recovery: “We have 3x replication, we are safe”. Then a whole rack goes down and they lose MON quorum. They never practiced recovery. Panic.

Expert mistakes (yes, we make them too)

Over-tuning: changing 15 BlueStore parameters simultaneously “to optimize”. Something breaks. Which of the 15 changes caused it? Nobody knows. Principle: change ONE thing, measure, iterate.
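
“Change one thing, measure, iterate” only works if the measurement is repeatable. A minimal sketch with rados bench; the pool name and durations are examples:

    # Baseline: 60 seconds of 4 MB writes, keeping the objects for the read test
    rados bench -p testpool 60 write --no-cleanup

    # Read back what was written (sequential reads)
    rados bench -p testpool 60 seq

    # Clean up the benchmark objects
    rados -p testpool cleanup

    # Now change ONE parameter, re-run exactly the same commands, and compare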

Relying too much on old knowledge: applying FileStore tuning techniques to BlueStore. It doesn’t work, because the internal architecture is totally different. Historical context matters.

Not documenting architectural decisions: 2 years ago you decided to use EC 8+2 in a certain pool for X reason. Nobody documented it. Now a new admin wants to “simplify” to replication. Disaster avoidable with documentation.

The most effective way to learn Ceph

Phase 1: architectural fundamentals (40-60 hours)

Before touching a command, understand:

  • What problem does Ceph solve (vs NAS, vs SAN, vs cloud storage)?
  • RADOS architecture: how MON, OSD, MGR work
  • CRUSH algorithm: why does it exist, what problem does it solve?
  • Placement groups: the abstraction that makes the system scalable
  • Difference between pools, PGs, objects, and their mapping to OSDs

How to study this: Not with commands, but with diagrams and concepts. A good fundamentals course is 100x more effective than “deploy in 10 minutes” tutorials.

Recommended course: Ceph administration

Level: fundamental
3 days intensive

Program specifically designed to build a solid foundation from scratch. Assumes no prior knowledge of distributed storage.

See complete program →

Phase 2: advanced configuration and troubleshooting (60-80 hours)

With solid foundations, you now go deeper:

  • BlueStore internals: how the data is actually written
  • CRUSH rules customized for complex topologies
  • Performance tuning: identifying bottlenecks
  • Multi-site RGW for geo-replication
  • RBD mirroring for disaster recovery
  • Systematic troubleshooting with method

The goal: To move from “I can configure” to “I understand why I configure this way and what trade-offs I am making”.

Recommended course: Advanced Ceph

Level: advanced
3 days intensive

For administrators who already have a cluster running but want to master complex configurations and prepare for EX260.

See complete program →

Phase 3: critical production operations (80-100 hours)

The final leap: from “I know how to configure and troubleshoot” to “I can rescue clusters in production at 3AM”.

  • Forensic troubleshooting: diagnosing complex multi-factor failures
  • REAL disaster recovery: recovering corrupted metadata, lost journals
  • Performance engineering: kernel and hardware optimization
  • Architectures for specific loads: AI/ML, video streaming, compliance
  • Security hardening and compliance (GDPR, HIPAA)
  • Scaling to petabytes: problems that only appear on a large scale

The objective: Real verifiable expertise. Eliminate that “respect” (fear) of critical scenarios.

Recommended course: Ceph production engineering

Level: expert
3 days intensive

The only course on the market focused 100% on critical production operations. No simulations – real problems.

See complete program →

Continuous practice: the non-negotiable ingredient

Here is the inconvenient truth: you can do the 3 courses and still have no expertise if you don’t practice. Theoretical knowledge is forgotten if you don’t apply it.

SIXE’s recommendation after each course:

  1. Set up a practice cluster (local VMs or cheap cloud instances will do).
  2. Intentionally break things: kill OSDs, fill disks, saturate the network.
  3. Practice recovery without the manual: can you recover without Google?
  4. Measure everything: benchmarks before/after each change.
  5. Document your learning: blog, notes, whatever works.

Pro tip: The best Ceph administrators I know maintain a permanent “lab cluster” where they test crazy things. Some even have scripts that inject random bugs to practice troubleshooting. Sounds extreme, but it works.
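
A very rough sketch of that idea, for a disposable lab only. It assumes a cephadm-managed cluster (so ceph orch is available) and simply stops a random OSD so you can practice the diagnosis:

    #!/usr/bin/env bash
    # Lab-only chaos: stop a random OSD, watch the cluster react, practice recovery.
    set -euo pipefail

    osd_id=$(ceph osd ls | shuf -n 1)
    echo "Stopping osd.${osd_id} - diagnose and recover without Google"
    ceph orch daemon stop "osd.${osd_id}"

    # Watch degraded PGs, recovery traffic and health warnings evolve
    watch -n 5 ceph -s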

“The difference between an intermediate admin and an expert is not that the expert doesn’t make mistakes. It’s that the expert recognizes the mistake in 5 minutes instead of 5 hours, because he’s made it before and documented the solution.”

Conclusion: the road is long but can be accelerated.

If you made it this far, you are already in the top 10% of Ceph administrators out of pure intent to learn correctly. Most drop out when they realize the real complexity.

The uncomfortable truths you must accept:

  1. Mastering Ceph takes 2-3 years of continuous hands-on experience. There are no magic shortcuts.
  2. You’re going to make mistakes. Lots of them. Some in production. It’s part of the process.
  3. Knowledge depreciates quickly. What you learn today will be partially obsolete in 18 months.
  4. Official documentation will never be tutorial-friendly. You need complementary resources.

But there is also good news:

  1. Demand for Ceph experts massively exceeds supply. Good time to specialize.
  2. You can accelerate the 6-12 month learning curve with structured training that avoids dead ends.
  3. Once you “click” on the fundamental architecture, the rest is logically built on that foundation.
  4. The community is generally open to help. You are not alone.

Our final advice after 10+ years with Ceph: Start with solid fundamentals, practice constantly, and don’t be afraid to break things in test environments. The best administrators I know are the ones who have broken the most clusters (in labs) and meticulously documented every recovery.

And if you want to significantly accelerate your learning curve, consider structured training that condenses years of practical experience into intensive weeks. Not because it’s easier, but because it saves you the 6 months we all waste attacking problems that someone else has already solved.

Where to start today?

Pick one of the courses above, or just set up a cluster of 3 VMs, break things, and learn troubleshooting. The path is yours, but it doesn’t have to be a lonely one.

 

Open source storage for AI and HPC: when Ceph is no longer an alternative but the only viable way forward

When CERN needs to store and process data from the Large Hadron Collider (LHC, the world’s largest and most powerful particle accelerator), scale is everything. At this level, technology and economics converge in a clear conclusion: open source technologies such as Ceph, EOS and Lustre are not an “alternative” to traditional enterprise solutions; in many scenarios, they are the only viable way forward.

With more than 1 exabyte of disk storage, 7 billion files and 45 petabytes per week processed during data collection campaigns, the world’s largest particle physics laboratory operates in a realm where classical capacity-based licensing models no longer make economic sense.

This reality, documented in the paper presented at CHEP 2025, “Ceph at CERN in the multi-datacentre era”, reflects what more and more universities and research centers are finding: there are use cases where open source does not compete with enterprise solutions, it defines its own category, one for which traditional architectures were simply not designed.

Open source storage at CERN

CERN: numbers that change the rules

The CERN figures are not only impressive; they explain why certain technologies are chosen:

  • >1 exabyte of disk storage, distributed over ~2,000 servers with 60,000 disks.

  • >4 exabytes of annual transfers.

  • Up to 45 PB/week and sustained throughput of >10 GB/s during data collection periods.

Architecture is heterogeneous by necessity:

  • EOS for physics files (more than 1 EB).

  • CTA (CERN Tape Archive) for long-term archiving.

  • Ceph (more than 60 PB) for block, S3 objects and CephFS, underpinning OpenStack.

It is not only the volume that is relevant, but also the trajectory. In a decade they have gone from a few petabytes to exabytes without disruptive architectural leaps, simply by adding commodity nodes horizontally. This elasticity does not exist in proprietary arrays with capacity-based licenses.

The economics of the exabyte: where capacity models fail

Current licensing models in the enterprise market are reasonable for typical environments (tens or hundreds of terabytes, predictable growth, balanced CapEx and OpEx). They provide integration, 24×7 support, certifications and a partner ecosystem. But at petabyte or exabyte scale with rapid growth, the equation changes.

  • At SIXE we are an IBM Premier Partner and we have watched these offerings evolve towards capacity-based licensing.

    • IBM Spectrum Virtualize uses Storage Capacity Units (SCUs), roughly 1 TB per SCU. The annual cost per SCU can range between €445 and €2,000, depending on volume, customer profile and the conditions of each environment.

    • IBM Storage Defender uses Resource Units (RUs). For example, IBM Storage Protect consumes 17 RUs/TB for the first 100 TB and 15 RUs/TB for the next 250 TB, allowing resiliency capabilities to be combined under a unified license.

  • Similar models exist at NetApp (term-capacity licensing), Pure Storage, Dell Technologies and others: you pay for managed or provisioned capacity.

All of this works in conventional enterprise environments. However, managing 60 PB under per-capacity licensing, even with high-volume discounts, can translate into millions of euros per year in software alone, without counting hardware, support or services. At that point, the question is no longer whether open source is “viable”, but whether there is any realistic alternative to it at these scales.

Technical capabilities: an already mature open source

The economic advantage would not matter if the technology were inferior. It is not. For certain AI and HPC workloads, the capabilities are equivalent or superior:

  • Ceph offers unified storage with thin provisioning, BlueStore compression, snapshots and COW clones without significant penalty, multisite replication (RGW and RBD) and tiering between media (a minimal sketch follows this list), and if you want your team to understand how to take advantage of Ceph, we have…

  • CERN documents multi-datacenter strategies for business continuity and disaster recovery using stretch clusters and multisite replication, with RPO/RTO comparable to enterprise solutions.
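
As a small illustration of those capabilities, enabling on-the-fly compression and taking a COW snapshot are one-liners. A sketch only; the pool, image and snapshot names are invented for the example:

    # Per-pool BlueStore compression, transparent to clients
    ceph osd pool set mypool compression_algorithm zstd
    ceph osd pool set mypool compression_mode aggressive

    # Thin-provisioned RBD image plus an instant COW snapshot and clone
    rbd create mypool/vm-disk --size 100G
    rbd snap create mypool/vm-disk@golden
    rbd snap protect mypool/vm-disk@golden
    rbd clone mypool/vm-disk@golden mypool/vm-disk-clone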

IBM recognizes this maturity with IBM Storage Ceph (a derivative of Red Hat Ceph Storage), which combines open source technology with enterprise-level support, certifications and SLAs. At SIXE, as an IBM Premier Partner, we implement IBM Storage Ceph when business support is required, and Ceph upstream when flexibility and independence are prioritized.

Key architectural difference:

  • IBM Spectrum Virtualize is an enterprise layer that manages heterogeneous block storage, with dedicated nodes or instances and advanced mobility, replication and automation features.

  • Ceph is a natively distributed system that serves blocks, objects and files from the same horizontal infrastructure, eliminating silos. Object storage for dataset pipelines, block devices for metadata, file shares for collaboration: this unification brings clear operational advantages.


Large-scale AI and HPC: where the distributed shines

Training foundational models means reading petabytes in parallel, with aggregate bandwidths of 100 GB/s or more. Inference requires sub-10 ms latencies with thousands of concurrent requests.

Traditional architectures with SAN controllers suffer bottlenecks when hundreds of GPUs (A100, H100…) access data at the same time. It is estimated that about 33% of GPUs in corporate AI environments operate at less than 15% utilization due to storage saturation, with the consequent cost in underutilized assets.

Distributed architectures (Ceph, Lustre, BeeGFS) were born for these patterns:

  • Lustre drives 7 of the top 10 supercomputers in the Top500, with >1 TB/s aggregate throughput in large installations. Frontier (ORNL) uses ~700 PB of Lustre and sustains writes of >35 TB/s.

  • BeeGFS scales storage and metadata independently, exceeding 50 GB/s sustained with tens of thousands of clients in production.

  • MinIO, optimized for object storage in AI, has demonstrated >2.2 TiB/s read performance in training workloads, difficult to match with centralized architectures.

Integration with GPUs has also matured: GPUDirect Storage allows GPUs to read from NVMe-oF without passing through the CPU, reducing latency and freeing up cycles. Modern open source systems support these protocols natively; proprietary solutions often depend on firmware and certifications that take quarters to arrive.

SIXE: sustainable open source, with or without commercial support

Migrating to large-scale open source storage is not trivial. Distributed systems require specific experience.

At SIXE we have been working with Linux and open source for more than 20 years. As an IBM Premier Partner, we offer the best of both worlds:

  • IBM Storage Ceph and IBM Storage Scale (formerly Spectrum Scale/GPFS) for those who need guaranteed SLAs, certifications and 24×7 global support.

  • Ceph upstream (and related technologies) for organizations that prefer maximum flexibility and control.

It is not a contradictory position but a strategic one: different profiles, different needs. A multinational bank values certifications and enterprise support. A research center with a strong technical team can operate upstream directly.

Our intensive Ceph training consists of hands-on, three-day workshops: real clusters are deployed and design decisions are debated. Knowledge transfer reduces dependence on consultants and empowers the internal team. If your team still has little experience with Ceph, see our initiation course; if, on the other hand, you want to get the most out of Ceph, we also offer the advanced Ceph course, where your team will learn to combine two technologies that are crucial right now: storage + AI.

 

Our philosophy: we do not sell technology, we transfer capability. We deploy IBM Storage Ceph with full support, Ceph upstream with our specialized support, or hybrid approaches, on a case-by-case basis.

The opportunity for massive data and science

Several factors align:

  • Data is growing exponentially: a NovaSeq X Plus can generate 16 TB per run; the SKA telescope will produce exabytes per year; AI models demand ever larger datasets.

  • Budgets do not grow at the same pace. Capacity-based licensing models make it unfeasible to scale proprietary systems at the required rate.

Open source solutions, whether upstream or commercially supported (e.g., IBM Storage Ceph), eliminate this dichotomy: growth is planned around hardware cost and operational capacity, with software whose costs do not scale linearly per terabyte.

Centers such as Fermilab, DESY, CERN itself or the Barcelona Supercomputing Center have demonstrated that this approach is technically feasible and operationally superior for their cases. In its recent paper, CERN details multi-datacenter DR with Ceph (stretch and multisite), achieving availability comparable to enterprise solutions, with flexibility and total control.

A maturing ecosystem: planning now

The open source storage ecosystem for HPC and AI is evolving fast:

  • The Ceph Foundation (Linux Foundation) coordinates contributions from CERN, Bloomberg, DigitalOcean, OVH and IBM, among others, aligned with real production needs.

  • IBM maintains IBM Storage Ceph as a supported product and actively contributes upstream.

It is the ideal confluence of open source innovation and enterprise support. For organizations with a horizon of decades, the question is no longer whether to adopt open source, but when and how to do so in a structured way.

The technology is mature, the success stories are documented, and support exists in both community and commercial form. What is often missing is the expertise to draw up the roadmap: model (upstream, commercial or hybrid), sizing, training and sustainable operation.

SIXE: your partner towards a storage that grows with you

At SIXE we work at that intersection. As an IBM Premier Partner, we have access to world-class support, roadmaps and certifications. At the same time, we maintain deep expertise in upstream Ceph and other ecosystem technologies, because there is no one-size-fits-all solution.

When a center contacts us, we don’t start with the catalog but with the key questions:

  • What are your access patterns?

  • What growth do you project?

  • What capabilities does your team have?

  • What risks can you assume?

  • What is the budget (CapEx/OpEx)?

The answers guide the recommendation: IBM Storage Ceph with enterprise support, upstream with our support, a hybrid, or even evaluating whether a traditional solution still makes sense in your case. We design solutions that work for 5 to 10 years; what matters to us is creating durable, sustainable solutions over time ;)

Our commitment is to sustainable technologies, not subject to commercial fluctuations, that provide control over infrastructure and scale both technically and economically.

The case of CERN is not an academic curiosity: it shows where storage for data-intensive workloads is going. The question is not whether your organization will get there, but how it will arrive: prepared, or in a rush. The window of opportunity to plan calmly is open. The success stories exist. The technology is ready. So is the ecosystem. What remains is the strategic decision to invest in infrastructure that will accompany your organization through decades of data growth.

Contact us!

Does your organization generate massive volumes of data for AI or research? At SIXE we help research centers, universities and innovative organizations design, implement and operate scalable storage with Ceph, Storage Scale and other leading technologies, both upstream and with IBM business support, according to your needs. Contact us for a no-obligation strategic consultation.


See you at Common Iberia 2025!

SIXE’s team will be taking part in Common Iberia 2025. We will be back in Madrid on November 13th and 14th for the flagship event of the IBM i, AIX and Power ecosystem. Two days dedicated to the latest developments in Power technology, from the announcement of Power 11 to real AI use cases, with international experts, IBM Champions and community leaders.

Click on the image to access the event registration form.

Our sessions at Common Iberia Madrid:

Document Intelligence on IBM Power with Docling and Granite
Discover how to implement advanced document intelligence directly into your Power infrastructure.

Common Iberia 2025

AIX 7.3 news and best practices: performance, availability and security
Everything you need to know about the latest AIX 7.3 capabilities to optimize your critical systems.

Common Iberia 2025

Ubuntu on Power: containers, AI, DB and other 100% open source wonders
Explore the possibilities of the open source ecosystem on Power architectures.

Ubuntu at Power Common Iberia 2025

ILE RPG – Using IBM i Services (SQL) and QSYS2 SQL Functions
Learn how to take full advantage of native IBM i SQL services in your RPG applications.

Common Iberia 2025

In addition to the presentation of Project BOB (IBM’s integrated development assistant), the event includes sessions on AI, high availability, PowerVS, modern development with VS Code, and an open discussion on AI use cases in IBM i.


✅ Reserve your place now

Connect with the IBM Power community, share success stories and discover the latest innovations in critical systems – we look forward to seeing you in Madrid!

SIXE