Porting MariaDB to IBM AIX (Part 2): How AIX Matches Linux

How I Closed a 23x Performance Gap (Spoiler: There Was No Gap)

Part 2: From “AIX is Slow” to “AIX Matches Linux”

In Part 1, I wrestled with CMake, implemented a thread pool from scratch, and shipped a stable MariaDB 11.8.5 for AIX. The server handled 1,000 concurrent connections and 11 million queries with zero memory leaks.

Then I ran a vector search benchmark.

AIX: 42 queries per second.
Linux (same hardware): 971 queries per second.

Twenty-three times slower. On identical IBM Power S924 hardware. Same MariaDB version. Same dataset.

This is the story of how we discovered there was no performance gap at all — just configuration mistakes and a suboptimal compiler.

Chapter 1: The Sinking Feeling

There’s a particular kind of despair that comes from seeing a 23x performance gap on identical hardware. It’s the “maybe I should have become a florist” kind of despair.

Let me set the scene: both machines are LPARs running on IBM Power S924 servers with POWER9 processors at 2750 MHz. Same MariaDB 11.8.5. Same test dataset — 100,000 vectors with 768 dimensions, using MariaDB’s MHNSW (Hierarchical Navigable Small World) index for vector search.

The benchmark was simple: find the 10 nearest neighbors to a query vector. The kind of operation that powers every AI-enhanced search feature you’ve ever used.

Linux did it in about 1 millisecond. AIX took 24 milliseconds.

My first instinct was denial. “The benchmark must be wrong.” It wasn’t. “Maybe the index is corrupted.” It wasn’t. “Perhaps the network is slow.” It was a local socket connection.

Time to dig in.

Chapter 2: The First 65x — Configuration Matters

The Cache That Forgot Everything

The first clue came from MariaDB’s profiler. Every single query was taking the same amount of time, whether it was the first or the hundredth. That’s not how caches work.

I checked MariaDB’s MHNSW configuration:

SHOW VARIABLES LIKE 'mhnsw%';
mhnsw_max_cache_size: 16777216

16 MB. Our vector graph needs about 300 MB to hold the HNSW structure in memory.

Here’s the kicker: when the cache fills up, MariaDB doesn’t evict old entries (no LRU). It throws everything away and starts fresh. Every. Single. Query.

Imagine a library where, when the shelves get full, the librarian burns all the books and orders new copies. For every patron.

Fix: mhnsw_max_cache_size = 4GB in the server configuration.
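
In practice that is one line in the server configuration plus a restart. A minimal sketch, assuming a conf.d-style include directory (the path and the [mariadbd] section name are assumptions; adjust to your installation):

# Persist a 4 GB MHNSW cache (use 4294967296 if your build rejects the G suffix).
cat >> /etc/mysql/conf.d/vector.cnf <<'EOF'
[mariadbd]
mhnsw_max_cache_size = 4G
EOF

# After restarting the server, confirm the new value took effect.
mariadb -e "SHOW VARIABLES LIKE 'mhnsw_max_cache_size';"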

Result: 42 QPS → 112 QPS. 2.7x improvement from one config line.

The Page Size Problem

AIX defaults to 4 KB memory pages. Linux on POWER uses 64 KB pages.

For MHNSW’s access pattern — pointer-chasing across a 300 MB graph — this matters enormously. With 4 KB pages, you need 16x more TLB (Translation Lookaside Buffer) entries to map the same amount of memory. TLB misses are expensive.

Think of it like navigating a city. With 4 KB pages, you need directions for every individual building. With 64 KB pages, you get directions by neighborhood. Much faster when you’re constantly jumping around.

Fix: Wrapper script that sets LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K
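
A minimal sketch of such a wrapper (the mariadbd path is an assumption; LDR_CNTRL only affects processes that are exec'd with it in their environment):

#!/bin/sh
# Ask the AIX loader for 64K pages for data, text, stack and shared memory,
# then exec the real server binary so the setting applies to the mariadbd process.
LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K
export LDR_CNTRL
exec /opt/freeware/sbin/mariadbd "$@"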

Result: 112 QPS → 208 QPS sequential, and 2,721 QPS with 12 parallel workers.
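
To confirm the pages actually changed, svmon reports the page size of each memory segment of a running process (the PID lookup here is illustrative):

# 'm' in svmon's PSize column means 64K pages, 's' means 4K.
svmon -P $(ps -eo pid,comm | awk '$2 == "mariadbd" {print $1; exit}') | head -20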

The Scoreboard After Phase 1

Configuration   Sequential QPS   With 12 Workers
Baseline        42               ~42
+ 4GB cache     112              -
+ 64K pages     208              2,721

65x improvement from two configuration changes. No code modifications.

But we were still 6x slower than Linux per-core. The investigation continued.

Chapter 3: The CPU vs Memory Stall Mystery

With configuration fixed, I pulled out the profiling tools. MariaDB has a built-in profiler that breaks down query time by phase.

AIX:

Sending data: 4.70ms total
  - CPU_user: 1.41ms
  - CPU_system: ~0ms
  - Stalls: 3.29ms (70% of total!)

Linux:

Sending data: 0.81ms total
  - CPU_user: 0.80ms
  - Stalls: ~0.01ms (1% of total)

The CPU execution time was 1.8x slower on AIX — explainable by compiler differences. But the memory stalls were 329x worse.

The Root Cause: Hypervisor Cache Invalidation

Here’s something that took me two days to figure out: in a shared LPAR (Logical Partition), the POWER hypervisor periodically preempts virtual processors to give time to other partitions. When it does this, it may invalidate L2/L3 cache lines.

MHNSW’s graph traversal is pointer-chasing across 300 MB of memory — literally the worst-case scenario for cache invalidation. You’re jumping from node to node, each in a different part of memory, and the hypervisor is periodically flushing your cache.

It’s like trying to read a book while someone keeps closing it and putting it back on the shelf.

The Linux system had dedicated processors. The AIX system was running shared. Not apples to apples.

But before I could test dedicated processors, I needed to fix the compiler problem.

Chapter 4: The Compiler Odyssey

Everything I Tried With GCC (And Why It Failed)

Attempt                          Result              Why
-flto (Link Time Optimization)   Impossible          GCC LTO requires ELF format; AIX uses XCOFF
-fprofile-generate (PGO)         Build fails         TOC-relative relocation assembler errors
-ffast-math                      Breaks everything   IEEE float violations corrupt bloom filter hashing
-funroll-loops                   Slower              Instruction cache bloat; POWER9 doesn't like it
-finline-functions               Slower              Same I-cache problem

The AIX Toolbox GCC is built without LTO support. It’s not a flag you forgot — it’s architecturally impossible because GCC’s LTO implementation requires ELF, and AIX uses XCOFF.

Ubuntu’s MariaDB packages use -flto=auto. That optimization simply doesn’t exist for AIX with GCC.

IBM Open XL: The Plot Twist

At this point, I’d spent three days trying to make GCC faster. Time to try something different.

IBM Open XL C/C++ 17.1.3 is IBM’s modern compiler, based on LLVM/Clang. It generates significantly better code for POWER9 than GCC.

Building MariaDB with Open XL required solving five different problems (pulled together into a single configure sketch after the list):

  1. Missing HTM header: Open XL doesn’t have GCC’s htmxlintrin.h. I created a stub.
  2. 32-bit AR by default: AIX tools default to 32-bit. Set OBJECT_MODE=64.
  3. Incompatible LLVM AR: Open XL’s AR couldn’t handle XCOFF. Used system /usr/bin/ar.
  4. OpenSSL conflicts: Used -DWITH_SSL=system to avoid bundled wolfSSL issues.
  5. Missing library paths: Explicit -L/opt/freeware/lib for the linker.
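
Put together, the configure step looked roughly like this. This is a sketch rather than our exact invocation: the Open XL driver names (ibm-clang, ibm-clang++_r) and the library path are as installed on our system and may differ on yours.

# 64-bit object mode for the whole AIX toolchain (ar, ld, nm, ...).
export OBJECT_MODE=64

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=ibm-clang \
  -DCMAKE_CXX_COMPILER=ibm-clang++_r \
  -DCMAKE_C_FLAGS="-O3 -mcpu=power9" \
  -DCMAKE_CXX_FLAGS="-O3 -mcpu=power9" \
  -DCMAKE_AR=/usr/bin/ar \
  -DWITH_SSL=system \
  -DCMAKE_EXE_LINKER_FLAGS="-L/opt/freeware/lib" \
  -DCMAKE_SHARED_LINKER_FLAGS="-L/opt/freeware/lib"
# (The htmxlintrin.h stub from problem 1 is a source-level fix, not a configure flag.)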

Then I ran the benchmark:

Compiler         30 Queries   Per-Query
GCC 13.3.0       0.190s       6.3ms
Open XL 17.1.3   0.063s       2.1ms

Three times faster. Same source code. Same optimization flags (-O3 -mcpu=power9).

And here’s the bonus: GCC’s benchmark variance was 10-40% between runs. Open XL’s variance was under 2%. Virtually no jitter.

Why Such a Huge Difference?

Open XL (being LLVM-based) has:

  • Better instruction scheduling for POWER9’s out-of-order execution
  • Superior register allocation
  • More aggressive optimization passes

GCC’s POWER/XCOFF backend simply isn’t as mature. The AIX Toolbox GCC is functional, but it’s not optimized for performance-critical workloads.

Chapter 5: The LTO and PGO Dead Ends

Hope springs eternal. Maybe Open XL’s LTO and PGO would work?

LTO: The Irony

Open XL supports -flto=full on XCOFF. It actually builds! But…

Result: 27% slower than non-LTO Open XL.

Why? AIX shared libraries require an explicit export list (exports.exp). With LTO, CMake’s script saw ~27,000 symbols to export.

LTO’s main benefit is internalizing functions — marking them as local so they can be optimized away or inlined. When you’re forced to export 27,000 symbols, none of them can be internalized. The LTO overhead (larger intermediate files, slower link) remains, but the benefit disappears.

It’s like paying for a gym membership and then being told you can’t use any of the equipment.

PGO: The Profiles That Never Were

Profile-Guided Optimization sounded promising; the intended flow (sketched after the list) is:

  1. Build with -fprofile-generate
  2. Run training workload
  3. Rebuild with -fprofile-use
  4. Enjoy faster code
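
For reference, the LLVM-style PGO flow with Open XL looks roughly like this (a sketch: build paths and the training script are placeholders, and step 2 is exactly where it falls apart on AIX):

# 1. Instrumented build.
cmake .. -DCMAKE_C_FLAGS="-O3 -mcpu=power9 -fprofile-generate" \
         -DCMAKE_CXX_FLAGS="-O3 -mcpu=power9 -fprofile-generate"
make -j8

# 2. Start the instrumented server with LLVM_PROFILE_FILE set, run a training
#    workload, then shut it down so the profile data gets flushed on exit.
LLVM_PROFILE_FILE=/tmp/mariadb-%p.profraw ./start-and-run-training.sh

# 3. Merge the raw profiles and rebuild with them.
llvm-profdata merge -o mariadb.profdata /tmp/mariadb-*.profraw
cmake .. -DCMAKE_C_FLAGS="-O3 -mcpu=power9 -fprofile-use=mariadb.profdata" \
         -DCMAKE_CXX_FLAGS="-O3 -mcpu=power9 -fprofile-use=mariadb.profdata"
make -j8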

Step 1 worked. Step 2… the profiles never appeared.

I manually linked the LLVM profiling runtime into the shared library. Still no profiles.

The root cause: LLVM’s profiling runtime uses atexit() or __attribute__((destructor)) to write profiles on exit. On AIX with XCOFF, shared library destructor semantics are different from ELF. The handler simply isn’t called reliably for complex multi-library setups like MariaDB.

Simple test cases work. Real applications don’t.

Chapter 6: The LPAR Revelation

Now I had a fast compiler. Time to test dedicated processors and eliminate the hypervisor cache invalidation issue.

The Test Matrix

LPAR Config           GCC      Open XL
12 shared vCPUs       0.190s   0.063s
12 dedicated capped   0.205s   0.082s
21 dedicated capped   0.320s   0.067s

Wait. Shared is faster than dedicated?

The WoF Factor

POWER9 has a feature called Workload Optimized Frequency (WoF). In shared mode with low utilization, a single core can boost to ~3.8 GHz. Dedicated capped processors are locked at 2750 MHz.

For a single-threaded query, shared mode gets 38% more clock speed. That beats the cache invalidation penalty for this workload.
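
You can observe this from inside the partition. AIX ships pmcycles as part of its performance tooling; it reports the measured frequency of each logical processor, and on a lightly loaded shared LPAR you would expect values well above the 2750 MHz nominal, while dedicated capped processors stay pinned there:

# Print the measured clock speed of every logical processor.
pmcycles -m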

Think of it like choosing between a sports car on a highway with occasional traffic (shared) versus a truck with a reserved lane but a speed limit (dedicated capped).

The PowerVM Donating Mode Disaster

There’s a third option: dedicated processors in “Donating” mode, which donates idle cycles back to the shared pool.

Mode       GCC      Open XL
Capped     0.205s   0.082s
Donating   0.325s   0.085s

60% regression with GCC.

Every time a query bursts, there’s latency reclaiming the donated cycles. For bursty, single-threaded workloads like database queries, this is devastating.

Recommendation: Never use Donating mode for database workloads.
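
If you are not sure how a given LPAR is configured, lparstat reports it from inside the partition (field names vary slightly across AIX levels):

# 'Type' distinguishes Shared from Dedicated (including donating variants),
# 'Mode' shows Capped vs Uncapped, and the entitled capacity is listed alongside.
lparstat -i | egrep -i 'type|mode|entitled'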

The 21-Core Sweet Spot

With 21 dedicated cores (versus Linux’s 24), Open XL achieved 0.067s — nearly matching the 0.063s from shared mode. The extra L3 cache from more cores compensates for the lack of WoF frequency boost.

Chapter 7: The Final Scoreboard (Plot Twist)

Fresh benchmarks on identical POWER9 hardware, January 2026:

Platform        Cores          30 Queries
Linux           24 dedicated   0.057s
AIX + Open XL   12 shared      0.063s
AIX + Open XL   21 dedicated   0.067s
AIX + GCC       12 shared      0.190s
AIX + GCC       21 dedicated   0.320s

Wait. The AIX system has 21 cores vs Linux’s 24. That’s 12.5% fewer cores, which means 12.5% less L3 cache.

The measured “gap”? 10-18%.

That’s not a performance gap. That’s a hardware difference.

With IBM Open XL, AIX delivers identical per-core performance to Linux. The 23x gap we started with? It was never about AIX being slow. It was:

  1. A misconfigured cache (16MB instead of 4GB)
  2. Wrong page sizes (4KB instead of 64KB)
  3. The wrong compiler (GCC instead of Open XL)

The “AIX is slow” myth is dead.

The Complete Failure Museum

Science isn’t just about what works — it’s about documenting what doesn’t. Here’s our wall of “nice try, but no”:

What We Tried                 Result           Notes
mhnsw_max_cache_size = 4GB    5x faster        Eliminates cache thrashing
LDR_CNTRL 64K pages           ~40% faster      Reduces TLB misses
MAP_ANON_64K mmap patch       ~8% faster       Minor TLB improvement
IBM Open XL 17.1.3            3x faster        Better POWER9 codegen
Shared LPAR (vs dedicated)    ~25% faster      WoF frequency boost
Open XL + LTO                 27% slower       AIX exports conflict
Open XL + PGO                 Doesn't work     Profiles not written
GCC LTO                       Impossible       XCOFF not supported
GCC PGO                       Build fails      TOC relocation errors
-ffast-math                   Breaks MHNSW     Float corruption
-funroll-loops                Worse            I-cache bloat
POWER VSX bloom filter        41% slower       No 64-bit vec multiply on P9
Software prefetch             No effect        Hypervisor evicts prefetched data
DSCR tuning                   Blocked          Hypervisor controls DSCR in shared LPAR
Donating mode                 60% regression   Never use for databases

The VSX result is particularly interesting: we implemented a SIMD bloom filter using POWER’s vector extensions. It was 41% slower than scalar. POWER9 has no 64-bit vector multiply — you need vec_extract → scalar multiply → vec_insert for each lane, which is slower than letting the Out-of-Order engine handle a scalar loop.

What I Learned

1. Defaults Matter More Than You Think

A 16 MB default cache was the single biggest contributor to turning sub-millisecond queries into 24 ms queries: a brutal penalty from one misconfigured parameter.

When you’re porting software, question every default. What works on Linux might not work on your platform.

2. The “AIX is Slow” Myth Was Always a Toolchain Issue

With GCC, we were 3-4x slower than Linux. With Open XL, we match Linux per-core.

The platform was never slow. The default toolchain just wasn’t optimized for performance-critical workloads. Choose the right compiler.

3. Virtualization Has Hidden Trade-offs

Shared LPAR can be faster than dedicated for single-threaded workloads (WoF frequency boost). Dedicated is better for sustained multi-threaded throughput. Donating mode is a trap.

Know your workload. Choose your LPAR configuration accordingly.

4. Not Every Optimization Ports

LTO, PGO, and SIMD vectorization all failed on AIX for various reasons. The techniques that make Linux fast don’t always translate.

Sometimes the “obvious” optimization is the wrong choice. Measure everything.

5. Sometimes There’s No Gap At All

We spent days investigating a “performance gap” that turned out to be:

  • Configuration mistakes
  • Wrong compiler
  • Fewer cores on the test system

The lesson: verify your baselines. Make sure you’re comparing apples to apples before assuming there’s a problem to solve.

Recommendations

For AIX MariaDB Users

  1. Use the Open XL build (Release 3, coming soon)
  2. Set mhnsw_max_cache_size to at least 4GB for vector search
  3. Keep shared LPAR for single-query latency
  4. Never use Donating mode for databases
  5. Use 64K pages via the LDR_CNTRL wrapper

For Upstream MariaDB

  1. Increase mhnsw_max_cache_size default — 16MB is far too small
  2. Implement LRU eviction — discarding the entire cache on overflow is brutal
  3. Don’t add POWER VSX bloom filter — scalar is faster on POWER9

What’s Next

The RPMs are published at aix.librepower.org. Release 2 includes the configuration fixes. Release 3 with Open XL builds is coming once we secure a commercial license (the evaluation expires in 59 days).

Immediate priorities:

  • Commercial Open XL license: Evaluation expires soon. Need IBM license for Release 3 production RPM.
  • Native AIO implementation: AIX has POSIX AIO and Windows-compatible IOCP. Time to write the InnoDB backend.
  • Upstream MHNSW feedback: The default mhnsw_max_cache_size of 16MB is too small for real workloads; we’ll suggest a larger default.

For organizations already running mission-critical workloads on AIX — and there are many, from banks to airlines to healthcare systems — the option to also run modern, high-performance MariaDB opens new possibilities.

AIX matches Linux. The myth is dead. And MariaDB on AIX is ready for production.

TL;DR

  • Started with 23x performance gap (42 QPS vs 971 QPS)
  • Fixed cache config: 5x improvement
  • Fixed page size: ~40% more
  • Switched to IBM Open XL: 3x improvement over GCC
  • Used shared LPAR: ~25% faster than dedicated (WoF boost)
  • Final result: NO GAP; the remaining 10-18% difference is explained by 12.5% fewer cores (21 vs 24)
  • AIX matches Linux per-core performance with Open XL
  • Open XL LTO: doesn’t help (27% slower)
  • Open XL PGO: doesn’t work (AIX XCOFF issue)
  • POWER VSX SIMD: 41% slower than scalar (no 64-bit vec multiply)
  • Donating mode: 60% regression — never use for databases
  • “AIX is slow for Open Source DBs” was always a toolchain myth

Questions? Ideas? Running MariaDB on AIX and want to share your experience?

This work is part of LibrePower – Unlocking IBM Power Systems through open source. Unmatched RAS. Superior TCO. Minimal footprint 🌍

LibrePower AIX project repository: gitlab.com/librepower/aix

SIXE