Porting MariaDB to IBM AIX (Part 2): How AIX Matches Linux

How I Closed a 23x Performance Gap (Spoiler: There Was No Gap)

Part 2: From “AIX is Slow” to “AIX Matches Linux”

In Part 1, I wrestled with CMake, implemented a thread pool from scratch, and shipped a stable MariaDB 11.8.5 for AIX. The server handled 1,000 concurrent connections and 11 million queries with zero memory leaks.

Then I ran a vector search benchmark.

AIX: 42 queries per second.
Linux (same hardware): 971 queries per second.

Twenty-three times slower. On identical IBM Power S924 hardware. Same MariaDB version. Same dataset.

This is the story of how we discovered there was no performance gap at all — just configuration mistakes and a suboptimal compiler.

Chapter 1: The Sinking Feeling

There’s a particular kind of despair that comes from seeing a 23x performance gap on identical hardware. It’s the “maybe I should have become a florist” kind of despair.

Let me set the scene: both machines are LPARs running on IBM Power S924 servers with POWER9 processors at 2750 MHz. Same MariaDB 11.8.5. Same test dataset — 100,000 vectors with 768 dimensions, using MariaDB’s MHNSW (Hierarchical Navigable Small World) index for vector search.

The benchmark was simple: find the 10 nearest neighbors to a query vector. The kind of operation that powers every AI-enhanced search feature you’ve ever used.

Linux did it in about 1 millisecond. AIX took 24 milliseconds.

My first instinct was denial. “The benchmark must be wrong.” It wasn’t. “Maybe the index is corrupted.” It wasn’t. “Perhaps the network is slow.” It was a local socket connection.

Time to dig in.

Chapter 2: The First 65x — Configuration Matters

The Cache That Forgot Everything

The first clue came from MariaDB’s profiler. Every single query was taking the same amount of time, whether it was the first or the hundredth. That’s not how caches work.

I checked MariaDB’s MHNSW configuration:

SHOW VARIABLES LIKE 'mhnsw%';
mhnsw_max_cache_size: 16777216

16 MB. Our vector graph needs about 300 MB to hold the HNSW structure in memory.

Here’s the kicker: when the cache fills up, MariaDB doesn’t evict old entries (no LRU). It throws everything away and starts fresh. Every. Single. Query.

Imagine a library where, when the shelves get full, the librarian burns all the books and orders new copies. For every patron.

Fix: mhnsw_max_cache_size = 4GB in the server configuration.
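
In practice that is one line in the server configuration plus a restart. A minimal sketch, assuming a conf.d-style include directory (the path and the [mariadbd] section name are assumptions; adjust to your installation):

# Persist a 4 GB MHNSW cache (use 4294967296 if your build rejects the G suffix).
cat >> /etc/mysql/conf.d/vector.cnf <<'EOF'
[mariadbd]
mhnsw_max_cache_size = 4G
EOF

# After restarting the server, confirm the new value took effect.
mariadb -e "SHOW VARIABLES LIKE 'mhnsw_max_cache_size';"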

Result: 42 QPS → 112 QPS. 2.7x improvement from one config line.

The Page Size Problem

AIX defaults to 4 KB memory pages. Linux on POWER uses 64 KB pages.

For MHNSW’s access pattern — pointer-chasing across a 300 MB graph — this matters enormously. With 4 KB pages, you need 16x more TLB (Translation Lookaside Buffer) entries to map the same amount of memory. TLB misses are expensive.

Think of it like navigating a city. With 4 KB pages, you need directions for every individual building. With 64 KB pages, you get directions by neighborhood. Much faster when you’re constantly jumping around.

Fix: Wrapper script that sets LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K
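
A minimal sketch of such a wrapper (the mariadbd path is an assumption; LDR_CNTRL only affects processes that are exec'd with it in their environment):

#!/bin/sh
# Ask the AIX loader for 64K pages for data, text, stack and shared memory,
# then exec the real server binary so the setting applies to the mariadbd process.
LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K@SHMPSIZE=64K
export LDR_CNTRL
exec /opt/freeware/sbin/mariadbd "$@"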

Result: 112 QPS → 208 QPS sequential, and 2,721 QPS with 12 parallel workers.
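
To confirm the pages actually changed, svmon reports the page size of each memory segment of a running process (the PID lookup here is illustrative):

# 'm' in svmon's PSize column means 64K pages, 's' means 4K.
svmon -P $(ps -eo pid,comm | awk '$2 == "mariadbd" {print $1; exit}') | head -20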

The Scoreboard After Phase 1

Configuration   Sequential QPS   With 12 Workers
Baseline        42               ~42
+ 4GB cache     112              -
+ 64K pages     208              2,721

65x improvement from two configuration changes. No code modifications.

But we were still 6x slower than Linux per-core. The investigation continued.

Chapter 3: The CPU vs Memory Stall Mystery

With configuration fixed, I pulled out the profiling tools. MariaDB has a built-in profiler that breaks down query time by phase.

AIX:

Sending data: 4.70ms total
  - CPU_user: 1.41ms
  - CPU_system: ~0ms
  - Stalls: 3.29ms (70% of total!)

Linux:

Sending data: 0.81ms total
  - CPU_user: 0.80ms
  - Stalls: ~0.01ms (1% of total)

The CPU execution time was 1.8x slower on AIX — explainable by compiler differences. But the memory stalls were 329x worse.

The Root Cause: Hypervisor Cache Invalidation

Here’s something that took me two days to figure out: in a shared LPAR (Logical Partition), the POWER hypervisor periodically preempts virtual processors to give time to other partitions. When it does this, it may invalidate L2/L3 cache lines.

MHNSW’s graph traversal is pointer-chasing across 300 MB of memory — literally the worst-case scenario for cache invalidation. You’re jumping from node to node, each in a different part of memory, and the hypervisor is periodically flushing your cache.

It’s like trying to read a book while someone keeps closing it and putting it back on the shelf.

The Linux system had dedicated processors. The AIX system was running shared. Not apples to apples.

But before I could test dedicated processors, I needed to fix the compiler problem.

Chapter 4: The Compiler Odyssey

Everything I Tried With GCC (And Why It Failed)

Attempt                          Result              Why
-flto (Link Time Optimization)   Impossible          GCC LTO requires ELF format; AIX uses XCOFF
-fprofile-generate (PGO)         Build fails         TOC-relative relocation assembler errors
-ffast-math                      Breaks everything   IEEE float violations corrupt bloom filter hashing
-funroll-loops                   Slower              Instruction cache bloat; POWER9 doesn't like it
-finline-functions               Slower              Same I-cache problem

The AIX Toolbox GCC is built without LTO support. It’s not a flag you forgot — it’s architecturally impossible because GCC’s LTO implementation requires ELF, and AIX uses XCOFF.

Ubuntu’s MariaDB packages use -flto=auto. That optimization simply doesn’t exist for AIX with GCC.

IBM Open XL: The Plot Twist

At this point, I’d spent three days trying to make GCC faster. Time to try something different.

IBM Open XL C/C++ 17.1.3 is IBM’s modern compiler, based on LLVM/Clang. It generates significantly better code for POWER9 than GCC.

Building MariaDB with Open XL required solving five different problems (pulled together into a single configure sketch after the list):

  1. Missing HTM header: Open XL doesn’t have GCC’s htmxlintrin.h. I created a stub.
  2. 32-bit AR by default: AIX tools default to 32-bit. Set OBJECT_MODE=64.
  3. Incompatible LLVM AR: Open XL’s AR couldn’t handle XCOFF. Used system /usr/bin/ar.
  4. OpenSSL conflicts: Used -DWITH_SSL=system to avoid bundled wolfSSL issues.
  5. Missing library paths: Explicit -L/opt/freeware/lib for the linker.
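
Put together, the configure step looked roughly like this. This is a sketch rather than our exact invocation: the Open XL driver names (ibm-clang, ibm-clang++_r) and the library path are as installed on our system and may differ on yours.

# 64-bit object mode for the whole AIX toolchain (ar, ld, nm, ...).
export OBJECT_MODE=64

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=ibm-clang \
  -DCMAKE_CXX_COMPILER=ibm-clang++_r \
  -DCMAKE_C_FLAGS="-O3 -mcpu=power9" \
  -DCMAKE_CXX_FLAGS="-O3 -mcpu=power9" \
  -DCMAKE_AR=/usr/bin/ar \
  -DWITH_SSL=system \
  -DCMAKE_EXE_LINKER_FLAGS="-L/opt/freeware/lib" \
  -DCMAKE_SHARED_LINKER_FLAGS="-L/opt/freeware/lib"
# (The htmxlintrin.h stub from problem 1 is a source-level fix, not a configure flag.)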

Then I ran the benchmark:

Compiler         30 Queries   Per-Query
GCC 13.3.0       0.190s       6.3ms
Open XL 17.1.3   0.063s       2.1ms

Three times faster. Same source code. Same optimization flags (-O3 -mcpu=power9).

And here’s the bonus: GCC’s benchmark variance was 10-40% between runs. Open XL’s variance was under 2%. Virtually no jitter.

Why Such a Huge Difference?

Open XL (being LLVM-based) has:

  • Better instruction scheduling for POWER9’s out-of-order execution
  • Superior register allocation
  • More aggressive optimization passes

GCC’s POWER/XCOFF backend simply isn’t as mature. The AIX Toolbox GCC is functional, but it’s not optimized for performance-critical workloads.

Chapter 5: The LTO and PGO Dead Ends

Hope springs eternal. Maybe Open XL’s LTO and PGO would work?

LTO: The Irony

Open XL supports -flto=full on XCOFF. It actually builds! But…

Result: 27% slower than non-LTO Open XL.

Why? AIX shared libraries require an explicit export list (exports.exp). With LTO, CMake’s script saw ~27,000 symbols to export.

LTO’s main benefit is internalizing functions — marking them as local so they can be optimized away or inlined. When you’re forced to export 27,000 symbols, none of them can be internalized. The LTO overhead (larger intermediate files, slower link) remains, but the benefit disappears.

It’s like paying for a gym membership and then being told you can’t use any of the equipment.

PGO: The Profiles That Never Were

Profile-Guided Optimization sounded promising; the intended flow (sketched after the list) is:

  1. Build with -fprofile-generate
  2. Run training workload
  3. Rebuild with -fprofile-use
  4. Enjoy faster code
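
For reference, the LLVM-style PGO flow with Open XL looks roughly like this (a sketch: build paths and the training script are placeholders, and step 2 is exactly where it falls apart on AIX):

# 1. Instrumented build.
cmake .. -DCMAKE_C_FLAGS="-O3 -mcpu=power9 -fprofile-generate" \
         -DCMAKE_CXX_FLAGS="-O3 -mcpu=power9 -fprofile-generate"
make -j8

# 2. Start the instrumented server with LLVM_PROFILE_FILE set, run a training
#    workload, then shut it down so the profile data gets flushed on exit.
LLVM_PROFILE_FILE=/tmp/mariadb-%p.profraw ./start-and-run-training.sh

# 3. Merge the raw profiles and rebuild with them.
llvm-profdata merge -o mariadb.profdata /tmp/mariadb-*.profraw
cmake .. -DCMAKE_C_FLAGS="-O3 -mcpu=power9 -fprofile-use=mariadb.profdata" \
         -DCMAKE_CXX_FLAGS="-O3 -mcpu=power9 -fprofile-use=mariadb.profdata"
make -j8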

Step 1 worked. Step 2… the profiles never appeared.

I manually linked the LLVM profiling runtime into the shared library. Still no profiles.

The root cause: LLVM’s profiling runtime uses atexit() or __attribute__((destructor)) to write profiles on exit. On AIX with XCOFF, shared library destructor semantics are different from ELF. The handler simply isn’t called reliably for complex multi-library setups like MariaDB.

Simple test cases work. Real applications don’t.

Chapter 6: The LPAR Revelation

Now I had a fast compiler. Time to test dedicated processors and eliminate the hypervisor cache invalidation issue.

The Test Matrix

LPAR Config           GCC      Open XL
12 shared vCPUs       0.190s   0.063s
12 dedicated capped   0.205s   0.082s
21 dedicated capped   0.320s   0.067s

Wait. Shared is faster than dedicated?

The WoF Factor

POWER9 has a feature called Workload Optimized Frequency (WoF). In shared mode with low utilization, a single core can boost to ~3.8 GHz. Dedicated capped processors are locked at 2750 MHz.

For a single-threaded query, shared mode gets 38% more clock speed. That beats the cache invalidation penalty for this workload.
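
You can observe this from inside the partition. AIX ships pmcycles as part of its performance tooling; it reports the measured frequency of each logical processor, and on a lightly loaded shared LPAR you would expect values well above the 2750 MHz nominal, while dedicated capped processors stay pinned there:

# Print the measured clock speed of every logical processor.
pmcycles -m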

Think of it like choosing between a sports car on a highway with occasional traffic (shared) versus a truck with a reserved lane but a speed limit (dedicated capped).

The PowerVM Donating Mode Disaster

There’s a third option: dedicated processors in “Donating” mode, which donates idle cycles back to the shared pool.

Mode       GCC      Open XL
Capped     0.205s   0.082s
Donating   0.325s   0.085s

60% regression with GCC.

Every time a query bursts, there’s latency reclaiming the donated cycles. For bursty, single-threaded workloads like database queries, this is devastating.

Recommendation: Never use Donating mode for database workloads.
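
If you are not sure how a given LPAR is configured, lparstat reports it from inside the partition (field names vary slightly across AIX levels):

# 'Type' distinguishes Shared from Dedicated (including donating variants),
# 'Mode' shows Capped vs Uncapped, and the entitled capacity is listed alongside.
lparstat -i | egrep -i 'type|mode|entitled'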

The 21-Core Sweet Spot

With 21 dedicated cores (versus Linux’s 24), Open XL achieved 0.067s — nearly matching the 0.063s from shared mode. The extra L3 cache from more cores compensates for the lack of WoF frequency boost.

Chapter 7: The Final Scoreboard (Plot Twist)

Fresh benchmarks on identical POWER9 hardware, January 2026:

Platform        Cores          30 Queries
Linux           24 dedicated   0.057s
AIX + Open XL   12 shared      0.063s
AIX + Open XL   21 dedicated   0.067s
AIX + GCC       12 shared      0.190s
AIX + GCC       21 dedicated   0.320s

Wait. The AIX system has 21 cores vs Linux’s 24. That’s 12.5% fewer cores, which means 12.5% less L3 cache.

The measured “gap”? 10-18%.

That’s not a performance gap. That’s a hardware difference.

With IBM Open XL, AIX delivers identical per-core performance to Linux. The 23x gap we started with? It was never about AIX being slow. It was:

  1. A misconfigured cache (16MB instead of 4GB)
  2. Wrong page sizes (4KB instead of 64KB)
  3. The wrong compiler (GCC instead of Open XL)

The “AIX is slow” myth is dead.

The Complete Failure Museum

Science isn’t just about what works — it’s about documenting what doesn’t. Here’s our wall of “nice try, but no”:

What We Tried                 Result           Notes
mhnsw_max_cache_size = 4GB    5x faster        Eliminates cache thrashing
LDR_CNTRL 64K pages           ~40% faster      Reduces TLB misses
MAP_ANON_64K mmap patch       ~8% faster       Minor TLB improvement
IBM Open XL 17.1.3            3x faster        Better POWER9 codegen
Shared LPAR (vs dedicated)    ~25% faster      WoF frequency boost
Open XL + LTO                 27% slower       AIX exports conflict
Open XL + PGO                 Doesn't work     Profiles not written
GCC LTO                       Impossible       XCOFF not supported
GCC PGO                       Build fails      TOC relocation errors
-ffast-math                   Breaks MHNSW     Float corruption
-funroll-loops                Worse            I-cache bloat
POWER VSX bloom filter        41% slower       No 64-bit vec multiply on P9
Software prefetch             No effect        Hypervisor evicts prefetched data
DSCR tuning                   Blocked          Hypervisor controls DSCR in shared LPAR
Donating mode                 60% regression   Never use for databases

The VSX result is particularly interesting: we implemented a SIMD bloom filter using POWER’s vector extensions. It was 41% slower than scalar. POWER9 has no 64-bit vector multiply — you need vec_extract → scalar multiply → vec_insert for each lane, which is slower than letting the Out-of-Order engine handle a scalar loop.

What I Learned

1. Defaults Matter More Than You Think

A 16 MB default cache was the single biggest contributor to turning sub-millisecond queries into 24 ms queries: a brutal penalty from one misconfigured parameter.

When you’re porting software, question every default. What works on Linux might not work on your platform.

2. The “AIX is Slow” Myth Was Always a Toolchain Issue

With GCC, we were 3-4x slower than Linux. With Open XL, we match Linux per-core.

The platform was never slow. The default toolchain just wasn’t optimized for performance-critical workloads. Choose the right compiler.

3. Virtualization Has Hidden Trade-offs

Shared LPAR can be faster than dedicated for single-threaded workloads (WoF frequency boost). Dedicated is better for sustained multi-threaded throughput. Donating mode is a trap.

Know your workload. Choose your LPAR configuration accordingly.

4. Not Every Optimization Ports

LTO, PGO, and SIMD vectorization all failed on AIX for various reasons. The techniques that make Linux fast don’t always translate.

Sometimes the “obvious” optimization is the wrong choice. Measure everything.

5. Sometimes There’s No Gap At All

We spent days investigating a “performance gap” that turned out to be:

  • Configuration mistakes
  • Wrong compiler
  • Fewer cores on the test system

The lesson: verify your baselines. Make sure you’re comparing apples to apples before assuming there’s a problem to solve.

Recommendations

For AIX MariaDB Users

  1. Use the Open XL build (Release 3, coming soon)
  2. Set mhnsw_max_cache_size to at least 4GB for vector search
  3. Keep shared LPAR for single-query latency
  4. Never use Donating mode for databases
  5. Use 64K pages via the LDR_CNTRL wrapper

For Upstream MariaDB

  1. Increase mhnsw_max_cache_size default — 16MB is far too small
  2. Implement LRU eviction — discarding the entire cache on overflow is brutal
  3. Don’t add POWER VSX bloom filter — scalar is faster on POWER9

What’s Next

The RPMs are published at aix.librepower.org. Release 2 includes the configuration fixes. Release 3 with Open XL builds is coming once we secure a commercial license (the evaluation expires in 59 days).

Immediate priorities:

  • Commercial Open XL license: Evaluation expires soon. Need IBM license for Release 3 production RPM.
  • Native AIO implementation: AIX has POSIX AIO and Windows-compatible IOCP. Time to write the InnoDB backend.
  • Upstream MHNSW feedback: The default mhnsw_max_cache_size of 16MB is too small for real workloads; we’ll suggest a larger default.

For organizations already running mission-critical workloads on AIX — and there are many, from banks to airlines to healthcare systems — the option to also run modern, high-performance MariaDB opens new possibilities.

AIX matches Linux. The myth is dead. And MariaDB on AIX is ready for production.

TL;DR

  • Started with 23x performance gap (42 QPS vs 971 QPS)
  • Fixed cache config: 5x improvement
  • Fixed page size: ~40% more
  • Switched to IBM Open XL: 3x improvement over GCC
  • Used shared LPAR: ~25% faster than dedicated (WoF boost)
  • Final result: NO GAP; the remaining 10-18% difference is explained by 12.5% fewer cores (21 vs 24)
  • AIX matches Linux per-core performance with Open XL
  • Open XL LTO: doesn’t help (27% slower)
  • Open XL PGO: doesn’t work (AIX XCOFF issue)
  • POWER VSX SIMD: 41% slower than scalar (no 64-bit vec multiply)
  • Donating mode: 60% regression — never use for databases
  • “AIX is slow for Open Source DBs” was always a toolchain myth

Questions? Ideas? Running MariaDB on AIX and want to share your experience?

This work is part of LibrePower – Unlocking IBM Power Systems through open source. Unmatched RAS. Superior TCO. Minimal footprint 🌍

LibrePower AIX project repository: gitlab.com/librepower/aix

SIXE