Porting MariaDB to IBM AIX: 3 Weeks of Engineering Pain

Part 1: Bringing MariaDB to the Platform That Powers the World’s Most Critical Systems

There are decisions in life you make knowing full well they’ll cause you some pain. Getting married. Having children. Running a marathon. Porting MariaDB 11.8 to IBM AIX.

This is the story of the last one — and why I’d do it again in a heartbeat.

Chapter 1: “How Hard Can It Be?”

It all started with an innocent question during a team meeting: “Why don’t we have MariaDB on our AIX systems?”

Here’s the thing about AIX that people who’ve never worked with it don’t understand: AIX doesn’t mess around. When banks need five-nines uptime for their core banking systems, they run AIX. When airlines need reservation systems that cannot fail, they run AIX. When Oracle, Informix, or DB2 need to deliver absolutely brutal performance for mission-critical OLTP workloads, they run on AIX.

AIX isn’t trendy. AIX doesn’t have a cool mascot. AIX won’t be the subject of breathless tech blog posts about “disruption.” But when things absolutely, positively cannot fail — AIX is there, quietly doing its job while everyone else is busy rebooting their containers.

So why doesn’t MariaDB officially support AIX? Simple economics: the open source community has centered on Linux, and porting requires platform-specific expertise. MariaDB officially supports Linux, Windows, FreeBSD, macOS, and Solaris. AIX isn’t on the list — not because it’s a bad platform, but because no one had done the work yet.

At LibrePower, that’s exactly what we do.

My first mistake was saying out loud: “It’s probably just a matter of compiling it and adjusting a few things.”

Lesson #1: When someone says “just compile it” about software on AIX, they’re about to learn a lot about systems programming.

Chapter 2: CMake and the Three Unexpected Guests

Day one of compilation was… educational. CMake on AIX is like playing cards with someone who has a very different understanding of the rules — and expects you to figure them out yourself.

The Ghost Function Bug

AIX has an interesting characteristic: it declares functions in headers for compatibility even when those functions don’t actually exist at runtime. It’s like your GPS saying “turn right in 200 meters” but the street is a brick wall.

CMake does a CHECK_C_SOURCE_COMPILES to test if pthread_threadid_np() exists. The code compiles. CMake says “great, we have it!” The binary starts and… BOOM. Symbol not found.

Turns out pthread_threadid_np() is macOS-only. AIX declares it in headers because… well, I’m still not entirely sure. Maybe for some POSIX compatibility reason that made sense decades ago? Whatever the reason, GCC compiles it happily, and nothing complains until the loader tries to resolve the symbol at runtime.

Same story with getthrid(), which is OpenBSD-specific.

The fix:

IF(NOT CMAKE_SYSTEM_NAME MATCHES "AIX")
  CHECK_C_SOURCE_COMPILES("..." HAVE_PTHREAD_THREADID_NP)
ELSE()
  SET(HAVE_PTHREAD_THREADID_NP 0)  # Trust but verify... okay, just verify
ENDIF()

poll.h: Hide and Seek

AIX has <sys/poll.h>. It’s right there. You can cat it. But CMake doesn’t detect it.

After three hours debugging a “POLLIN undeclared” error in viosocket.c, I discovered the solution was simply forcing the define:

cmake ... -DHAVE_SYS_POLL_H=1

Three hours. For one flag.

(To be fair, this is a CMake platform detection issue, not an AIX issue. CMake’s checks assume Linux-style header layouts.)

The Cursed Plugins

At 98% compilation — 98%! — the wsrep_info plugin exploded with undefined symbols. Because it depends on Galera. Which we’re not using. But CMake compiles it anyway.

The same went for S3 (requires Aria symbols), Mroonga (requires Groonga), and RocksDB (deeply tied to Linux-specific optimizations).

Final CMake configuration:

-DPLUGIN_MROONGA=NO -DPLUGIN_ROCKSDB=NO -DPLUGIN_SPIDER=NO 
-DPLUGIN_TOKUDB=NO -DPLUGIN_OQGRAPH=NO -DPLUGIN_S3=NO -DPLUGIN_WSREP_INFO=NO

It looks like surgical amputation, but it’s actually just trimming the fat. These plugins are edge cases that few deployments need.

Chapter 3: Thread Pool, or How I Learned to Stop Worrying and Love the Mutex

This is where things got interesting. And by “interesting” I mean “I nearly gave myself a permanent twitch.”

MariaDB has two connection handling modes:

  • one-thread-per-connection: One thread per client. Simple. Scales like a car going uphill.
  • pool-of-threads: A fixed pool of threads handles all connections. Elegant. Efficient. And not available on AIX.

Why? Because the thread pool requires platform-specific I/O multiplexing APIs:

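| Platform      | API         | Status                    |
|---------------|-------------|---------------------------|
| Linux         | epoll       | Supported                 |
| FreeBSD/macOS | kqueue      | Supported                 |
| Solaris       | event ports | Supported                 |
| Windows       | IOCP        | Supported                 |
| AIX           | pollset     | Not supported (until now) |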

So… how hard can implementing pollset support be?

(Editor’s note: At this point the author required a 20-minute break and a beverage)

The ONESHOT Problem

Linux epoll has a wonderful flag called EPOLLONESHOT. It guarantees that a file descriptor fires events only once until you explicitly re-arm it. This prevents two threads from processing the same connection simultaneously.

AIX pollset is level-triggered. Only level-triggered. No options. If data is available, it reports it. Again and again and again. Like a helpful colleague who keeps reminding you about that email you haven’t answered yet.
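To see what “level-triggered” means in practice without an AIX box, here is a minimal sketch using portable POSIX poll(), which is level-triggered too. It is a stand-in, not the pollset API, and the function name level_triggered_reports is made up for this example. One byte is written to a pipe and never read, then the read end is polled several times.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns how many of `rounds` consecutive poll() calls report the read end
 * of a pipe as readable when one byte is written and never consumed.
 * Returns -1 if the pipe cannot be set up. */
int level_triggered_reports(int rounds)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };
    int reports = 0;
    for (int i = 0; i < rounds; i++) {
        /* timeout 0: report the current readiness state and return at once */
        if (poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN))
            reports++;
    }

    close(fds[0]);
    close(fds[1]);
    return reports;
}
```

Every call reports the descriptor again because the data is still sitting there. Now picture each of those calls coming from a different worker thread, and you have the problem EPOLLONESHOT solves.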

Eleven Versions of Increasing Wisdom

What followed were eleven iterations of code, each more elaborate than the last, trying to simulate ONESHOT behavior:

v1-v5 (The Age of Innocence)

I tried modifying event flags with PS_MOD. “If I change the event to 0, it’ll stop firing,” I thought. Spoiler: it didn’t stop firing.

v6-v7 (The State Machine Era)

“I know! I’ll maintain internal state and filter duplicate events.” The problem: there’s a time window between the kernel giving you the event and you updating your state. In that window, another thread can receive the same event.

v8-v9 (The Denial Phase)

“I’ll set the state to PENDING before processing.” It worked… sort of… until it didn’t.

v10 (Hope)

Finally found the solution: PS_DELETE + PS_ADD. When you receive an event, immediately delete the fd from the pollset. When you’re ready for more data, add it back.

// On receiving events: REMOVE
for (i = 0; i < ret; i++) {
    pctl.cmd = PS_DELETE;
    pctl.fd = native_events[i].fd;
    pollset_ctl(pollfd, &pctl, 1);
}

// When ready: ADD
pce.command = PS_ADD;
pollset_ctl_ext(pollfd, &pce, 1);

It worked! With -O2.

With -O3: segfault.
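The delete-then-re-add idea is easier to follow in a portable form. The sketch below is an analogy built on POSIX poll(), not the actual pollset code: poll() ignores entries with a negative fd, so setting the fd to -1 plays the role of PS_DELETE and restoring it plays the role of PS_ADD. The oneshot_slot struct and all function names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical one-entry "pollset" on top of poll(). */
struct oneshot_slot {
    struct pollfd pfd;  /* pfd.fd < 0 means "removed from the set" */
    int real_fd;        /* remembered so the entry can be re-armed */
};

void slot_init(struct oneshot_slot *s, int fd)
{
    s->real_fd = fd;
    s->pfd.fd = fd;
    s->pfd.events = POLLIN;
    s->pfd.revents = 0;
}

/* Wait for an event. On delivery, remove the fd before returning (the
 * PS_DELETE step) so a later wait cannot report it again.
 * Returns 1 if an event was delivered, 0 otherwise. */
int slot_wait(struct oneshot_slot *s, int timeout_ms)
{
    int n = poll(&s->pfd, 1, timeout_ms);
    if (n == 1 && (s->pfd.revents & POLLIN)) {
        s->pfd.fd = -1;
        return 1;
    }
    return 0;
}

/* Re-arm once the owner is ready for more data (the PS_ADD step). */
void slot_rearm(struct oneshot_slot *s)
{
    s->pfd.fd = s->real_fd;
    s->pfd.revents = 0;
}

/* Demo: a pipe holding one unread byte, three waits, one re-arm.
 * Returns the number of delivered events, or -1 on setup failure. */
int oneshot_demo(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct oneshot_slot s;
    slot_init(&s, fds[0]);

    int delivered = 0;
    delivered += slot_wait(&s, 0);  /* delivered, fd removed */
    delivered += slot_wait(&s, 0);  /* nothing: fd is out of the set */
    slot_rearm(&s);
    delivered += slot_wait(&s, 0);  /* delivered again: data still unread */

    close(fds[0]);
    close(fds[1]);
    return delivered;
}
```

Note what this single-threaded sketch does not show: with two threads waiting on the same set, both can return from the wait before either has removed the fd. That gap is the subject of the next section.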

The Dark Night of the Soul (The -O3 Bug)

Picture my face. I have code working perfectly with -O2. I enable -O3 for production benchmarks and the server crashes with “Got packets out of order” or a segfault in CONNECT::create_thd().

I spent two days thinking it was a compiler bug. GCC 13.3.0 on AIX. I blamed the compiler. I blamed the linker. I blamed everything except my own code.

The problem was subtler: MariaDB has two concurrent code paths calling io_poll_wait on the same pollset:

  • The listener blocks with timeout=-1
  • Workers poll with timeout=0 before going to sleep

With -O2, the code was slow enough that the race condition window was microscopic. With -O3, faster code execution made the race condition appear constantly.

This is actually a great example of why testing at your production optimization level matters on any platform. The same race condition exists in the code either way; optimization just changes whether you see it.

v11: Victory

The solution was adding a per-pollset mutex:

if (timeout_ms == 0) {
    if (pthread_mutex_trylock(lock) != 0)
        return 0;  // Listener is active, skip
} else {
    pthread_mutex_lock(lock);  // Block and wait
}
// ... poll + delete loop ...
pthread_mutex_unlock(lock);

100 concurrent connections. 1,000 connections. -O3 at full blast. Zero crashes.

I did it. I ACTUALLY DID IT.
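For anyone who wants to play with the gating pattern outside MariaDB, here is a self-contained sketch of the same trylock-or-block idea. It assumes nothing about the real thread pool code: serialized_wait and fake_wait are hypothetical names, and fake_wait simply stands in for the poll + delete loop.

```c
#include <assert.h>
#include <pthread.h>

/* Runs wait_fn while holding the lock. Non-blocking callers
 * (timeout_ms == 0) give up at once if the lock is held; blocking callers
 * queue behind the current holder.
 * Returns wait_fn's result, or 0 when a non-blocking caller skipped. */
int serialized_wait(pthread_mutex_t *lock, int timeout_ms,
                    int (*wait_fn)(int timeout_ms))
{
    if (timeout_ms == 0) {
        if (pthread_mutex_trylock(lock) != 0)
            return 0;               /* someone else is polling: skip */
    } else {
        pthread_mutex_lock(lock);   /* block until it is our turn */
    }

    int events = wait_fn(timeout_ms);

    pthread_mutex_unlock(lock);
    return events;
}

/* Hypothetical stand-in for the poll + delete loop: reports one event. */
int fake_wait(int timeout_ms)
{
    (void)timeout_ms;
    return 1;
}
```

A quick way to see the skip path: lock the mutex, call serialized_wait(&lock, 0, fake_wait), and it returns 0 without ever calling fake_wait. The design trade-off is that a worker occasionally skips a poll it could have done, which is cheap compared to two threads handling the same connection.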

Chapter 4: “-mcpu=power10? Sure, why not”

With the thread pool working, it was time to optimize for our hardware. We have POWER9 machines, and obviously -mcpu=power10 will generate better code with backward compatibility, right?

gcc -mcpu=power10 -mtune=power10 -O3 -o test_p10 test.c
./test_p10  # Runs on POWER9!

Great! Now let’s check the generated instructions:

gcc -mcpu=power10 -mtune=power10 -O3 -S -o test_p10.s test.c   # emit assembly
grep -c "pld\|pstd\|plwa\|paddi" test_p10.s
# Result: 0

Zero. ZERO POWER10 instructions.

I compared the assembly from -mcpu=power9 vs -mcpu=power10:

power9: 133 lines
power10: 132 lines  
diff: Minor reordering, NO power10-specific instructions

Turns out GCC’s XCOFF backend (AIX’s native binary format — think of it as AIX’s way of packaging programs, different from Linux’s ELF format) doesn’t support POWER10 extensions. The prefixed instructions (pld, pstd), PC-relative addressing… none of that exists in the XCOFF world.

__MMA__ is defined (Matrix-Multiply Assist), but only useful if you write explicit code to use matrix operations. __PCREL__ and __PREFIXED__… undefined.

Takeaway: On AIX with GCC, -mcpu=power10 produced effectively the same code as -mcpu=power9 in our test (minor reordering, no POWER10-specific instructions). Don’t waste time building separate binaries.
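If you want to repeat the macro check on your own toolchain, a tiny probe like the one below is enough. It only tests the three predefined macros named above; the enum values and the function name power10_macro_mask are invented for this sketch, and the result depends entirely on the compiler, target, and -mcpu flag you build with.

```c
#include <assert.h>

/* Bit flags for the macros discussed in the text (names are hypothetical). */
enum { HAS_MMA = 1, HAS_PCREL = 2, HAS_PREFIXED = 4 };

/* Returns a bitmask of which POWER10-related predefined macros the
 * compiler set for this translation unit. */
int power10_macro_mask(void)
{
    int mask = 0;
#ifdef __MMA__
    mask |= HAS_MMA;        /* Matrix-Multiply Assist available */
#endif
#ifdef __PCREL__
    mask |= HAS_PCREL;      /* PC-relative addressing enabled */
#endif
#ifdef __PREFIXED__
    mask |= HAS_PREFIXED;   /* prefixed instructions enabled */
#endif
    return mask;
}
```

Going by the observations above, a GCC build on AIX with -mcpu=power10 should come back with only the HAS_MMA bit set. On a non-POWER machine the mask is simply 0.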

Chapter 5: The Toolchain Ecosystem

GNU Tools on AIX

AIX’s native toolchain maintains strict POSIX compatibility, which is actually a feature for portable code. But for those of us who’ve spent 20 years assuming sed -i exists everywhere, it requires some adjustment.

That’s why at LibrePower we maintain linux-compat, a collection of GNU tools packaged for AIX that makes modern development practical:

# With linux-compat from LibrePower, all of this works:
sed -i 's/foo/bar/' file.txt   # ✓
grep -A 5 "pattern" file       # ✓
seq 1 100                      # ✓
date +%s%N                     # ✓ (nanoseconds!)

Is it different from Linux? Sure. But you know what AIX does have? Decades of stability, rock-solid process scheduling, memory management tuned for high-throughput database workloads, and an I/O subsystem that Oracle, Informix, and DB2 engineers have been optimizing for since before some of today’s developers were born.

I’ll take “need to install GNU tools” over “random OOM killer in production” any day.

The Compiler Breakthrough: Open XL 17.1.3

This is where the story takes a dramatic turn.

IBM has xlC, their traditional compiler, and more recently Open XL C/C++, based on LLVM/Clang. We installed Open XL 17.1.3 (built on Clang 19.1.2) to evaluate it as an alternative to GCC 13.3.0.

The result: 3x faster. Not 3% — three times.

| Compiler       | 30-query batch (shared LPAR)    | Per-query avg |
|----------------|---------------------------------|---------------|
| GCC 13.3.0     | ~190ms (high variance)          | ~6.3ms        |
| Open XL 17.1.3 | ~63ms (virtually zero variance) | ~2.1ms        |

Same hardware. Same flags (-O3 -mcpu=power9). Same code. Three times faster, with almost no run-to-run variance compared to GCC’s 10-40% variation.

Why such a difference?

GCC LTO (Link-Time Optimization) is impossible on AIX. GCC’s LTO implementation supports ELF targets plus Windows and Darwin; AIX uses XCOFF, which isn’t on that list. The GCC configure script explicitly blocks it:

# Apart from ELF platforms, only Windows and Darwin support LTO so far.
if test x"$enable_lto" = x"yes"; then
  as_fn_error "LTO support is not enabled for this target."

This isn’t a missing package or build flag — it’s a fundamental architectural limitation.

Open XL, being LLVM-based, has much better code generation for POWER9 out of the box. The improvement comes from superior instruction scheduling, better register allocation, and optimization passes that GCC simply doesn’t have for this target.

The LTO Irony

We tried Open XL with -flto=full. It builds! On XCOFF! But… it’s 27% slower than non-LTO Open XL.

Why? AIX shared libraries require an explicit export list (exports.exp). With LTO, the linker sees ~27,000 symbols it must export. LTO’s main benefit is internalizing functions and eliminating dead code — but when you’re forced to export 27K symbols, that benefit evaporates. The additional LTO overhead actually makes it worse.

The PGO Dead End

Profile-Guided Optimization doesn’t work either. We tried everything:

  • The profiling runtime symbols (__llvm_prf_*) are local symbols that don’t get exported
  • Manually linking libclang_rt.profile-powerpc64.a and forcing -u__llvm_profile_runtime
  • Simple test cases work, but MariaDB’s complex shared library setup defeats the approach

The LLVM profiling runtime uses atexit() to write profiles. On AIX with XCOFF, shared library destructors have different semantics. The profiles never get written.

Bottom line: Open XL 17.1.3 without LTO or PGO gives us 3x improvement. That’s the pragmatic win for now.

The LPAR Mode Surprise: Shared Beats Dedicated

Here’s a counterintuitive finding that surprised us.

An LPAR (Logical Partition) is the virtualization unit on IBM Power servers (PowerVM provides it, rather than AIX itself). It lets you slice a physical server into multiple isolated virtual machines. You can configure these partitions with “shared” processors (flexible, borrowed from a pool) or “dedicated” processors (reserved just for you).

We tested MariaDB vector search on different LPAR configurations:

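| LPAR Mode                             | 30-query batch (Open XL)       | Per-query avg |
|---------------------------------------|--------------------------------|---------------|
| Shared (12 vCPUs)                     | ~63ms                          | ~2.1ms        |
| Dedicated-Capped (12 cores, 2750 MHz) | ~82ms                          | ~2.7ms        |
| Dedicated-Donating                    | ~85ms (Open XL), ~325ms (GCC!) | varies wildly |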

Shared LPAR is about 25% faster than dedicated for single-threaded vector queries.

Why? Two reasons:

  1. Workload Optimized Frequency (WoF): Think of it like a car with turbo boost. In shared mode with low utilization, POWER9 can boost single-core frequency up to ~3.8 GHz — like hitting the gas on an empty highway. Dedicated-Capped is fixed at 2750 MHz — like cruise control locked at 55 mph.
  2. Cycle borrowing: Shared mode can borrow idle cycles from the shared processor pool. Dedicated-Capped is strictly limited to allocated capacity — you only get what you paid for, even if your neighbors aren’t using theirs.

The Donating Mode Disaster

Dedicated-Donating mode is supposed to donate unused cycles back to the shared pool. In theory, you get dedicated processors when you need them and contribute to the pool when idle.

In practice, for bursty single-threaded workloads like vector search: 60-70% performance regression with GCC.

The cycle reclaim overhead is devastating. Every time the workload bursts, there’s latency reclaiming donated cycles. With GCC’s already higher variance, this creates terrible performance.

Open XL handles it better (only minor regression), but there’s no reason to use Donating mode for this workload.

Recommendation:

  • Single-threaded, latency-sensitive queries (like vector search): Use Shared LPAR — frequency boosting wins
  • Multi-threaded, throughput-focused workloads (like bulk OLTP): Use Dedicated-Capped — consistent performance matters more

This finding applies to any similar workload on POWER9 — not just MariaDB.

The -blibpath Thing (Actually a Feature)

One genuine AIX characteristic: you need to explicitly specify the library path at link time with -Wl,-blibpath:/your/path. If you don’t, the binary won’t find libstdc++ even if it’s in the same directory.

At first this seems annoying. Then you realize: AIX prefers explicit, deterministic paths over implicit searches. In production environments where “it worked on my machine” isn’t acceptable, that’s a feature, not a bug.

Chapter 6: The Numbers (Real Progress, Real Gaps)

After all this work, where do we actually stand? Let me be completely transparent.

What’s Solid

The RPM is published at aix.librepower.org and deployed on an IBM POWER9 system (12 cores, SMT-8). MariaDB 11.8.5 runs on AIX 7.3 with thread pool enabled. The server passed a brutal QA suite:

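| Test                       | Result |
|----------------------------|--------|
| 100 concurrent connections | Passed |
| 500 concurrent connections | Passed |
| 1,000 connections          | Passed |
| 30 minutes sustained load  | Passed |
| 11+ million queries        | Passed |
| Memory leaks               | ZERO   |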

1,648,482,400 bytes of memory — constant across 30 minutes. Not a single byte of drift. The server ran for 39 minutes under continuous load and performed a clean shutdown.

It works. It’s stable. It’s production-ready for functionality.

The Performance Journey

Here’s the complete picture, from baseline to current state:

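| Configuration                          | Mixed 100 clients | vs. Baseline |
|----------------------------------------|-------------------|--------------|
| Original -O2 one-thread-per-connection | 11.34s            | (baseline)   |
| -O3 + pool-of-threads v11 (GCC)        | 1.96s             | 83% faster   |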

The thread pool work delivered massive gains for concurrent workloads. For vector search specifically, Open XL delivers an additional 3x improvement over GCC.

The Gap vs Linux (Plot Twist: There Isn’t One)

Here’s the question everyone asks: “How does it compare to Linux?”

For vector search (MHNSW) — a workload that stresses memory access patterns — here’s what we measured on identical POWER9 hardware:

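| Platform                  | Cores        | 30 Queries |
|---------------------------|--------------|------------|
| Linux POWER9              | 24 dedicated | 0.057s     |
| AIX + Open XL (shared)    | 12 vCPUs     | 0.063s     |
| AIX + Open XL (dedicated) | 21 cores     | 0.067s     |
| AIX + GCC 13.3.0          | 12 vCPUs     | 0.190s     |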

Wait — the AIX system has 21 cores vs Linux’s 24 (12.5% fewer). And the measured difference? 10-18%.

That’s not a performance gap. That’s a hardware difference.

With Open XL, AIX delivers the same per-core performance as Linux. The “AIX is slow” myth? Completely debunked. The difference was the compiler all along.
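If you’d like to check that arithmetic yourself, the two percentages come from plain ratios over the table values. The helper names below are made up for this sketch, and the inputs are only the numbers already quoted above.

```c
#include <assert.h>

/* How much larger `value` is than `baseline`, in percent. */
double percent_more(double value, double baseline)
{
    return (value - baseline) / baseline * 100.0;
}

/* How much smaller `value` is than `baseline`, in percent. */
double percent_fewer(double value, double baseline)
{
    return (baseline - value) / baseline * 100.0;
}
```

Plugging in the table: percent_fewer(21, 24) gives 12.5 (the core deficit), percent_more(0.063, 0.057) gives roughly 10.5, and percent_more(0.067, 0.057) gives roughly 17.5, which is where the 10-18% range comes from.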

What We Tried (The Failure Museum)

Science isn’t just about what works — it’s about documenting what doesn’t. Here’s our wall of “nice try, but no”:

| Attempt                | Result       | Why                                          |
|------------------------|--------------|----------------------------------------------|
| GCC LTO                | Impossible   | XCOFF format; GCC LTO requires ELF           |
| Open XL LTO            | 27% slower   | 27K forced exports negate LTO benefits       |
| Open XL PGO            | Doesn’t work | Shared lib profiling runtime issues on XCOFF |
| -funroll-loops         | Worse        | I-cache bloat on POWER9                      |
| -ffast-math            | Broken       | IEEE violations corrupt bloom filter hashing |
| POWER VSX bloom filter | 41% slower   | No 64-bit vector multiply on POWER9          |
| Software prefetch      | No effect    | Hypervisor evicts prefetched data            |
| DSCR tuning            | Blocked      | Hypervisor controls DSCR in shared LPAR      |

The VSX result is particularly interesting: we implemented a SIMD bloom filter using POWER’s vector extensions. It was 41% slower than scalar. POWER9 has no 64-bit vector multiply — you need vec_extract → scalar multiply → vec_insert for each lane, which is slower than letting the Out-of-Order engine handle a scalar loop.

What Actually Helped

| Fix                        | Impact                                 |
|----------------------------|----------------------------------------|
| Thread pool (pollset v11)  | 72-87% improvement at high concurrency |
| -O3 vs -O2                 | 11-29% improvement                     |
| mhnsw_max_cache_size = 4GB | 5x improvement for vector search       |
| LDR_CNTRL 64K pages        | Reduced TLB misses                     |
| Open XL 17.1.3             | 3x faster than GCC                     |
| Shared LPAR (vs dedicated) | ~25% faster for single-threaded queries |

This is Part 1. The work continues.


What I Learned

  1. CMake assumes Linux. On non-Linux systems, manually verify that feature detection is correct. False positives will bite you at runtime.
  2. Level-triggered I/O requires discipline. EPOLLONESHOT exists for a reason. If your system doesn’t have it, prepare to implement your own serialization.
  3. -O3 exposes latent bugs. If your code “works with -O2 but not -O3,” you have a race condition. The compiler is doing its job; the bug is yours.
  4. Mutexes are your friend. Yes, they have overhead. But you know what has more overhead? Debugging race conditions at 3 AM.
  5. AIX rewards deep understanding. It’s a system that doesn’t forgive shortcuts, but once you understand its conventions, it’s predictable and robust. There’s a reason banks still run it — and will continue to for the foreseeable future.
  6. The ecosystem matters. Projects like linux-compat from LibrePower make modern development viable on AIX. Contributing to that ecosystem benefits everyone.
  7. Optimization is a journey, not a destination. We got thread pool working. We got -O3 stable. But there’s more performance to unlock, and we’re not done yet.

What’s Next (Spoiler: Part 2 Is Ready)

The RPMs are published at aix.librepower.org. The GCC build is stable and production-ready for functionality. The Open XL build delivers 3x better performance but requires a commercial license for production use.

In Part 2, I’ll cover:

  • How we closed a 23x performance gap to 0%
  • The configuration fixes that gave us 65x improvement
  • The LPAR configuration deep dive
  • The complete “Failure Museum” of things that didn’t work

TL;DR

  • MariaDB 11.8.5 now runs on AIX 7.3 with thread pool enabled
  • First-ever thread pool implementation for AIX using pollset (11 iterations to get ONESHOT simulation right)
  • IBM Open XL C/C++ 17.1.3 delivers 3x speedup over GCC 13.3.0
  • GCC LTO is impossible on AIX (XCOFF vs ELF); Open XL LTO is counterproductive (27% slower due to forced exports)
  • Shared LPAR beats Dedicated for single-threaded queries (WoF frequency boosting: 3.8 GHz vs 2750 MHz)
  • Donating mode is disastrous: 60-70% regression with GCC — use Capped mode
  • POWER VSX bloom filter was 41% slower than scalar (no 64-bit vector multiply on POWER9)

Questions? Ideas? Want to contribute to the AIX open source ecosystem?

This work is part of LibrePower — Open source software for IBM Power Systems

AIX project repository: gitlab.com/librepower/aix

SIXE