We just made LLM inference a sudo apt install on IBM POWER


LibrePower · Linux on Power · March 2026

vLLM on IBM POWER: LLM inference without a GPU

The first pre-built vLLM package for Linux ppc64le. Not from IBM. Not from Canonical. From the community — and it runs on hardware you already own.

March 2026 · 18 min read


If you run IBM POWER systems, you know the drill. The hardware is extraordinary — POWER9 and POWER10 deliver unmatched RAS, memory bandwidth, and per-core throughput. But when it comes to the AI/ML ecosystem, you’ve historically had two options: bring your own GPUs (usually x86), or go through Red Hat OpenShift AI. Today, there’s a third option for vLLM on IBM POWER. One that takes about 30 seconds, works on hardware you already own, and uses MMA hardware acceleration automatically.

The package

What we built: vLLM on IBM POWER as a .deb package

vLLM is the most popular open-source LLM serving engine. It powers inference at companies running millions of requests per day. It supports the full OpenAI API: /v1/chat/completions, /v1/completions, /v1/models — streaming, function calling, tool use, the works.

The problem? There were no pre-built packages for ppc64le. Not on PyPI. Not in Ubuntu’s repos. Not from IBM. If you wanted vLLM on IBM POWER, you were on your own — figuring out build dependencies, patching C++ extensions, compiling PyTorch from source. IBM’s own community has documented how complex the manual setup is.

So we compiled it. On real IBM POWER hardware. Optimized for the architecture. And packaged it as a .deb that APT can install with full dependency resolution.

$ apt-cache show python3-vllm

Package: python3-vllm
Version: 0.9.2-1
Architecture: ppc64el
Maintainer: LibrePower <packages@librepower.org>
Depends: python3 (>= 3.10), python3-numpy, python3-requests
Homepage: https://linux.librepower.org
Description: OpenAI-compatible LLM inference server for ppc64le

Ubuntu on IBM POWER

Running Ubuntu on IBM POWER is a key enabler for this workflow. SIXE deploys and supports Ubuntu ppc64le environments in partnership with Canonical — the same infrastructure that makes this apt-based installation possible.

Under the hood

The journey: from source to package

Getting vLLM to run on POWER is not a trivial pip install. Here’s what was involved.

PyTorch on POWER

vLLM depends on PyTorch, which is not distributed for ppc64le via PyPI. IBM publishes wheels at wheels.developerfirst.ibm.com — we leverage those as the base. See the full list of IBM-supported developer tools for POWER.

The C extension

vLLM’s performance-critical path is a C++ extension (_C.abi3.so) that handles attention, caching, activation functions, and quantization. This needs to be compiled from source with CMake, linking against PyTorch’s C++ API and oneDNN for optimized GEMM operations.

-- PowerPC detected
-- CPU extension compile flags: -fopenmp -DVLLM_CPU_EXTENSION
   -mvsx -mcpu=power9 -mtune=power9
-- CPU extension source files: csrc/cpu/quant.cpp csrc/cpu/activation.cpp
   csrc/cpu/attention.cpp csrc/cpu/cache.cpp csrc/cpu/utils.cpp
   csrc/cpu/layernorm.cpp csrc/cpu/pos_encoding.cpp
[100%] Linking CXX shared module _C.abi3.so
[100%] Built target _C

The resulting binary includes oneDNN with PPC64 GEMM kernels — the same math library that Intel uses for x86, but targeting POWER’s vector units.

Dependency resolution

The Python ecosystem on ppc64le has gaps. Some packages have pre-built wheels, others need compilation from source, and a few have version conflicts. We resolved all of this so you don’t have to.

In practice

Running LLM inference on IBM POWER: code and output

Here’s what it looks like in practice. First, install the package:

# Add the LibrePower APT repository
curl -fsSL https://linux.librepower.org/install.sh | sudo sh

# Install vLLM for ppc64le
sudo apt update
sudo apt install python3-vllm

# Install PyTorch wheels from IBM
pip3 install torch --extra-index-url \
  https://wheels.developerfirst.ibm.com/ppc64le/linux

Then run inference from Python:

# Python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    dtype="bfloat16",
    device="cpu",
    enforce_eager=True
)

output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(temperature=0, max_tokens=100)
)

print(output[0].outputs[0].text)

But vLLM’s real value is the OpenAI-compatible server mode — this is what makes it useful for production:

# Start the OpenAI-compatible inference server
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --device cpu --dtype bfloat16 --port 8000

# Query it from another shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "What is IBM POWER?"}],
    "max_tokens": 100
  }'

LangChain, LlamaIndex, Open WebUI, Continue.dev — anything that can point to an OpenAI endpoint works out of the box. Change base_url to your POWER server and you’re done. This is what makes CPU inference on IBM POWER a realistic path to private, self-hosted LLM deployment.
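To make the "any OpenAI client works" point concrete, here is a minimal stdlib-only sketch that builds the same chat request as the curl call above — the hostname `power-server` is a placeholder for your own machine:

```python
import json
from urllib.request import Request, urlopen

def chat_request(base_url: str, prompt: str) -> Request:
    """Build an OpenAI-style chat completion request for a vLLM server."""
    payload = {
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server from the previous section running:
# reply = json.load(urlopen(chat_request("http://power-server:8000",
#                                        "What is IBM POWER?")))
# print(reply["choices"][0]["message"]["content"])
```

Anything that speaks this wire format — SDKs, frameworks, plain curl — is interchangeable, which is the whole point of the OpenAI-compatible server mode.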

The numbers

Performance: real numbers on POWER9 and POWER10

We benchmarked on both POWER generations with Qwen2.5-0.5B-Instruct (494M parameters, BF16). These are not theoretical numbers — they come from running the benchmark tool on actual hardware.

POWER9

$ OMP_NUM_THREADS=12 python3 bench_vllm.py
Run 1: 17.8 tok/s (100 tokens in 5.6s)
Run 2: 16.7 tok/s (100 tokens in 6.0s)
Run 3: 18.5 tok/s (100 tokens in 5.4s)
BENCH P9: threads=12 avg=17.6 tok/s

12 threads is optimal — more threads add cache contention on this memory-bandwidth-bound workload.

POWER10

$ OMP_NUM_THREADS=1 python3 bench_vllm.py
Run 1: 13.9 tok/s (100 tokens in 7.2s)
BENCH P10: threads=1 avg=13.9 tok/s

13.9 tok/s from a single POWER10 core. For context, the POWER9 result uses 12 threads across multiple cores to achieve 17.6 tok/s. The per-core efficiency improvement from POWER9 to POWER10 is dramatic, driven by MMA hardware acceleration.

System     Threads   tok/s   Per-core efficiency
POWER10    1         13.9    13.9 tok/s/core
POWER9     12        17.6    1.5 tok/s/core
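The per-core column is just throughput divided by thread count, which makes the generational gap easy to quantify:

```python
# Per-core throughput from the benchmark numbers above
p9_per_core = 17.6 / 12    # POWER9: 12 threads -> ~1.47 tok/s per core
p10_per_core = 13.9 / 1    # POWER10: a single core

print(round(p10_per_core / p9_per_core, 1))  # ~9.5x per-core advantage
```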

This isn’t competing with an A100 — it’s filling a completely different gap: running LLM inference on IBM POWER infrastructure you already own. No GPU budget, no PCIe slots, no driver headaches. For organizations with existing POWER9 or POWER10 servers, this is a zero-capital path to private AI.

We also tested Qwen2.5-7B-Instruct (7 billion parameters) on a single POWER10 core — it loaded and ran at 1.0 tok/s. Not fast enough for interactive use on one core, but proof that larger models work; with more cores, throughput scales up. Attendees of SIXE’s IBM POWER training courses often ask about running AI workloads on existing hardware — these numbers are the answer.

Inside the machine

What actually happens when a POWER10 runs an LLM

If you’ve seen IBM’s presentations about AI on POWER, you’ve probably encountered terms like MMA, Spyre, oneDNN, and OpenShift AI. They’re often shown together on the same slide. But what do they actually mean? And which ones are active when you run python3 -m vllm?

We went deep into the software stack to answer this. The findings surprised us.

A quick glossary (no jargon left behind)

  • LLM (large language model) — Software that generates text — ChatGPT, Llama, Qwen. A mathematical model with billions of numbers that predicts the next word.
  • Inference — Running a trained model to get answers. Training teaches the model; inference uses it. This article is entirely about inference.
  • Token — A word or piece of a word. “17.6 tokens per second” means roughly 17–18 words per second.
  • BF16 (bfloat16) — A way to store numbers using 16 bits instead of 32. Half the memory, nearly the same precision. Think: “good enough quality at half the storage cost.”
  • GEMM (general matrix multiply) — The core math operation in neural networks. Most compute time in LLM inference is spent multiplying large matrices.
  • MMA (matrix-multiply accumulate) — Special-purpose circuitry inside POWER10 designed to accelerate matrix math. Like a dedicated calculator for the one specific operation that dominates LLM inference.
  • OpenBLAS — An open-source math library with optimized GEMM routines. The engine that does the actual matrix multiplication on POWER.
  • oneDNN — Intel’s math library, also compiled into vLLM. Another engine for the same purpose.
  • PyTorch — The framework that runs the neural network. It calls OpenBLAS or oneDNN for the heavy math.

How the pieces fit together

When vLLM generates a token, here’s the exact path through the machine:

  1. You type a question.
  2. vLLM receives it and breaks it into tokens.
  3. PyTorch runs the neural network math.
  4. For each layer, it multiplies large matrices (GEMM).
  5. PyTorch asks OpenBLAS: “multiply these two BF16 matrices”.
  6. OpenBLAS runs sbgemm_kernel_power10 ← THIS USES MMA.
  7. POWER10 hardware executes MMA instructions.
  8. The result flows back up, and the next token is chosen.
  9. You see the next word appear.

MMA acceleration is already active in our benchmarks. It’s not a future feature or a configuration flag — it works right now, through the path PyTorch → OpenBLAS → MMA hardware. No special setup required.

Proving it: BF16 vs FP32 on POWER10

On POWER10, MMA accelerates BF16 math. On POWER9 (no MMA), BF16 is actually slower than FP32 due to software emulation. If MMA is working, BF16 should be faster:

# Raw matrix multiplication benchmark (1024×1024) on POWER10
BF16: 384.4 GFLOPS  (5.6 ms)
FP32: 249.6 GFLOPS  (8.6 ms)
BF16/FP32 ratio: 1.54x

BF16 is 1.54× faster than FP32. MMA is active and delivering measurable hardware acceleration. Our 13.9 tok/s on a single POWER10 core already includes MMA. That’s the real, hardware-accelerated number. The power of POWER10’s AI acceleration capabilities is something we cover in depth in our Linux on IBM POWER Systems training courses.
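The raw GEMM comparison above can be reproduced with a few lines of PyTorch — a sketch only; absolute GFLOPS depend on your core, thread count, and BLAS build:

```python
import time
import torch

def gemm_gflops(dtype, n=1024, iters=20):
    """Time an n x n matrix multiply and return achieved GFLOPS."""
    a = torch.randn(n, n).to(dtype)
    b = torch.randn(n, n).to(dtype)
    torch.matmul(a, b)                      # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    secs = (time.perf_counter() - t0) / iters
    return 2 * n**3 / secs / 1e9            # one GEMM is ~2·n³ FLOPs

print(f"BF16: {gemm_gflops(torch.bfloat16):.1f} GFLOPS")
print(f"FP32: {gemm_gflops(torch.float32):.1f} GFLOPS")
```

On a POWER10 core with MMA-enabled OpenBLAS, the BF16 line should come out clearly ahead; on POWER9, expect the opposite.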

The oneDNN investigation (and what we learned)

We initially thought there might be hidden performance left on the table.

The vLLM build bundles oneDNN (originally from Intel). Inside, there are two POWER-specific math paths:

  • int8 GEMM: A hand-written kernel by IBM engineers using MMA instructions for quantized models.
  • BF16 GEMM: A passthrough to OpenBLAS — but only when compiled with specific flags.

Our initial build didn’t have those flags. We recompiled with -DDNNL_BLAS_VENDOR=OPENBLAS, confirmed the flags were active, benchmarked again — same performance.

Why? PyTorch was already going directly to OpenBLAS, bypassing oneDNN for the main matrix operations. The optimization was already there; we just didn’t know it.

Practical takeaway: You don’t need to configure anything special. PyTorch on POWER10 with OpenBLAS automatically uses MMA for BF16 inference. Install the package and run.
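If you want to verify which math backend your own PyTorch build links against, its compile-time configuration is one line away:

```python
import torch

# Prints compile-time configuration: BLAS vendor, OpenMP, ISA flags, etc.
print(torch.__config__.show())
```

On a POWER build backed by OpenBLAS, the BLAS line in that output is the tell.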

What about IBM Spyre?

IBM Spyre is a dedicated AI accelerator card for POWER — a completely separate piece of hardware with its own silicon for AI math. Think of it this way:

  • MMA = built-in acceleration inside every POWER10 core (active right now in our benchmarks)
  • Spyre = a separate AI accelerator card you add to the system (promising, but requires specific IBM software stacks)

Our work focuses on what’s available today using the CPU already in your machine, with zero additional hardware investment.

The complete picture

Technology            What it is (plain English)                     Active in our build?
POWER10 MMA (BF16)    Built-in matrix accelerator in the CPU         Yes — PyTorch → OpenBLAS
POWER10 MMA (int8)    Same hardware, for 8-bit quantized models      Built, not end-to-end yet
IBM Spyre             Separate AI accelerator card                   No — different hardware
OpenShift AI          Full ML platform on Kubernetes                 No — we’re the lightweight path
oneDNN                Math library bundled with vLLM                 Compiled in, bypassed by PyTorch
OpenBLAS              Math library with hand-tuned POWER10 kernels   Yes — the real workhorse

Context

The bigger picture: LLM inference on IBM POWER without OpenShift

Red Hat OpenShift AI

Until now, the official IBM/Red Hat play for LLM inference on IBM POWER was OpenShift AI. It supports notebooks, pipelines, model training, serving, and monitoring. As of version 3.0, it runs on ppc64le with CPU-only workloads.

OpenShift AI is the right choice for organizations that already have OpenShift clusters. It comes with RBAC, InstructLab for model fine-tuning, and enterprise support.

But it requires OpenShift. A Kubernetes cluster, a Red Hat subscription, operator management. For many POWER shops — especially those running standalone Linux or mixed AIX/Linux — that’s a significant commitment just to serve a model. Organizations managing these environments often rely on SIXE’s IBM POWER maintenance and support services to keep them running.

What LibrePower adds

We’re not replacing OpenShift AI. We’re complementing it with a lighter path for the many POWER sites that don’t need the full platform.

                         OpenShift AI                    LibrePower vLLM
Install                  OpenShift cluster + operators   apt install python3-vllm
Infrastructure           Kubernetes required             Any Ubuntu/Debian ppc64le
Scope                    Full ML lifecycle               Inference serving only
Support                  Red Hat subscription            Community (open source)
GPU                      Supported (x86)                 CPU-only (POWER native)
Time to first inference  Hours to days                   Minutes
Cost                     OpenShift licensing             Free

IBM builds the highway — world-class hardware, PyTorch wheels, OpenShift AI, InstructLab. LibrePower adds an on-ramp for people who don’t need the full platform. Both are needed. IBM’s roadmap for AI on IBM POWER is moving fast, and community tooling like this fills real gaps in the ecosystem today.

The infrastructure

How the LibrePower package repository works

We built linux.librepower.org following the same pattern as our AIX package repository — infrastructure that already serves 30+ open-source packages to AIX systems worldwide.

linux.librepower.org/
  dists/jammy/
    InRelease          (GPG signed)
    Release
    main/binary-ppc64el/
      Packages
  pool/main/
    python3-vllm_0.9.2-1_ppc64el.deb
  install.sh
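For reference, what install.sh sets up is, in essence, a standard APT source entry. A hand-written equivalent would look roughly like this — the keyring path is an assumption; the script may place it elsewhere:

```shell
# /etc/apt/sources.list.d/librepower.list (sketch — keyring path assumed)
deb [arch=ppc64el signed-by=/usr/share/keyrings/librepower.gpg] https://linux.librepower.org jammy main
```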

CI/CD runs on GitLab: every push regenerates APT metadata and deploys automatically. All packages compiled on real IBM POWER hardware — not cross-compiled, not emulated. The full source is on GitLab under Apache 2.0.

Roadmap

What’s next for vLLM on IBM POWER

  • More models tested — Llama, Mistral, Phi, Granite. Systematic benchmarks across model families.
  • llama.cpp for ppc64le — GGUF quantized models for even lower memory footprint. Already shipping for AIX.
  • Ubuntu 24.04 and Debian 12 support — Extending the package to the latest LTS releases.
  • POWER10-optimized variants — Going deeper into MMA tuning. Our current 13.9 tok/s per core is a starting point, not a ceiling.
  • int8 GEMM end-to-end — Completing the MMA path for quantized models, which should improve throughput further.

Want to run AI workloads on your IBM POWER infrastructure?

SIXE helps organizations deploy and operate Linux on IBM POWER — from official IBM Linux on Power training to full infrastructure support. If you’re evaluating LLM inference on existing POWER hardware, talk to us.

Got a ppc64le system?

Try vLLM on IBM POWER now

If you have a system running Ubuntu, it’s three commands. Source is on GitLab if you want to dig in or contribute. IBM POWER training and infrastructure support by SIXE.

# Add the LibrePower repository
curl -fsSL https://linux.librepower.org/install.sh | sudo sh

# Install vLLM for ppc64le
sudo apt update && sudo apt install python3-vllm

# Install PyTorch (IBM wheels)
pip3 install torch --extra-index-url \
  https://wheels.developerfirst.ibm.com/ppc64le/linux

# Run the OpenAI-compatible inference server
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --device cpu --dtype bfloat16 --port 8000