We just made LLM inference a sudo apt install on IBM POWER


LibrePower · Linux on Power · March 2026

vLLM on IBM POWER: LLM inference without a GPU

The first pre-built vLLM package for Linux ppc64le. Not from IBM. Not from Canonical. From the community — and it runs on hardware you already own.

March 2026 · 18 min read


If you run IBM POWER systems, you know the drill. The hardware is extraordinary — POWER9 and POWER10 deliver unmatched RAS, memory bandwidth, and per-core throughput. But when it comes to the AI/ML ecosystem, you’ve historically had two options: bring your own GPUs (usually x86), or go through Red Hat OpenShift AI. Today, there’s a third option for vLLM on IBM POWER. One that takes about 30 seconds, works on hardware you already own, and uses MMA hardware acceleration automatically.

The package

What we built: vLLM on IBM POWER as a .deb package

vLLM is the most popular open-source LLM serving engine. It powers inference at companies running millions of requests per day. It supports the full OpenAI API: /v1/chat/completions, /v1/completions, /v1/models — streaming, function calling, tool use, the works.

The problem? There were no pre-built packages for ppc64le. Not on PyPI. Not in Ubuntu’s repos. Not from IBM. If you wanted vLLM on IBM POWER, you were on your own — figuring out build dependencies, patching C++ extensions, compiling PyTorch from source. IBM’s own community has documented how complex the manual setup is.

So we compiled it. On real IBM POWER hardware. Optimized for the architecture. And packaged it as a .deb that APT can install with full dependency resolution.

$ apt-cache show python3-vllm

Package: python3-vllm
Version: 0.9.2-1
Architecture: ppc64el
Maintainer: LibrePower <packages@librepower.org>
Depends: python3 (>= 3.10), python3-numpy, python3-requests
Homepage: https://linux.librepower.org
Description: OpenAI-compatible LLM inference server for ppc64le

Ubuntu on IBM POWER

Running Ubuntu on IBM POWER is a key enabler for this workflow. SIXE deploys and supports Ubuntu ppc64le environments in partnership with Canonical — the same infrastructure that makes this apt-based installation possible.

Under the hood

The journey: from source to package

Getting vLLM to run on POWER is not a trivial pip install. Here’s what was involved.

PyTorch on POWER

vLLM depends on PyTorch, which is not distributed for ppc64le via PyPI. IBM publishes wheels at wheels.developerfirst.ibm.com — we leverage those as the base. See the full list of IBM-supported developer tools for POWER.

The C extension

vLLM’s performance-critical path is a C++ extension (_C.abi3.so) that handles attention, caching, activation functions, and quantization. This needs to be compiled from source with CMake, linking against PyTorch’s C++ API and oneDNN for optimized GEMM operations.

-- PowerPC detected
-- CPU extension compile flags: -fopenmp -DVLLM_CPU_EXTENSION
   -mvsx -mcpu=power9 -mtune=power9
-- CPU extension source files: csrc/cpu/quant.cpp csrc/cpu/activation.cpp
   csrc/cpu/attention.cpp csrc/cpu/cache.cpp csrc/cpu/utils.cpp
   csrc/cpu/layernorm.cpp csrc/cpu/pos_encoding.cpp
[100%] Linking CXX shared module _C.abi3.so
[100%] Built target _C

The resulting binary includes oneDNN with PPC64 GEMM kernels — the same math library that Intel uses for x86, but targeting POWER’s vector units.

Dependency resolution

The Python ecosystem on ppc64le has gaps. Some packages have pre-built wheels, others need compilation from source, and a few have version conflicts. We resolved all of this so you don’t have to.

In practice

Running LLM inference on IBM POWER: code and output

Here’s what it looks like in practice. First, install the package:

# Add the LibrePower APT repository
curl -fsSL https://linux.librepower.org/install.sh | sudo sh

# Install vLLM for ppc64le
sudo apt update
sudo apt install python3-vllm

# Install PyTorch wheels from IBM
pip3 install torch --extra-index-url \
  https://wheels.developerfirst.ibm.com/ppc64le/linux

Then run inference from Python:

# Python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    dtype="bfloat16",
    device="cpu",
    enforce_eager=True
)

output = llm.generate(
    ["Explain quantum computing in simple terms."],
    SamplingParams(temperature=0, max_tokens=100)
)

print(output[0].outputs[0].text)

But vLLM’s real value is the OpenAI-compatible server mode — this is what makes it useful for production:

# Start the OpenAI-compatible inference server
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --device cpu --dtype bfloat16 --port 8000

# Query it from another shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "What is IBM POWER?"}],
    "max_tokens": 100
  }'

LangChain, LlamaIndex, Open WebUI, Continue.dev — anything that can point to an OpenAI endpoint works out of the box. Change base_url to your POWER server and you’re done. This is what makes CPU inference on IBM POWER a realistic path to private, self-hosted LLM deployment.
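To make the "any OpenAI client works" point concrete, here is a minimal stdlib-only sketch that builds the same chat request as the curl call above — the hostname `power-server` is a placeholder for your own machine:

```python
import json
from urllib.request import Request, urlopen

def chat_request(base_url: str, prompt: str) -> Request:
    """Build an OpenAI-style chat completion request for a vLLM server."""
    payload = {
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server from the previous section running:
# reply = json.load(urlopen(chat_request("http://power-server:8000",
#                                        "What is IBM POWER?")))
# print(reply["choices"][0]["message"]["content"])
```

Anything that speaks this wire format — SDKs, frameworks, plain curl — is interchangeable, which is the whole point of the OpenAI-compatible server mode.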

The numbers

Performance: real numbers on POWER9 and POWER10

We benchmarked on both POWER generations with Qwen2.5-0.5B-Instruct (494M parameters, BF16). These are not theoretical numbers — they come from running the benchmark tool on actual hardware.

POWER9

$ OMP_NUM_THREADS=12 python3 bench_vllm.py
Run 1: 17.8 tok/s (100 tokens in 5.6s)
Run 2: 16.7 tok/s (100 tokens in 6.0s)
Run 3: 18.5 tok/s (100 tokens in 5.4s)
BENCH P9: threads=12 avg=17.6 tok/s

12 threads is optimal — more threads add cache contention on this memory-bandwidth-bound workload.

POWER10

$ OMP_NUM_THREADS=1 python3 bench_vllm.py
Run 1: 13.9 tok/s (100 tokens in 7.2s)
BENCH P10: threads=1 avg=13.9 tok/s

13.9 tok/s from a single POWER10 core. For context, the POWER9 result uses 12 threads across multiple cores to achieve 17.6 tok/s. The per-core efficiency improvement from POWER9 to POWER10 is dramatic, driven by MMA hardware acceleration.

System     Threads   tok/s   Per-core efficiency
POWER10    1         13.9    13.9 tok/s/core
POWER9     12        17.6    1.5 tok/s/core
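The per-core column is just throughput divided by thread count, which makes the generational gap easy to quantify:

```python
# Per-core throughput from the benchmark numbers above
p9_per_core = 17.6 / 12    # POWER9: 12 threads -> ~1.47 tok/s per core
p10_per_core = 13.9 / 1    # POWER10: a single core

print(round(p10_per_core / p9_per_core, 1))  # ~9.5x per-core advantage
```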

This isn’t competing with an A100 — it’s filling a completely different gap: running LLM inference on IBM POWER infrastructure you already own. No GPU budget, no PCIe slots, no driver headaches. For organizations with existing POWER9 or POWER10 servers, this is a zero-capital path to private AI.

We also tested Qwen2.5-7B-Instruct (7 billion parameters) on a single POWER10 core — it loaded and ran at 1.0 tok/s. Not fast enough for interactive use on one core, but proof that larger models work; with more cores, throughput scales up. Attendees of SIXE’s IBM POWER training courses often ask about running AI workloads on existing hardware — these numbers are the answer.

Inside the machine

What actually happens when a POWER10 runs an LLM

If you’ve seen IBM’s presentations about AI on POWER, you’ve probably encountered terms like MMA, Spyre, oneDNN, and OpenShift AI. They’re often shown together on the same slide. But what do they actually mean? And which ones are active when you run python3 -m vllm?

We went deep into the software stack to answer this. The findings surprised us.

A quick glossary (no jargon left behind)

  • LLM (large language model) — Software that generates text — ChatGPT, Llama, Qwen. A mathematical model with billions of numbers that predicts the next word.
  • Inference — Running a trained model to get answers. Training teaches the model; inference uses it. This article is entirely about inference.
  • Token — A word or piece of a word. “17.6 tokens per second” means roughly 17–18 words per second.
  • BF16 (bfloat16) — A way to store numbers using 16 bits instead of 32. Half the memory, nearly the same precision. Think: “good enough quality at half the storage cost.”
  • GEMM (general matrix multiply) — The core math operation in neural networks. Most compute time in LLM inference is spent multiplying large matrices.
  • MMA (matrix-multiply accumulate) — Special-purpose circuitry inside POWER10 designed to accelerate matrix math. Like a dedicated calculator for the one specific operation that dominates LLM inference.
  • OpenBLAS — An open-source math library with optimized GEMM routines. The engine that does the actual matrix multiplication on POWER.
  • oneDNN — Intel’s math library, also compiled into vLLM. Another engine for the same purpose.
  • PyTorch — The framework that runs the neural network. It calls OpenBLAS or oneDNN for the heavy math.

How the pieces fit together

When vLLM generates a token, here’s the exact path through the machine:

  1. You type a question.
  2. vLLM receives it and breaks it into tokens.
  3. PyTorch runs the neural network math.
  4. For each layer, it multiplies large matrices (GEMM).
  5. PyTorch asks OpenBLAS: “multiply these two BF16 matrices”.
  6. OpenBLAS runs sbgemm_kernel_power10 ← THIS USES MMA.
  7. POWER10 hardware executes MMA instructions.
  8. The result flows back up, and the next token is chosen.
  9. You see the next word appear.

MMA acceleration is already active in our benchmarks. It’s not a future feature or a configuration flag — it works right now, through the path PyTorch → OpenBLAS → MMA hardware. No special setup required.

Proving it: BF16 vs FP32 on POWER10

On POWER10, MMA accelerates BF16 math. On POWER9 (no MMA), BF16 is actually slower than FP32 due to software emulation. If MMA is working, BF16 should be faster:

# Raw matrix multiplication benchmark (1024×1024) on POWER10
BF16: 384.4 GFLOPS  (5.6 ms)
FP32: 249.6 GFLOPS  (8.6 ms)
BF16/FP32 ratio: 1.54x

BF16 is 1.54× faster than FP32. MMA is active and delivering measurable hardware acceleration. Our 13.9 tok/s on a single POWER10 core already includes MMA. That’s the real, hardware-accelerated number. The power of POWER10’s AI acceleration capabilities is something we cover in depth in our Linux on IBM POWER Systems training courses.
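The raw GEMM comparison above can be reproduced with a few lines of PyTorch — a sketch only; absolute GFLOPS depend on your core, thread count, and BLAS build:

```python
import time
import torch

def gemm_gflops(dtype, n=1024, iters=20):
    """Time an n x n matrix multiply and return achieved GFLOPS."""
    a = torch.randn(n, n).to(dtype)
    b = torch.randn(n, n).to(dtype)
    torch.matmul(a, b)                      # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    secs = (time.perf_counter() - t0) / iters
    return 2 * n**3 / secs / 1e9            # one GEMM is ~2·n³ FLOPs

print(f"BF16: {gemm_gflops(torch.bfloat16):.1f} GFLOPS")
print(f"FP32: {gemm_gflops(torch.float32):.1f} GFLOPS")
```

On a POWER10 core with MMA-enabled OpenBLAS, the BF16 line should come out clearly ahead; on POWER9, expect the opposite.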

The oneDNN investigation (and what we learned)

We initially thought there might be hidden performance left on the table.

The vLLM build bundles oneDNN (originally from Intel). Inside, there are two POWER-specific math paths:

  • int8 GEMM: A hand-written kernel by IBM engineers using MMA instructions for quantized models.
  • BF16 GEMM: A passthrough to OpenBLAS — but only when compiled with specific flags.

Our initial build didn’t have those flags. We recompiled with -DDNNL_BLAS_VENDOR=OPENBLAS, confirmed the flags were active, benchmarked again — same performance.

Why? PyTorch was already going directly to OpenBLAS, bypassing oneDNN for the main matrix operations. The optimization was already there; we just didn’t know it.

Practical takeaway: You don’t need to configure anything special. PyTorch on POWER10 with OpenBLAS automatically uses MMA for BF16 inference. Install the package and run.
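If you want to verify which math backend your own PyTorch build links against, its compile-time configuration is one line away:

```python
import torch

# Prints compile-time configuration: BLAS vendor, OpenMP, ISA flags, etc.
print(torch.__config__.show())
```

On a POWER build backed by OpenBLAS, the BLAS line in that output is the tell.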

What about IBM Spyre?

IBM Spyre is a dedicated AI accelerator card for POWER — a completely separate piece of hardware with its own silicon for AI math. Think of it this way:

  • MMA = built-in acceleration inside every POWER10 core (active right now in our benchmarks)
  • Spyre = a separate AI accelerator card you add to the system (promising, but requires specific IBM software stacks)

Our work focuses on what’s available today using the CPU already in your machine, with zero additional hardware investment.

The complete picture

Technology            What it is (plain English)                     Active in our build?
POWER10 MMA (BF16)    Built-in matrix accelerator in the CPU         Yes — PyTorch → OpenBLAS
POWER10 MMA (int8)    Same hardware, for 8-bit quantized models      Built, not end-to-end yet
IBM Spyre             Separate AI accelerator card                   No — different hardware
OpenShift AI          Full ML platform on Kubernetes                 No — we’re the lightweight path
oneDNN                Math library bundled with vLLM                 Compiled in, bypassed by PyTorch
OpenBLAS              Math library with hand-tuned POWER10 kernels   Yes — the real workhorse

Context

The bigger picture: LLM inference on IBM POWER without OpenShift

Red Hat OpenShift AI

Until now, the official IBM/Red Hat play for LLM inference on IBM POWER was OpenShift AI. It supports notebooks, pipelines, model training, serving, and monitoring. As of version 3.0, it runs on ppc64le with CPU-only workloads.

OpenShift AI is the right choice for organizations that already have OpenShift clusters. It comes with RBAC, InstructLab for model fine-tuning, and enterprise support.

But it requires OpenShift. A Kubernetes cluster, a Red Hat subscription, operator management. For many POWER shops — especially those running standalone Linux or mixed AIX/Linux — that’s a significant commitment just to serve a model. Organizations managing these environments often rely on SIXE’s IBM POWER maintenance and support services to keep them running.

What LibrePower adds

We’re not replacing OpenShift AI. We’re complementing it with a lighter path for the many POWER sites that don’t need the full platform.

                         OpenShift AI                    LibrePower vLLM
Install                  OpenShift cluster + operators   apt install python3-vllm
Infrastructure           Kubernetes required             Any Ubuntu/Debian ppc64le
Scope                    Full ML lifecycle               Inference serving only
Support                  Red Hat subscription            Community (open source)
GPU                      Supported (x86)                 CPU-only (POWER native)
Time to first inference  Hours to days                   Minutes
Cost                     OpenShift licensing             Free

IBM builds the highway — world-class hardware, PyTorch wheels, OpenShift AI, InstructLab. LibrePower adds an on-ramp for people who don’t need the full platform. Both are needed. IBM’s roadmap for AI on IBM POWER is moving fast, and community tooling like this fills real gaps in the ecosystem today.

The infrastructure

How the LibrePower package repository works

We built linux.librepower.org following the same pattern as our AIX package repository — infrastructure that already serves 30+ open-source packages to AIX systems worldwide.

linux.librepower.org/
  dists/jammy/
    InRelease          (GPG signed)
    Release
    main/binary-ppc64el/
      Packages
  pool/main/
    python3-vllm_0.9.2-1_ppc64el.deb
  install.sh
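For reference, what install.sh sets up is, in essence, a standard APT source entry. A hand-written equivalent would look roughly like this — the keyring path is an assumption; the script may place it elsewhere:

```shell
# /etc/apt/sources.list.d/librepower.list (sketch — keyring path assumed)
deb [arch=ppc64el signed-by=/usr/share/keyrings/librepower.gpg] https://linux.librepower.org jammy main
```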

CI/CD runs on GitLab: every push regenerates APT metadata and deploys automatically. All packages compiled on real IBM POWER hardware — not cross-compiled, not emulated. The full source is on GitLab under Apache 2.0.

Roadmap

What’s next for vLLM on IBM POWER

  • More models tested — Llama, Mistral, Phi, Granite. Systematic benchmarks across model families.
  • llama.cpp for ppc64le — GGUF quantized models for even lower memory footprint. Already shipping for AIX.
  • Ubuntu 24.04 and Debian 12 support — Extending the package to the latest LTS releases.
  • POWER10-optimized variants — Going deeper into MMA tuning. Our current 13.9 tok/s per core is a starting point, not a ceiling.
  • int8 GEMM end-to-end — Completing the MMA path for quantized models, which should improve throughput further.

Want to run AI workloads on your IBM POWER infrastructure?

SIXE helps organizations deploy and operate Linux on IBM POWER — from official IBM Linux on Power training to full infrastructure support. If you’re evaluating LLM inference on existing POWER hardware, talk to us.

Got a ppc64le system?

Try vLLM on IBM POWER now

If you have a system running Ubuntu, it’s three commands. Source is on GitLab if you want to dig in or contribute. IBM POWER training and infrastructure support by SIXE.

# Add the LibrePower repository
curl -fsSL https://linux.librepower.org/install.sh | sudo sh

# Install vLLM for ppc64le
sudo apt update && sudo apt install python3-vllm

# Install PyTorch (IBM wheels)
pip3 install torch --extra-index-url \
  https://wheels.developerfirst.ibm.com/ppc64le/linux

# Run the OpenAI-compatible inference server
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --device cpu --dtype bfloat16 --port 8000