AI Infrastructure · March 2026

What is an AI Factory and how to build one with open source in your own data centre

The AI Factory concept has been everywhere for two years — but few organisations actually understand what it takes to build one, or how to do it without being locked into a cloud provider. Here’s a straight-talking breakdown, with the specific stack we use in production.

March 2026 · 20 min read

An AI factory is not a server with a GPU and a model downloaded from Hugging Face. It is a distributed compute infrastructure designed to run language and vision models in production continuously, at scale, and under full organisational control. The good news: building one is no longer the exclusive privilege of hyperscalers. The open source technology powering the Barcelona Supercomputing Center’s AI Factory, and the sovereign AI infrastructure programmes across Europe, is available to any organisation with its own data centre. What follows is a practical guide to what you need, what you don’t, and how to decide whether it makes sense for you.

Just a bit of context

What exactly is an AI factory?

The term “AI Factory” was popularised by NVIDIA’s Jensen Huang in 2023 to describe what data centres are becoming: machines that produce intelligence continuously, the way a factory produces goods. The metaphor isn’t poetic — it’s technically precise.

A classic AI factory has four distinct components: a storage system for model weights and datasets (which run to tens or hundreds of gigabytes), a GPU compute layer for running inference, an orchestrator managing which model runs on which hardware, and an API that exposes models to the rest of the organisation. When those four components work efficiently together, you have an AI factory.

What differentiates it from “having an LLM running on a server” is scale, reliability, and management. An AI factory serves multiple models in parallel, handles request queuing, guarantees availability, and monitors resource usage. It’s production infrastructure — not a test environment.

Worth knowing

The European Commission has committed over €1.5 billion to building AI Factories distributed across member states under the EuroHPC programme. The explicit goal is for Europe to have sovereign AI infrastructure that doesn’t depend on US or Asian providers. Spain participates through the BSC. The same technology stack they use can be deployed in your data centre.

Why bring AI inference in-house?

Why organisations are moving their AI workloads on-premises

Three arguments come up in every conversation we have with clients evaluating on-premise AI infrastructure. These aren’t marketing talking points — they’re operational and financial realities.

💸

Predictable costs

GPU bills in public cloud can swing 30–40% between billing cycles depending on demand. With your own infrastructure, the cost is fixed, depreciable, and entirely predictable. At medium inference volumes, the investment typically pays back in 12–18 months compared to cloud.

🔓

Zero vendor lock-in

Proprietary APIs, closed formats, fine-tuned models living in someone else’s infrastructure. With an open source stack, your models and your data are yours — always portable, no exit negotiations, no lock-in contracts.

🏛️

Regulatory compliance

GDPR and the EU AI Act require knowing precisely where data is processed. If your inference touches patient data, citizen records, or financial data, you need full control over the infrastructure. On-premises is the only architecturally sound answer.

The question is no longer whether to build your own AI infrastructure, but when and how to do it without repeating the mistakes of the cloud rush a decade ago: velocity without architecture.
— SIXE Technical Team

That said, on-premise AI infrastructure isn’t right for everyone. If you’re running ten inference requests a day and have no strict regulatory requirements, cloud is probably the right answer right now. On-premises starts making sense when volumes are sustained, when data is sensitive, or when you need to run proprietary fine-tuned models without exposing weights to third parties.

OK so… how do I actually build this?

The open source stack: three technologies, zero proprietary dependencies

A combination of three technologies has emerged as the de facto standard for building on-premise AI factories in European enterprise environments. The same stack the BSC uses. The same technologies driving sovereign AI infrastructure in France, Germany, and Italy. And the same stack we deploy at SIXE.

Ceph: distributed storage built for AI workloads

Language models are heavy. Llama 3 70B occupies roughly 140 GB in float16 precision, or around 40 GB with 4-bit quantisation. Mixtral 8x7B sits around 90 GB in float16. A reasonable model catalogue for a mid-sized organisation can easily exceed 500 GB, before accounting for fine-tuning datasets or inference logs.

Ceph solves this with distributed storage that unifies object storage (natively S3-compatible), block storage, and filesystem in a single cluster. It scales from terabytes to petabytes without interruption, supports erasure coding for storage efficiency, and has native Kubernetes integration via CSI. In an AI factory, Ceph acts as the backbone where model weights, datasets, and inference results live.
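To make the erasure-coding point concrete, here is a back-of-envelope comparison of raw capacity needed under 3x replication versus a 4+2 erasure-coded pool. The figures are illustrative only: real Ceph pools add metadata and rebalancing overhead, and 4+2 is just one common profile.

```python
# Storage efficiency: 3x replication vs 4+2 erasure coding.
# Illustrative arithmetic only; real pools carry extra overhead.

def raw_capacity_needed(usable_tb: float, scheme: str) -> float:
    """Raw disk capacity required to store `usable_tb` of usable data."""
    if scheme == "replica-3":
        return usable_tb * 3            # three full copies of every object
    if scheme == "ec-4+2":
        return usable_tb * (4 + 2) / 4  # 4 data chunks + 2 coding chunks
    raise ValueError(f"unknown scheme: {scheme}")

# Storing a 500 GB (0.5 TB) model catalogue:
print(raw_capacity_needed(0.5, "replica-3"))  # 1.5 TB raw
print(raw_capacity_needed(0.5, "ec-4+2"))     # 0.75 TB raw
```

At catalogue scale the difference is modest; at petabyte scale it is the difference between one storage budget and two.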

SIXE Perspective

We are a Canonical Partner and have been deploying Ceph clusters in production for years, including in AI and HPC environments. Ceph is not a box you tick: it requires careful sizing, network design, and replication policies adapted to your workload. Three-node clusters have quorum considerations that shouldn't be improvised. We offer dedicated training and support so your team can operate Ceph with real autonomy, not permanent consulting dependency.

OpenStack: your private cloud with native GPU management

OpenStack turns your hardware into a private enterprise cloud. For an AI factory, its primary role is GPU resource management: PCI passthrough for direct GPU access from VMs, vGPU for sharing a physical GPU across multiple workloads, and NVIDIA MIG (Multi-Instance GPU) for partitioning A100 and H100 GPUs into independent instances.

Stewarded by the OpenInfra Foundation, which joined the Linux Foundation in 2025, OpenStack runs in production across more than 45 million cores at organisations including Walmart, GEICO, and LINE Corp. This isn't emerging technology: it's proven infrastructure at real scale, with independent governance that guarantees continuity.

Worth noting

OpenStack is not trivial. It spans more than 30 service projects and requires teams with experience in distributed systems. If your team comes from a VMware background, the learning curve is real. Our training service covers practical, hands-on upskilling so your team can operate the stack independently — without long-term consulting dependency.

Kubernetes + vLLM: the inference orchestration layer

Kubernetes is the CNCF standard for container workload orchestration, with native GPU scheduling via the NVIDIA GPU Operator. The inference engines are deployed on top of Kubernetes — and vLLM is the most significant for language models right now.
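As a sketch of what GPU scheduling looks like in practice, the manifest below requests one GPU through the `nvidia.com/gpu` extended resource that the GPU Operator exposes. The image tag and model name are illustrative placeholders, not a tested configuration.

```python
import json

# Minimal Kubernetes pod manifest requesting a single GPU via the
# nvidia.com/gpu extended resource (exposed by the GPU Operator).
# Image and model name below are placeholders for illustration.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vllm-inference"},
    "spec": {
        "containers": [{
            "name": "vllm",
            "image": "vllm/vllm-openai:latest",
            "args": ["--model", "your-org/your-model"],
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}

print(json.dumps(pod, indent=2))
```

In a real deployment this would be a Deployment with autoscaling rather than a bare Pod, but the GPU request mechanism is the same.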

vLLM implements PagedAttention, a technique that manages KV cache memory efficiently and enables parallel serving of multiple requests without wasting VRAM. In representative benchmarks, vLLM delivers 3–5x the throughput of a naive implementation of the same model. It exposes an OpenAI-compatible API, which makes migrating applications already consuming GPT-4 or similar models straightforward.
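A minimal sketch of what that compatibility means for client code, assuming a vLLM server at a placeholder local URL and a placeholder model name. Only standard-library pieces are used; the live call is left commented out.

```python
import json
from urllib import request  # used only by the commented-out live call

# vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint.
# URL and model name are placeholders for your own deployment.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> bytes:
    """Serialise an OpenAI-style chat completion request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return json.dumps(body).encode("utf-8")

payload = build_chat_request("llama-3-70b", "Summarise our SLA policy.")

# Against a live vLLM server, the same payload an app already sends
# to an OpenAI endpoint works unchanged:
# req = request.Request(VLLM_URL, data=payload,
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```

This is why migration is low-friction: applications swap a base URL, not their request format.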

For vision or embedding models, NVIDIA’s Triton Inference Server complements vLLM and enables hardware-specific optimisations such as TensorRT-LLM.

How does an AI factory take shape?

Reference architecture: from data to model in production

An on-premise AI factory with this stack follows a four-layer flow. It’s not the only possible design, but it’s the one that best balances operational complexity, performance, and portability.

01 — Data

Ceph S3

Models, datasets, inference results. S3-compatible API for integration with MLOps pipelines.

02 — Compute

OpenStack

GPU scheduling, bare metal, isolated project networks. PCI passthrough and MIG for maximum efficiency.

03 — Orchestration

Kubernetes

GPU Operator, inference pod autoscaling, deployment lifecycle management.

04 — Production

vLLM / Triton

Inference APIs, RAG, agents. OpenAI-compatible endpoints for friction-free integration.

The key to this design is that each layer is independent and replaceable. If a better orchestrator than Kubernetes emerges for AI workloads tomorrow, you can swap it out without touching storage or the compute layer. That’s what zero vendor lock-in really means: not just open source software, but genuine separation of concerns in the architecture.

| Component | Role in the factory | Viable alternatives | Governance |
|---|---|---|---|
| Ceph | Model and data storage | IBM Storage Scale (GPFS) | Linux Foundation |
| OpenStack | Private cloud with GPU management | MaaS + direct bare metal | OpenInfra / LF |
| Kubernetes | Container orchestration | MicroK8s, OpenShift | CNCF / LF |
| vLLM | LLM inference engine | Triton, TensorRT-LLM | Apache 2.0 |
| Ubuntu / Canonical | Base OS + stack support | RHEL, SUSE | Canonical Partner |

Is this right for my organisation?

Who actually needs an on-premise AI factory

Not every sector has the same urgency or the same constraints. In four areas, on-premise AI infrastructure isn’t a preference — it’s the only architecturally viable answer.

🏥

Healthcare & pharma

Clinical records, diagnostic imaging, genomic data. GDPR and the EU Health Data Space Regulation impose strict restrictions on transfers to third countries without explicit safeguards. On-premise inference is the default compliance architecture.

🏦

Banking & insurance

Credit scoring, fraud detection, risk analysis. EBA guidelines on AI in financial services and the EU AI Act classify these systems as high-risk, with traceability and control requirements that only on-premises architectures can meet.

🏛️

Public sector & defence

Technological sovereignty, NIS2, classified data. The EU’s AI strategy explicitly requires that public-facing AI systems operate on European or national infrastructure. No discussion needed.

🏭

Industry & manufacturing

Computer vision on production lines, predictive maintenance, quality control. Cloud latency is not viable when you need millisecond response times on the factory floor. Edge or on-premises inference is the only workable model.

FAQ

Questions to answer before you start

Building an on-premise AI factory is not a weekend project. It requires honest prior analysis across three dimensions that determine whether it makes sense and how to execute it well.

Which models are you serving, and at what request volumes?

GPU sizing depends directly on model size (parameter count and precision) and throughput targets (requests per second, acceptable P99 latency). A 7B parameter model in float16 fits in a single L40S GPU with 48 GB of VRAM. A 70B model requires multiple GPUs with tensor parallelism. There are no shortcuts here: correct sizing requires knowing real workloads, not optimistic estimates.
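A back-of-envelope helper illustrates the arithmetic behind that claim. The 30% headroom figure is a rough assumption, and real sizing must also account for KV cache growth with context length and request concurrency, so treat the result as a floor, not a plan.

```python
# Back-of-envelope VRAM sizing for model weights only.
# Real deployments need extra room for KV cache and runtime
# overhead; the 30% headroom below is an assumed rule of thumb.

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory for weights alone: parameter count times precision."""
    return params_billion * bytes_per_param

def fits(params_billion: float, bytes_per_param: float,
         vram_gb: float, headroom: float = 0.3) -> bool:
    """True if weights plus a headroom fraction fit on one GPU."""
    return weight_memory_gb(params_billion, bytes_per_param) * (1 + headroom) <= vram_gb

print(weight_memory_gb(7, 2))   # 14 -> a 7B model in float16 is ~14 GB
print(fits(7, 2, 48))           # True: fits one L40S (48 GB)
print(fits(70, 2, 48))          # False: 70B fp16 needs multiple GPUs
```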

Does your team have the capacity to operate this stack?

The most important question — and the one asked least often. A team with experience in Linux, Kubernetes and distributed systems can learn to operate this stack. But if you’re starting from scratch, the learning curve needs to be inside the plan, not outside it. SIXE offers certified training in Ceph, OpenStack and Kubernetes (as an IBM Business Partner and Canonical Partner) precisely so the transition doesn’t create permanent consulting dependency.

What is the real 3-year TCO?

The software is open source, so there are no licence costs. The investment is hardware (GPUs, servers, high-speed networking) plus team upskilling. Compared to cloud GPU costs at the same inference volume over that period, the numbers tend to speak for themselves. But the financial model must include maintenance, updates, and staff operating time. Nothing is free — and projects that start from that assumption tend to hit unpleasant surprises.
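As a sanity check on that framing, here is a minimal comparison sketch. Every figure is an invented placeholder, not a quote or a savings claim: substitute your own hardware pricing, cloud rates, and utilisation. The point it illustrates is that the answer hinges on sustained utilisation.

```python
# Illustrative 3-year TCO comparison. All numbers are placeholder
# assumptions; replace them with your own quotes before deciding.

YEARS = 3

def on_prem_tco(hardware: float, annual_ops: float) -> float:
    """Capex plus annual opex (power, maintenance, staff time)."""
    return hardware + annual_ops * YEARS

def cloud_tco(gpu_hour_rate: float, gpu_hours_per_year: float) -> float:
    """Pay-per-use GPU rental over the same period."""
    return gpu_hour_rate * gpu_hours_per_year * YEARS

# Hypothetical: a 4-GPU server at 120k plus 30k/year to run it,
# vs renting 4 GPUs at 2.50/hour at two utilisation levels.
print(on_prem_tco(120_000, 30_000))      # 210000
for util in (0.3, 0.9):
    hours = 4 * 8760 * util              # 4 GPUs, fraction of the year
    print(util, round(cloud_tco(2.50, hours)))
```

At low utilisation the rental model wins; at sustained high utilisation the fixed investment does, which matches the "sustained volumes" criterion above.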

How we work at SIXE

Before any deployment, we carry out an architecture assessment: we audit your real workloads, latency requirements, data volumes, and regulatory obligations. We deliver a complete design — GPU sizing, network topology, Ceph storage layout, and a 12–24 month capacity plan. No smoke, no savings promises we haven’t calculated. Just a technical analysis of whether it makes sense, and how to execute it.


Have an AI inference project in mind?

Your AI factory, built with the stack we use ourselves

IBM Business Partner and Canonical Partner. Over 15 years deploying open source in production. We design the architecture, train your team, and support you until the infrastructure runs on its own. Our work ends when yours really begins.

SIXE