Open source storage for AI and HPC: when Ceph is no longer an alternative but the only viable way forward

When CERN needs to store and process data from the Large Hadron Collider (LHC, the world’s largest and most powerful particle accelerator), scale is everything. At this level, technology and economics converge in a clear conclusion: open source technologies such as Ceph, EOS and Lustre are not an “alternative” to traditional enterprise solutions; in many scenarios, they are the only viable way forward.

With more than 1 exabyte of disk storage, 7 billion files and 45 petabytes per week processed during data collection campaigns, the world’s largest particle physics laboratory is moving into a field where classical capacity-based licensing models no longer make economic sense.

This reality, documented in the paper presented at CHEP 2025, “Ceph at CERN in the multi-datacentre era”, reflects what more and more universities and research centers are realizing: there are use cases where open source does not compete with enterprise solutions. It defines its own category, one for which traditional architectures were simply not designed.


CERN: numbers that change the rules

The CERN figures are not only impressive; they explain why certain technologies are chosen:

  • >1 exabyte of disk storage, distributed over ~2,000 servers with 60,000 disks.

  • >4 exabytes of annual transfers.

  • Up to 45 PB/week and >10 GB/s sustained throughput during data collection periods.

The architecture is heterogeneous by necessity:

  • EOS for physics files (more than 1 EB).

  • CTA (CERN Tape Archive) for long-term archiving.

  • Ceph (more than 60 PB) for blocks, S3 objects and CephFS, backing the OpenStack cloud.

It is not only the volume that is relevant, but also the trajectory. In a decade, CERN has gone from a few petabytes to exabytes without disruptive architectural leaps, simply by adding commodity nodes horizontally. This elasticity does not exist in proprietary arrays with capacity-based licenses.

The economics of the exabyte: where capacity models fail

Current licensing models in the enterprise market are reasonable for typical environments (tens or hundreds of terabytes, predictable growth, balanced CapEx and OpEx). They provide integration, 24×7 support, certifications and a partner ecosystem. But at petabyte or exabyte scale with rapid growth, the equation changes.

  • At SIXE we are an IBM Premier Partner and have followed IBM’s evolution towards capacity-based licensing first-hand:

    • IBM Spectrum Virtualize uses Storage Capacity Units (SCUs), roughly 1 TB per SCU. The annual cost per SCU can range between €445 and €2,000, depending on volume, customer profile and environment.

    • IBM Storage Defender uses Resource Units (RUs). For example, IBM Storage Protect consumes 17 RUs/TB for the first 100 TB and 15 RUs/TB for the next 250 TB, allowing resiliency capabilities to be combined under a unified license.

  • Similar models exist at NetApp (term-capacity licensing), Pure Storage, Dell Technologies and others: you pay for managed or provisioned capacity.

All of this works well in conventional enterprise environments. However, managing 60 PB under per-capacity licensing, even with generous volume discounts, can translate into millions of euros per year in software alone, without counting hardware, support or services. At that point, the question is no longer whether open source is “viable”, but whether there is any realistic alternative to it at these scales.
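
To see what this means in numbers, here is a back-of-the-envelope sketch in Python using only the figures quoted above (an SCU of ~1 TB at €445–2,000 per year; Defender tiers of 17 and 15 RUs/TB). Real contracts are negotiated, so this only illustrates the order of magnitude:

```python
# Back-of-the-envelope licensing arithmetic based on the figures above.
# Real quotes are negotiated; this only shows the order of magnitude.

TB_PER_PB = 1024  # binary convention; vendors sometimes count in decimal

def scu_cost_range(capacity_tb, eur_low=445.0, eur_high=2000.0):
    """Spectrum Virtualize: ~1 TB per SCU, 445-2,000 EUR per SCU per year."""
    return capacity_tb * eur_low, capacity_tb * eur_high

def defender_rus(capacity_tb):
    """Storage Protect under Defender: 17 RUs/TB for the first 100 TB,
    15 RUs/TB for the next 250 TB (higher tiers omitted here)."""
    first = min(capacity_tb, 100) * 17
    rest = max(min(capacity_tb - 100, 250), 0) * 15
    return first + rest

low, high = scu_cost_range(60 * TB_PER_PB)   # a 60 PB, Ceph-sized estate
print(f"60 PB under SCU licensing: {low/1e6:.0f}-{high/1e6:.0f} M EUR/year")
print(f"350 TB under Defender: {defender_rus(350):,} RUs")
```

Even at the very bottom of the SCU price range, 60 PB works out to roughly €27 million per year in software, which is the arithmetic behind the paragraph above.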

Technical capabilities: an already mature open source

The economic advantage would mean little if the technology were inferior. It is not. For certain AI and HPC workloads, the capabilities are equivalent or superior:

  • Ceph offers unified storage virtualization with thin provisioning, BlueStore compression, snapshots and copy-on-write clones without significant penalty, multisite replication (RGW and RBD), and tiering between media; a short sketch of these primitives follows this list. And if you want your team to learn how to take advantage of Ceph, see our training courses below.

  • CERN documents multi-datacenter strategies for business continuity and disaster recovery using stretch clusters and multisite replication, with RPO/RTO comparable to enterprise solutions.
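
As a taste of those primitives, here is a minimal sketch using the official rados and rbd Python bindings that ship with Ceph: a thin-provisioned image, an instant snapshot and a copy-on-write clone. Pool, image and snapshot names are hypothetical, and error handling is omitted.

```python
# Thin-provisioned RBD image, snapshot and copy-on-write clone via the
# "rados" and "rbd" Python bindings. All names are hypothetical.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                        # hypothetical pool
try:
    rbd.RBD().create(ioctx, 'dataset', 10 * 1024**4)     # 10 TiB, allocated lazily
    with rbd.Image(ioctx, 'dataset') as img:
        img.create_snap('golden')                        # instantaneous snapshot
        img.protect_snap('golden')                       # required before cloning
    # The clone shares all unmodified data with its parent (copy-on-write)
    rbd.RBD().clone(ioctx, 'dataset', 'golden', ioctx, 'experiment-42',
                    features=rbd.RBD_FEATURE_LAYERING)
finally:
    ioctx.close()
    cluster.shutdown()
```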

IBM recognizes this maturity with IBM Storage Ceph (a derivative of Red Hat Ceph Storage), which combines open source technology with enterprise-grade support, certifications and SLAs. At SIXE, as an IBM Premier Partner, we implement IBM Storage Ceph when commercial support is required, and upstream Ceph when flexibility and independence take priority.

Key architectural difference:

  • IBM Spectrum Virtualize is an enterprise layer that manages heterogeneous block storage, with dedicated nodes or instances and advanced mobility, replication and automation features.

  • Ceph is a natively distributed system that serves blocks, objects and files from the same horizontal infrastructure, eliminating silos. In AI pipelines (objects for datasets, blocks for metadata, file shares for collaboration), this unification brings clear operational advantages; a sketch follows this list.
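
To make that unification tangible, here is a minimal sketch using the rados and cephfs Python bindings: the same ceph.conf, and therefore the same cluster, serves native object I/O and POSIX file access (alongside the RBD block devices shown earlier). Pool and path names are hypothetical.

```python
# One cluster, several interfaces: the same ceph.conf backs object, block
# and file access. Pool and path names are hypothetical.
import rados
import cephfs

# Object interface (librados), e.g. dataset shards
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('datasets')                   # hypothetical pool
ioctx.write_full('batch-0001', b'...')                   # store a shard
ioctx.close()
cluster.shutdown()

# File interface (CephFS), e.g. shared collaboration space
fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()
fd = fs.open('/shared/notes.txt', 'w', 0o644)            # 'w' creates the file
fs.write(fd, b'same cluster, POSIX semantics', 0)
fs.close(fd)
fs.shutdown()
```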


Large-scale AI and HPC: where distributed storage shines

Training foundation models means reading petabytes in parallel, with aggregate bandwidths of 100 GB/s or more. Inference requires sub-10 ms latencies with thousands of concurrent requests.

Traditional architectures with SAN controllers suffer bottlenecks when hundreds of GPUs (A100, H100…) access data at the same time. It is estimated that about 33% of GPUs in corporate AI environments operate at less than 15% utilization due to storage saturation, with the consequent cost in underutilized assets.

Distributed architectures – Ceph, Lustre, BeeGFS – were born for these patterns:

  • Lustre powers 7 of the top 10 supercomputers in the Top500, with >1 TB/s aggregate throughput in large installations. Frontier (ORNL) uses ~700 PB of Lustre and sustains writes above 35 TB/s.

  • BeeGFS scales storage and metadata independently, exceeding 50 GB/s sustained with tens of thousands of clients in production.

  • MinIO, optimized for object storage in AI, has demonstrated >2.2 TiB/s read performance in training workloads, difficult to match with centralized architectures.

GPU integration has also matured: GPUDirect Storage allows GPUs to read from NVMe-oF without passing through the CPU, reducing latency and freeing up cycles. Modern open source systems support these protocols natively; proprietary solutions often depend on firmware updates and certifications that take quarters to arrive.
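
As an illustration of what that path looks like from user code, here is a minimal sketch using NVIDIA’s KvikIO library (Python bindings for cuFile, the GPUDirect Storage API). The file path and array shape are hypothetical, and KvikIO falls back to regular POSIX reads on systems without GDS.

```python
# Minimal GPUDirect Storage read via KvikIO (cuFile bindings).
# Path and shape are hypothetical; requires a CUDA GPU, GDS optional.
import cupy
import kvikio

buf = cupy.empty((1024, 1024, 256), dtype=cupy.float32)  # ~1 GiB on the GPU

# With GDS available, this DMA-transfers NVMe -> GPU memory, bypassing a
# CPU bounce buffer; without it, KvikIO degrades to ordinary reads.
with kvikio.CuFile("/data/shard-0000.bin", "r") as f:
    nbytes = f.read(buf)

print(f"read {nbytes} bytes directly into GPU memory")
```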

SIXE: sustainable open source, with or without commercial support

Migrating to large-scale open source storage is not trivial. Distributed systems require specific experience.

At SIXE we have been working with Linux and open source for more than 20 years. As an IBM Premier Partner, we offer the best of both worlds:

  • IBM Storage Ceph and IBM Storage Scale (formerly Spectrum Scale/GPFS) for those who need guaranteed SLAs, certifications and 24×7 global support.

  • Upstream Ceph (and related technologies) for organizations that prefer maximum flexibility and control.

It is not a contradictory position, but a strategic one: different profiles, different needs. A multinational bank values certifications and enterprise support; a research center with a strong technical team can operate upstream directly.

Our intensive Ceph training takes the form of hands-on three-day workshops: real clusters are deployed and real design decisions are made. Knowledge transfer reduces dependence on consultants and empowers the internal team. If your team still has little experience with Ceph, see our initiation course; if, on the other hand, you want to get the most out of Ceph, our advanced Ceph course shows your team how to integrate two crucial technologies right now: storage + AI.

 

Our philosophy: we do not sell technology, we transfer capability. We deploy IBM Storage Ceph with full support, upstream Ceph with our specialized support, or hybrid approaches, on a case-by-case basis.

The opportunity for massive data and science

Several factors align:

  • Data is growing exponentially: a NovaSeq X Plus sequencer can generate 16 TB per run; the SKA telescope will produce exabytes per year; AI models demand ever-larger datasets.

  • Budgets do not grow at the same pace. Capacity licensing models make it unfeasible to scale proprietary systems at the required rate.

Open source solutions, whether upstream or commercially supported (e.g., IBM Storage Ceph), eliminate this dichotomy: growth is planned around hardware cost and operational capacity, with software whose costs do not scale linearly per terabyte.

Centers such as Fermilab, DESY, CERN itself and the Barcelona Supercomputing Center have demonstrated that this approach is technically feasible and operationally superior for their use cases. In its recent paper, CERN details multi-datacenter DR strategies with Ceph (stretch clusters and multisite replication), achieving availability comparable to enterprise solutions, with flexibility and total control.
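
From a client’s perspective, that multisite behaviour is deliberately unspectacular, which is the point: an object written in one zone becomes readable from the other once asynchronous replication catches up. Here is a minimal sketch with boto3 against two hypothetical RGW endpoints (credentials and names are placeholders; the replication itself is configured server-side):

```python
# RGW multisite seen from the client side, via boto3. Endpoints, keys and
# bucket names are placeholders; zone replication is configured server-side.
import boto3

def s3_client(endpoint):
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id="ACCESS_KEY",        # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

zone_a = s3_client("https://rgw.dc1.example.org")   # primary datacenter
zone_b = s3_client("https://rgw.dc2.example.org")   # secondary datacenter

zone_a.put_object(Bucket="datasets", Key="run-2025/events.parquet",
                  Body=b"...")                      # written in DC1

# Once asynchronous replication catches up, DC2 serves the same object;
# the replication lag is what bounds the achievable RPO.
obj = zone_b.get_object(Bucket="datasets", Key="run-2025/events.parquet")
print(obj["ContentLength"])
```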

A maturing ecosystem: the time to plan is now

The open source storage ecosystem for HPC and AI is evolving fast:

  • The Ceph Foundation (under the Linux Foundation) coordinates contributions from CERN, Bloomberg, DigitalOcean, OVH and IBM, among others, aligned with real production needs.

  • IBM maintains IBM Storage Ceph as a supported product and actively contributes upstream.

It is the ideal confluence of open source innovation and enterprise support. For organizations with a horizon of decades, the question is no longer whether to adopt open source, but when and how to do so in a structured way.

The technology is mature, the success stories are documented, and support exists in both community and commercial form. What is often missing is the expertise to draw up the roadmap: the model (upstream, commercial or hybrid), sizing, training and sustainable operation.

SIXE: your partner for storage that grows with you

At SIXE we work at that intersection. As an IBM Premier Partner, we have access to world-class support, roadmaps and certifications. At the same time, we maintain deep expertise in upstream Ceph and other ecosystem technologies, because there is no one-size-fits-all solution.

When a center contacts us, we don’t start with the catalog, but with the key questions:

  • What are your access patterns?

  • What growth do you project?

  • What capabilities does your team have?

  • What risks can you assume?

  • What is your budget (CapEx/OpEx)?

The answers guide the recommendation: IBM Storage Ceph with enterprise support, upstream Ceph with our support, a hybrid, or even an assessment of whether a traditional solution still makes sense in your case. We design solutions to work for 5 and 10 years; what matters to us is creating durable, sustainable solutions ;)

Our commitment is to sustainable technologies, not subject to commercial fluctuations, that provide control over infrastructure and scale both technically and economically.

The case of CERN is not an academic curiosity: it shows where storage for data-intensive workloads is heading. The question is not whether your organization will get there, but how it will arrive: prepared, or in a rush. The window of opportunity to plan calmly is open. The success stories exist. The technology is ready. So is the ecosystem. What remains is the strategic decision to invest in infrastructure that will accompany your organization through decades of data growth.

Contact us!

Does your organization generate massive volumes of data for AI or research? At SIXE we help research centers, universities and innovative organizations design, implement and operate scalable storage with Ceph, Storage Scale and other leading technologies, both upstream and with IBM commercial support, according to your needs. Contact us for a no-obligation strategic consultation.
