Advanced Training

Ceph Production Operations | Course

When a 200TB cluster crashes at 3AM, you need answers—not theory

3 DAYS Intensive
100% Hands-on
REAL Scenarios

Distribution-agnostic

IBM Storage Ceph, Red Hat, Ubuntu, Rocky, Alma Linux, or upstream Ceph

3:00 AM: CLUSTER CRITICAL

  • OSD Failure: 12 OSDs down
  • CephFS: Metadata corrupt
  • Performance: IOPS -80%
  • Recovery: Plan active

You'll learn to solve:

  • Critical failures in 200TB+ clusters
  • Recovery of 40TB corrupted CephFS
  • Extreme tuning for AI/ML (500TB/day)
  • Troubleshooting under 24/7 pressure

Who is this for?

Certified administrators or those with production experience who need to master real-world critical scenarios that vendors don't teach.

Course structure

Intensive 3-day program designed to tackle real-world crises and optimize production clusters at petabyte scale

01

Advanced Performance Engineering & Forensics

From architecture to real forensic troubleshooting

Morning: Architectural Optimization

  • BlueStore internals: RocksDB tuning, compaction, write amplification
  • CPU optimization: C-states impact (labs showing 5x degradation), NUMA
  • Network: 100GbE patterns, TCP tuning, nf_conntrack
  • NVMe-specific: reactor tuning, bdevs_per_cluster optimization

Afternoon: Forensic Troubleshooting

  • Diagnostic toolchain: blktrace, perf, ceph-objectstore-tool
  • Real case studies: NVMe degradation, post-upgrade OSD flapping
  • Advanced PG lifecycle: stuck states, manual intervention
  • Labs: Cluster with real problems to diagnose

02

Disaster Recovery, Multi-Site & Petabyte Scaling

Extreme recovery and multi-site architectures

Morning: Advanced DR

  • Edinburgh 40TB case: complete error chain and recovery procedures
  • CephFS disasters: metadata corruption, MDS failure handling
  • RBD mirroring: pool vs image-based, failover automation
  • Physical DR: disk extraction, journal, whoami preservation

Afternoon: Multi-Site & Petabytes

  • RGW multisite: master zone failure, manual promotion, sync fairness
  • WAN planning: formulas for 1 GbE per 8TB daily ingest
  • Petabyte challenges: CERN 30PB (7,200 OSDs), 310M objects
  • Labs: Multi-site failover and recovery simulation
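
The WAN planning rule of thumb above (1 GbE per 8TB daily ingest) can be sanity-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the course material. It assumes decimal units (1 TB = 10^12 bytes), ingest spread evenly over 24 hours, and no protocol or replication overhead. The function name `required_gbps` is hypothetical.

```python
# Sketch: average link rate needed to carry a given daily ingest.
# Assumptions (not from the course material): decimal units,
# perfectly even traffic over 24 hours, zero protocol overhead.

SECONDS_PER_DAY = 24 * 60 * 60


def required_gbps(tb_per_day: float) -> float:
    """Return the sustained rate in Gbit/s needed to move tb_per_day."""
    bits_per_day = tb_per_day * 10**12 * 8  # TB -> bytes -> bits
    return bits_per_day / SECONDS_PER_DAY / 10**9


if __name__ == "__main__":
    # 8 TB/day is the figure from the WAN planning bullet;
    # 500 TB/day is the AI/ML figure quoted earlier on this page.
    for tb in (8, 500):
        print(f"{tb} TB/day needs about {required_gbps(tb):.2f} Gbit/s sustained")
```

Under these assumptions, 8TB per day works out to roughly three quarters of a 1 GbE link, so the rule leaves some headroom for bursts and overhead. Real sizing would also need to account for peak-hour traffic and sync retries.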

03

Security, AI/ML Workloads & Cost Engineering

Enterprise security and optimization for modern workloads

Morning: Security Hardening

  • Encryption: LUKS/dmcrypt OSDs, msgr2 secure, RGW SSE-S3/KMS
  • Key management: rotation (Squid 19.2.3+), Barbican integration
  • Compliance: HIPAA architecture, GDPR, audit logging
  • Threat detection: monitoring patterns, vulnerability management

Afternoon: AI/ML & ROI Engineering

  • S3 Select: Trino integration (2.5x-9x performance), analytics pushdown
  • AI/ML patterns: checkpointing, parallel access optimization
  • TCO analysis: EC efficiency, commodity hardware savings
  • Hybrid architectures: OpenStack DCN, edge-to-core, multi-cloud

Lab specifications

Realistic enterprise cloud infrastructure

🖥️ Infrastructure

  • Real 5-6 node cluster
  • 500GB+ pre-populated data per student
  • 24/7 access for 7+ days post-course

⚠️ Real scenarios

  • Disk failures & network partitions
  • Simulated metadata corruption
  • Injected performance degradation

🔧 Tools

  • blktrace, perf, ceph-objectstore-tool
  • Pre-installed debugging symbols
  • Real datasets with I/O patterns

Supported distributions and versions

Available distributions:

  • Rocky Linux 9
  • Ubuntu 24.04 LTS
  • Red Hat Enterprise Linux

Ceph versions:

  • Upstream Squid 19.2+
  • IBM Storage Ceph 7.1
  • Red Hat Ceph Storage 7.x

Upcoming Sessions

Intensive 3-day training designed for small groups (maximum 10 participants) to maximize interaction and collaborative troubleshooting

In-Person

At our facilities with full access to labs and specialized equipment

On-Site

At your organization for teams of 4+ people with customized configuration

Remote

With dedicated cloud lab and full access to real-time practice resources

Ready to stop fearing critical scenarios?

Request information about upcoming dates, detailed curriculum, and terms. Response guaranteed within 24 hours.

Or call us directly with your questions

Technical training in Ceph

Ceph Storage - The most comprehensive series of courses on the market

Ceph Administration

Fundamentals and deployment

See course →

Ceph Advanced

Advanced configuration and EX260

See course →

Ceph Production Operations

Troubleshooting and DR

You are viewing this course →

Request this course at CEPH

FAQ

Frequently Asked Questions

Do I need to have taken your previous courses?

It's not mandatory, but you DO need equivalent knowledge. This course assumes you have mastered Ceph architecture (MON/OSD/MGR), pool/PG/CRUSH management, and basic troubleshooting, and that you have practical experience managing production clusters (2+ years or equivalent courses). If you completed our basic and advanced courses, you're well prepared.

What if I don't have EX260 certification?

Certification is NOT a requirement. What matters is your real practical experience. If you've been administering Ceph in production for years, with or without managed services, and know the fundamental concepts well, this course is for you. In fact, many of our best students aren't certified but bring real production problems that we solve together.

Which Ceph distribution do you use?

The course is completely distribution-independent. Labs can be configured with IBM Storage Ceph, upstream Ceph (Squid 19.2+), Red Hat Ceph Storage, or whichever version you prefer. The troubleshooting, DR, and optimization techniques we teach are universal: they work the same on Rocky Linux, Ubuntu, RHEL, or Alma Linux. You decide which configuration best matches your production environment.

How does it differ from your advanced course?

The advanced course covers deployment, advanced configuration, and preparation for EX260. This third course focuses 100% on critical production operations: forensic troubleshooting when everything fails, REAL disaster recovery (not simulations), advanced performance engineering, and complex multi-factor scenarios. The two are complementary: think of the advanced course as "how to configure it well" and this one as "what to do when it fails badly".

What equipment do I need?

A laptop with an SSH client, a modern web browser, and stable internet access. The complete lab runs on enterprise cloud infrastructure, so you don't need to install anything locally. We recommend 16GB RAM and a large screen (or dual monitors) to manage multiple terminals and windows simultaneously during troubleshooting.

Do you offer a remote option?

Yes. We offer three formats: (1) In-person at our facilities for maximum interaction, (2) On-site at your organization for teams of 4+ people, and (3) Remote with a dedicated cloud lab. The remote format includes all the same exercises and 24/7 lab access. Contact us to discuss which format best fits your needs.

Is there a certificate or accreditation?

We issue a completion certificate with details of content and hours completed. We currently don't offer our own certification because the market still places more value on demonstrable experience and vendor certifications (EX260, etc.). However, the skills you acquire here are verifiable in technical interviews and real situations, which is what really counts.

What if I can't solve the labs?

The labs are designed to challenge, not to frustrate. We work in small groups with direct instructor support. Getting stuck is part of the learning: we analyze together where things went wrong and why. The goal is for you to leave prepared for real scenarios, not to "pass" academic exercises. You keep lab access for 7 days post-course to practice at your own pace.

SIXE