⚡ Advanced Training

Ceph Production Operations | Course

When a 200TB cluster crashes at 3 AM, you need answers, not theory

3 DAYS Intensive · 100% Hands-on · REAL Scenarios
🐧

Distribution-agnostic

IBM Storage Ceph, Red Hat, Ubuntu, Rocky, Alma Linux, or upstream Ceph

⚠️ 3:00 AM: CLUSTER CRITICAL

  • 💥 OSD Failure: 12 OSDs down
  • 📁 CephFS: metadata corrupt
  • Performance: IOPS -80%
  • 🔧 Recovery: plan active

🎯 You'll learn to solve:

Critical failures in 200TB+ clusters
Recovery of 40TB corrupted CephFS
Extreme tuning for AI/ML (500TB/day)
Troubleshooting under 24/7 pressure

👥 Who is this for?

Certified administrators or those with production experience who need to master real-world critical scenarios that vendors don't teach

📚

Course Structure

An intensive 3-day program designed to tackle real-world crises and optimize production clusters at petabyte scale

01

Advanced Performance Engineering & Forensics

From architecture to forensic troubleshooting in production

☀️

Morning: Architectural Optimization

  • BlueStore internals: RocksDB tuning, compaction, write amplification
  • CPU optimization: C-states impact (labs showing 5x degradation), NUMA
  • Network: 100GbE patterns, TCP tuning, nf_conntrack
  • NVMe-specific: reactor tuning, bdevs_per_cluster optimization
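To give a feel for this module's material, here is a minimal tuning sketch. All values are illustrative placeholders, not recommendations from the course; the commands assume root on an OSD node with kernel-tools installed and a running cluster:

```shell
# Check which CPU idle states (C-states) are enabled -- deep C-states
# add wakeup latency to OSD request handling.
cpupower idle-info

# Pin latency-sensitive OSD nodes to the performance governor.
cpupower frequency-set -g performance

# Raise conntrack limits so a busy 100GbE node does not drop
# connections (value is illustrative -- size it for your peer count).
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Give BlueStore OSDs a larger memory target (8 GiB here, illustrative)
# so the RocksDB block cache and onode cache have room to breathe.
ceph config set osd osd_memory_target 8589934592
```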
🌅

Afternoon: Forensic Troubleshooting

  • Diagnostic toolchain: blktrace, perf, objectstore-tool
  • Real case studies: NVMe degradation, post-upgrade OSD flapping
  • Advanced PG lifecycle: stuck states, manual intervention
  • Labs: Cluster with real problems to diagnose
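A sketch of the diagnostic toolchain named above. The OSD id, data path, and block device are placeholders, and `ceph-objectstore-tool` requires the target OSD to be stopped first:

```shell
# List PGs stuck in unclean states -- the starting point for most
# "cluster won't heal" investigations.
ceph pg dump_stuck unclean

# Inspect a stopped OSD's on-disk state directly (path varies with
# your deployment; containerized clusters mount it differently).
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs

# Trace block I/O on an OSD device for 10 seconds, then summarize,
# to spot latency outliers below the Ceph layer.
blktrace -d /dev/nvme0n1 -w 10 -o trace
blkparse -i trace | less
```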
02

Disaster Recovery, Multi-Site & Petabyte Scaling

Extreme recovery and multi-site architectures

☀️

Morning: Advanced DR

  • Edinburgh 40TB case: complete error chain and recovery procedures
  • CephFS disasters: metadata corruption, MDS failure handling
  • RBD mirroring: pool vs image-based, failover automation
  • Physical DR: disk extraction, journal, whoami preservation
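The RBD mirroring workflow covered in this block can be sketched as follows. Pool and image names are hypothetical, and exact steps depend on your mirroring mode and Ceph release:

```shell
# Enable image-mode mirroring on a pool (per-image opt-in, as opposed
# to pool mode, which mirrors every journaled image).
rbd mirror pool enable mypool image

# Opt one image into snapshot-based mirroring.
rbd mirror image enable mypool/vm-disk-1 snapshot

# Failover at the secondary site: the dead primary cannot be demoted
# cleanly, so force-promote the replica and resync later.
rbd mirror image promote --force mypool/vm-disk-1
```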
🌅

Afternoon: Multi-Site & Petabytes

  • RGW multisite: master zone failure, manual promotion, sync fairness
  • WAN planning: formulas for 1 GbE per 8TB daily ingest
  • Petabyte challenges: CERN 30PB (7,200 OSDs), 310M objects
  • Labs: Multi-site failover and recovery simulation
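The WAN sizing rule above (roughly 1 GbE per 8 TB of daily ingest) can be checked with quick arithmetic. This is our sketch, not a formula from the course materials:

```python
def required_gbps(tb_per_day: float, overhead: float = 1.0) -> float:
    """Sustained WAN bandwidth (Gbit/s) needed to replicate a daily
    ingest volume, with an optional multiplier for protocol and
    replication overhead. Assumes 1 TB = 1e12 bytes, 1 day = 86,400 s."""
    bits_per_day = tb_per_day * 1e12 * 8
    return bits_per_day / 86_400 / 1e9 * overhead

# 8 TB/day needs ~0.74 Gbit/s sustained, so a 1 GbE link fits with
# roughly 26% headroom for bursts and catch-up after outages.
print(round(required_gbps(8), 2))   # 0.74
```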
03

Security, AI/ML Workloads & Cost Engineering

Enterprise security and optimization for modern workloads

🔒

Morning: Security Hardening

  • Encryption: LUKS/dmcrypt OSDs, msgr2 secure, RGW SSE-S3/KMS
  • Key management: rotation (Squid 19.2.3+), Barbican integration
  • Compliance: HIPAA architecture, GDPR, audit logging
  • Threat detection: monitoring patterns, vulnerability management
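As an illustration of the msgr2 secure mode mentioned above, a config fragment (not a complete hardening checklist; apply during a maintenance window and verify client compatibility first):

```shell
# Require encrypted msgr2 transport for daemon-to-daemon, service,
# and client traffic (default allows crc-only framing).
ceph config set global ms_cluster_mode secure
ceph config set global ms_service_mode secure
ceph config set global ms_client_mode secure

# At-rest encryption is set at OSD creation time, e.g. via
# "encrypted: true" in a cephadm OSD service spec (LUKS/dmcrypt).
```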
🤖

Afternoon: AI/ML & ROI Engineering

  • S3 Select: Trino integration (2.5x-9x performance), analytics pushdown
  • AI/ML patterns: checkpointing, parallel access optimization
  • TCO analysis: EC efficiency, commodity hardware savings
  • Hybrid architectures: OpenStack DCN, edge-to-core, multi-cloud
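The erasure-coding-versus-replication efficiency argument behind the TCO analysis can be made concrete with simple arithmetic (our sketch, not course material):

```python
def usable_ratio_ec(k: int, m: int) -> float:
    """Usable fraction of raw capacity for a k+m erasure-coded pool."""
    return k / (k + m)

def usable_ratio_replica(size: int) -> float:
    """Usable fraction of raw capacity for size-way replication."""
    return 1 / size

# A 4+2 EC pool stores 1 TB of data in 1.5 TB raw (66.7% efficient),
# versus 3 TB raw for 3x replication (33.3%) -- half the raw capacity
# for the same usable data, at the cost of reconstruction overhead.
print(round(usable_ratio_ec(4, 2), 3))     # 0.667
print(round(usable_ratio_replica(3), 3))   # 0.333
```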
🧪

Lab Specifications

Realistic enterprise cloud infrastructure

🖥️ Infrastructure

  • Real 5-6 node cluster
  • 500GB+ pre-populated data per student
  • 24/7 access for 7+ days post-course

⚠️ Real Scenarios

  • Disk failures & network partitions
  • Simulated metadata corruption
  • Injected performance degradation

🔧 Tools

  • blktrace, perf, objectstore-tool
  • Pre-installed debugging symbols
  • Real datasets with I/O patterns

🐧 Supported Distributions and Versions

Available distributions:

  • Rocky Linux 9
  • Ubuntu 24.04 LTS
  • Red Hat Enterprise Linux

Ceph versions:

  • Upstream Squid 19.2+
  • IBM Storage Ceph 7.1
  • Red Hat Ceph Storage 7.x
📅

Upcoming Sessions

Intensive 3-day training designed for small groups (maximum 10 participants) to maximize interaction and collaborative troubleshooting.
🏢

In-Person

At our facilities with full access to labs and specialized equipment

Hands-on Experience
🚀

On-Site

At your organization for teams of 4+ people with customized configuration

Team Training
🌐

Remote

With dedicated cloud lab and full access to real-time practice resources

Cloud Access
💪

Ready to handle critical scenarios with confidence?

Request information about upcoming dates, detailed curriculum, and terms. Response guaranteed within 24 hours.

Or call us directly and we'll answer your questions

📧 Email Support
💬 Live Chat
📅 Flexible Scheduling

Ceph technical training

Ceph Storage - The most comprehensive series of courses on the market

Ceph Administration

Fundamentals and deployment

See course →
Ceph Advanced

Advanced configuration and EX260

See course →
Ceph Production Operations

Troubleshooting and DR

You are in the course →

Request this course

Frequently Asked Questions

Do I need to have completed the previous courses?

It’s not mandatory, but you DO need equivalent knowledge. This course assumes you master: Ceph architecture (MON/OSD/MGR), pool/PG/CRUSH management, basic troubleshooting, and have practical experience managing clusters in production (2+ years or equivalent courses). If you completed our basic and advanced courses, you’re perfectly prepared.

Do I need a certification to enroll?

Certification is NOT a requirement. What matters is your real practical experience. If you’ve been administering Ceph in production for years, with or without managed services, and know the fundamental concepts well, this course is for you. In fact, many of our best students don’t have certification but bring real production problems that we solve together.

Which distribution and Ceph version do the labs use?

The course is completely distribution-independent. Labs can be configured with IBM Storage Ceph, upstream Ceph (Squid 19.2+), Red Hat Ceph Storage, or whichever version you prefer. The troubleshooting, DR, and optimization techniques we teach are universal – they work the same on Rocky Linux, Ubuntu, RHEL or Alma Linux. You decide which configuration best matches your production environment.

How does this course differ from the advanced course?

The advanced course covers deployment, advanced configuration, and preparation for EX260. This third course focuses 100% on critical production operations: forensic troubleshooting when everything fails, REAL disaster recovery (not simulations), advanced performance engineering, and complex multi-factor scenarios. They’re complementary – think of the advanced course as “how to configure it well” and this one as “what to do when it fails badly”.

What equipment do I need?

Laptop with SSH client, modern web browser, and stable internet access. The complete lab runs on enterprise cloud infrastructure – you don’t need to install anything locally. We recommend 16GB RAM and a large screen (or dual monitors) to manage multiple terminals and windows simultaneously during troubleshooting.

Can the course be taken remotely?

Yes. We offer three modalities: (1) In-person at our facilities for maximum interaction, (2) On-site at your organization for teams of 4+ people, and (3) Remote with dedicated cloud lab. The remote modality includes all the same practices and 24/7 lab access. Contact us to discuss which modality best fits your needs.

Do you provide an official certification?

We issue a completion certificate with details of content and hours completed. We currently don’t offer our own certification because the market still values demonstrable experience and vendor certifications (EX260, etc.) more. However, the skills you acquire here are verifiable in technical interviews and real situations, which is what really counts.

What if I get stuck in the labs?

The labs are designed to challenge, not to frustrate. We work in small groups with direct instructor support. If you get stuck, that’s part of the learning – we analyze together where you failed and why. The goal is for you to leave prepared for real scenarios, not to “pass” academic exercises. You maintain lab access for 7 days post-course to practice at your own pace.

Ceph course
SIXE