Advanced Training

Ceph Production Operations | Course

When a 200TB cluster crashes at 3AM, you need answers, not theory

3 DAYS Intensive

100% Hands-on

REAL Scenarios

Distribution-agnostic

IBM Storage Ceph, Red Hat, Ubuntu, Rocky, Alma Linux, or upstream Ceph

3:00 AM: CLUSTER CRITICAL

OSD Failure: 12 OSDs down

CephFS: Metadata corrupt

Performance: IOPS -80%

Recovery: Plan active

You'll learn to solve:

Critical failures in 200TB+ clusters

Recovery of 40TB corrupted CephFS

Extreme tuning for AI/ML (500TB/day)

Troubleshooting under 24/7 pressure

Who is this for?

Certified administrators or those with production experience who need to master real-world critical scenarios that vendors don't teach.

Course structure

Intensive 3-day program designed to tackle real-world crises and optimize production clusters at petabyte scale

Day 1: Advanced Performance Engineering & Forensics

From architecture to real forensic troubleshooting

Morning: Architectural Optimization

• BlueStore internals: RocksDB tuning, compaction, write amplification
• CPU optimization: C-states impact (labs showing 5x degradation), NUMA
• Network: 100GbE patterns, TCP tuning, nf_conntrack
• NVMe-specific: reactor tuning, bdevs_per_cluster optimization

Afternoon: Forensic Troubleshooting

• Diagnostic toolchain: blktrace, perf, ceph-objectstore-tool
• Real case studies: NVMe degradation, post-upgrade OSD flapping
• Advanced PG lifecycle: stuck states, manual intervention
• Labs: Cluster with real problems to diagnose
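As a rough illustration of the PG triage step mentioned above, the sketch below filters a list of placement groups down to those that are not both active and clean. The input shape (dicts with `pgid` and `state` keys), the function name `pgs_needing_attention`, and the sample data are assumptions made for this example, not course material or real cluster output.

```python
from typing import Dict, List


def pgs_needing_attention(pgs: List[dict]) -> Dict[str, List[str]]:
    """Return {pgid: [state flags]} for every PG that is not active and clean.

    Ceph reports a PG state as flags joined by '+', for example
    'active+undersized+degraded'. A PG is treated as healthy here only
    if its flags include both 'active' and 'clean'.
    """
    result = {}
    for pg in pgs:
        flags = set(pg["state"].split("+"))
        if not {"active", "clean"} <= flags:
            result[pg["pgid"]] = sorted(flags)
    return result


# Hypothetical sample data, not taken from a real cluster.
sample = [
    {"pgid": "2.1a", "state": "active+undersized+degraded"},
    {"pgid": "2.1b", "state": "active+clean"},
    {"pgid": "3.4", "state": "stale+peering"},
    {"pgid": "3.5", "state": "active+clean+scrubbing+deep"},
]

for pgid, flags in pgs_needing_attention(sample).items():
    print(pgid, "+".join(flags))
```

A filter like this only narrows the list; deciding whether a PG is genuinely stuck still depends on how long it has stayed in that state.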

Day 2: Disaster Recovery, Multi-Site & Petabyte Scaling

Extreme recovery and multi-site architectures

Morning: Advanced DR

• Edinburgh 40TB case: complete error chain and recovery procedures
• CephFS disasters: metadata corruption, MDS failure handling
• RBD mirroring: pool vs image-based, failover automation
• Physical DR: disk extraction, journal, whoami preservation

Afternoon: Multi-Site & Petabytes

• RGW multisite: master zone failure, manual promotion, sync fairness
• WAN planning: link-sizing formulas (rule of thumb: 1 GbE per 8TB of daily ingest)
• Petabyte challenges: CERN 30PB (7,200 OSDs), 310M objects
• Labs: Multi-site failover and recovery simulation
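The WAN rule of thumb above (1 GbE per 8TB of daily ingest) can be checked with simple arithmetic. This is a minimal sketch assuming decimal terabytes (10^12 bytes), ingest spread evenly over 24 hours, and no protocol overhead; the function name and the `utilization` parameter are illustrative, not part of the course material.

```python
SECONDS_PER_DAY = 86_400


def required_mbps(tb_per_day: float, utilization: float = 1.0) -> float:
    """Sustained link rate in Mbit/s needed to replicate `tb_per_day`.

    `utilization` is the fraction of the link you are willing to fill
    (1.0 means the link is fully saturated around the clock).
    """
    bits_per_day = tb_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e6 / utilization


print(round(required_mbps(8), 1))        # 740.7
print(round(required_mbps(8, 0.75), 1))  # 987.7
```

Under these assumptions, 8TB per day needs about 741 Mbit/s sustained, so a 1 GbE link runs at roughly 74% utilization. Bursty ingest or catch-up after an outage would need more headroom.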

Day 3: Security, AI/ML Workloads & Cost Engineering

Enterprise security and optimization for modern workloads

Morning: Security Hardening

• Encryption: LUKS/dmcrypt OSDs, msgr2 secure mode, RGW SSE-S3/SSE-KMS
• Key management: rotation (Squid 19.2.3+), Barbican integration
• Compliance: HIPAA architecture, GDPR, audit logging
• Threat detection: monitoring patterns, vulnerability management

Afternoon: AI/ML & ROI Engineering

• S3 Select: Trino integration (2.5x-9x performance), analytics pushdown
• AI/ML patterns: checkpointing, parallel access optimization
• TCO analysis: EC efficiency, commodity hardware savings
• Hybrid architectures: OpenStack DCN, edge-to-core, multi-cloud
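To make the "EC efficiency" item in the TCO bullet concrete, the sketch below compares raw capacity needed under replication and under erasure coding. It assumes the standard relationship that a k+m erasure-coded pool stores k data chunks plus m coding chunks, and it ignores full-ratio headroom and metadata overhead. The function names and the chosen profiles are illustrative only.

```python
def usable_fraction_replica(size: int) -> float:
    """Fraction of raw capacity that holds unique data with `size` replicas."""
    return 1 / size


def usable_fraction_ec(k: int, m: int) -> float:
    """Fraction of raw capacity that holds unique data in a k+m EC pool."""
    return k / (k + m)


def raw_needed_tb(usable_tb: float, fraction: float) -> float:
    """Raw capacity required to store `usable_tb` of data."""
    return usable_tb / fraction


usable = 200  # TB of data to store
for label, fraction in [
    ("replica x3", usable_fraction_replica(3)),
    ("EC 4+2", usable_fraction_ec(4, 2)),
    ("EC 8+3", usable_fraction_ec(8, 3)),
]:
    print(f"{label}: {raw_needed_tb(usable, fraction):.0f} TB raw")
```

With these assumptions, 200TB of data needs about 600TB raw with three replicas, 300TB with EC 4+2, and 275TB with EC 8+3. The trade-off is higher CPU cost and slower recovery for erasure-coded pools.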

Lab specifications

Realistic enterprise cloud infrastructure

🖥️ Infrastructure

• Real 5-6 node cluster
• 500GB+ pre-populated data per student
• 24/7 access for 7+ days post-course

⚠️ Real scenarios

• Disk failures & network partitions
• Simulated metadata corruption
• Injected performance degradation

🔧 Tools

• blktrace, perf, ceph-objectstore-tool
• Pre-installed debugging symbols
• Real datasets with I/O patterns

Supported distributions and versions

Available distributions:

• Rocky Linux 9
• Ubuntu 24.04 LTS
• Red Hat Enterprise Linux

Ceph versions:

• Upstream Squid 19.2+
• IBM Storage Ceph 7.1
• Red Hat Ceph Storage 7.x

Upcoming Sessions

Intensive 3-day training designed for small groups (maximum 10 participants) to maximize interaction and collaborative troubleshooting

In-Person

At our facilities with full access to labs and specialized equipment

On-Site

At your organization for teams of 4+ people with customized configuration

Remote

With dedicated cloud lab and full access to real-time practice resources

Ready to stop fearing critical scenarios?

Request information about upcoming dates, detailed curriculum, and terms. Response guaranteed within 24 hours.

Request Full Information

Or call us directly with your questions

+34 91 198 02 43

FAQ

Frequently Asked Questions

Do I need to have taken your previous courses?

What if I don't have EX260 certification?

Which Ceph distribution do you use?

How does it differ from your advanced course?

What equipment do I need?

Do you offer remote modality?

Is there a certificate or accreditation?

What if I can't solve the labs?
