Getting started with Kubernetes chaos engineering: Our mini chaos engineering test
8 January 2025
Natan Depauw
Key Takeaways
- Chaos engineering doesn’t have to start big – begin with simple pod termination tests in a controlled environment.
- Having a structured framework is essential before introducing any chaos, starting with defining your steady state.
- Chaos Mesh provides a developer-friendly way to experiment with chaos in Kubernetes through its dashboard and custom resource definitions.
- Proper monitoring needs to be in place before starting chaos experiments to understand system behavior and recovery patterns.
- Team communication is crucial – everyone needs to understand what’s happening when chaos is introduced to the system.
Ever had that sinking feeling when a production incident hits and your team scrambles to understand what went wrong? We’ve all been there. At FlowFactor, we believe in tackling these challenges head-on, which is why our engineers recently experimented with chaos engineering in a Kubernetes environment. Here’s what we learned about taking those first steps into controlled chaos.
What is Chaos Engineering (And Why Should You Care)?
Chaos engineering is the discipline of experimenting on distributed systems to build confidence in their capability to withstand unexpected conditions. Think of it as a fire drill for your infrastructure – but instead of evacuating a building, you’re purposefully introducing controlled failures to understand how your system responds.
But let’s be real: while Netflix’s Chaos Monkey might grab headlines, most teams need to start smaller and more focused. That’s exactly what we set out to explore.
Our framework: A structured approach to chaos engineering
Before diving into experiments, we developed a structure to ensure our chaos engineering efforts would be methodical and valuable. Here’s our step-by-step approach:
- Define Steady State
- Establish what normal looks like for your system
- Document key metrics and behaviors
- Set clear baseline measurements
- Define Framework
- Choose appropriate tools
- Focus on Kubernetes capabilities
- Ensure self-service possibilities
- What Chaos to Introduce
- Start with basic pod chaos
- Consider HTTP and JVM fault injection
- Plan timed disruptions
- Choose Environment
- Begin with test environments to pick the low-hanging fruit
- Move to production-like environments when ready
- Always use dry-runs for validation
- Test Deviation from Steady State
- Monitor system responses
- Track recovery patterns
- Document unexpected behaviors
- Automate
- Implement schedules within the cluster
- Create defined workflows
- Consider CI/CD integration
Choosing Our Weapons: Chaos Mesh
After evaluating several options, we landed on Chaos Mesh, a CNCF incubator project. Here’s why:
- Developer-friendly interface with clear experiment definitions
- Native Kubernetes integration through Custom Resource Definitions (CRDs)
- Support for both simple and complex chaos scenarios
- Built-in scheduling and workflow capabilities (see the workflow sketch after this list)
- Ability to target resources based on labels and annotations
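As a taste of what these CRDs look like, here is a minimal sketch of a Chaos Mesh Workflow that chains two pod-level faults in sequence. Every name, namespace, and label below is a placeholder rather than our actual configuration; the individual experiments are walked through in the next section.

```yaml
# Hypothetical Workflow: run a pod-kill step, then a pod-failure step, in order.
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: example-chaos-workflow
  namespace: chaos-testing
spec:
  entry: entry                    # name of the template the workflow starts with
  templates:
    - name: entry
      templateType: Serial        # run the children one after another
      deadline: 240s              # upper bound for the whole workflow
      children:
        - kill-one-pod
        - fail-one-pod
    - name: kill-one-pod
      templateType: PodChaos
      deadline: 30s               # how long this step may run
      podChaos:
        action: pod-kill
        mode: one                 # one random pod from the selection
        selector:
          labelSelectors:
            app: example-service  # placeholder label
    - name: fail-one-pod
      templateType: PodChaos
      deadline: 30s
      podChaos:
        action: pod-failure       # make the pod unavailable instead of deleting it
        mode: one
        selector:
          labelSelectors:
            app: example-service
```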
Hands-On Demo: Chaos engineering in action
To put theory into practice, we set up a demo environment using a microservices-based online boutique application. Here’s a walkthrough of our chaos experiments:
Basic Pod Chaos
- Pod Kill Experiment
- Used label selectors to target specific services
- Started with the ad service as our test subject
- Observed pod termination and restart behavior
- Pod Failure Testing
- Implemented temporary failures lasting 10 seconds (see the manifest sketches after this list)
- Tested service resilience without complete termination
- Monitored automatic recovery processes
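Here is a sketch of what these two experiments can look like as Chaos Mesh manifests. The namespace and the app: adservice label are assumptions based on the online boutique demo; adjust the selector to match your deployment.

```yaml
# Pod-kill sketch: terminate one ad service pod and watch it come back.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: adservice-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill              # delete the selected pod outright
  mode: one                     # one random matching pod
  selector:
    namespaces:
      - boutique                # assumed namespace of the demo application
    labelSelectors:
      app: adservice
---
# Pod-failure sketch: make the pod unavailable for 10 seconds, then let it recover.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: adservice-pod-failure
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  duration: "10s"               # failure window before automatic recovery
  selector:
    namespaces:
      - boutique
    labelSelectors:
      app: adservice
```

Applying a manifest (for example with kubectl apply -f) starts the experiment; deleting the resource, or letting the duration elapse, ends it.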
Namespace-Wide Testing
We then expanded our experiments:
- Scheduled Chaos
- Set up experiments using cron scheduling (see the Schedule sketch after this list)
- Randomly selected pods within a namespace
- Observed how multiple services handled disruption
- Recovery Monitoring
- Watched how services recovered from failures
- Documented restart behaviors
- Identified areas needing monitoring improvements
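Scheduled chaos like this can be expressed with a Chaos Mesh Schedule resource. A minimal sketch, assuming a business-hours cron expression and a boutique namespace (both placeholders):

```yaml
# Hypothetical Schedule: once per working hour on weekdays,
# kill one randomly selected pod anywhere in the target namespace.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: business-hours-pod-kill
  namespace: chaos-testing
spec:
  schedule: "0 9-16 * * 1-5"    # cron: hourly between 09:00 and 16:00, Mon-Fri
  type: PodChaos                # which chaos kind each run creates
  historyLimit: 5               # keep only the last few finished experiments
  concurrencyPolicy: Forbid     # never start a new run while one is still active
  podChaos:
    action: pod-kill
    mode: one                   # one random pod per run
    selector:
      namespaces:
        - boutique              # any pod in this namespace is fair game
```

The Forbid concurrency policy keeps runs from stacking up, which keeps the blast radius of each scheduled disruption predictable.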
Implementation Learnings
What Worked Well
- Self-service Model: Developers could create and run experiments independently, since each type of chaos is defined as a Kubernetes custom resource (a minimal RBAC sketch follows this list)
- Timed Disruptions: Ability to schedule chaos during business hours (because nobody wants 3 AM chaos)
- Dry-run Capability: Test experiment configurations without actual impact
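One way to keep that self-service model safe is plain Kubernetes RBAC. Below is a sketch that grants a developer group the right to manage pod chaos experiments and schedules in a single namespace; the group name, namespace, and resource list are assumptions, not our production configuration.

```yaml
# Hypothetical Role/RoleBinding for self-service chaos in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-experimenter
  namespace: chaos-testing
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["podchaos", "schedules"]   # extend with other chaos kinds as needed
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-experimenter-binding
  namespace: chaos-testing
subjects:
  - kind: Group
    name: developers            # placeholder group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: chaos-experimenter
  apiGroup: rbac.authorization.k8s.io
```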
Challenges We Faced
- OpenShift Integration: Initial setup required some tweaking for CRI-O support (see the values sketch after this list); this should work out of the box for Docker.
- Monitoring Setup: Realizing the importance of having proper monitoring in place before starting chaos experiments
- Team Communication: Ensuring everyone understands what’s happening when chaos is introduced
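For reference, the CRI-O tweak boils down to pointing the chaos daemon at the right container runtime socket. A sketch of the Helm values involved, as we understand the Chaos Mesh chart options; double-check the socket path for your distribution:

```yaml
# Hypothetical Helm values for a CRI-O based cluster such as OpenShift.
# Docker-based clusters can usually keep the chart defaults.
chaosDaemon:
  runtime: crio                          # container runtime the daemon talks to
  socketPath: /var/run/crio/crio.sock    # CRI-O socket on each node
```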
Moving Forward
Remember: chaos engineering isn't about creating problems, it's about surfacing and fixing them proactively. Start small, be methodical, and gradually build your chaos confidence.