Getting started with Kubernetes chaos engineering: Our mini chaos engineering test

8 January 2025

Natan Depauw

Key Takeaways

  • Chaos engineering doesn’t have to start big – begin with simple pod termination tests in a controlled environment.
  • Having a structured framework is essential before introducing any chaos, starting with defining your steady state.
  • Chaos Mesh provides a developer-friendly way to experiment with chaos in Kubernetes through its dashboard and custom resource definitions.
  • Proper monitoring needs to be in place before starting chaos experiments to understand system behavior and recovery patterns.
  • Team communication is crucial – everyone needs to understand what’s happening when chaos is introduced to the system.

Ever had that sinking feeling when a production incident hits and your team scrambles to understand what went wrong? We’ve all been there. At FlowFactor, we believe in tackling these challenges head-on, which is why our engineers recently experimented with chaos engineering in a Kubernetes environment. Here’s what we learned about taking those first steps into controlled chaos.

What is Chaos Engineering (And Why Should You Care)?

Chaos engineering is the discipline of experimenting on distributed systems to build confidence in their capability to withstand unexpected conditions. Think of it as a fire drill for your infrastructure – but instead of evacuating a building, you’re purposefully introducing controlled failures to understand how your system responds.

But let’s be real: while Netflix’s Chaos Monkey might grab headlines, most teams need to start smaller and more focused. That’s exactly what we set out to explore.

Our framework: A structured approach to chaos engineering

Before diving into experiments, we developed a structure to ensure our chaos engineering efforts would be methodical and valuable. Here’s our step-by-step approach:

  1. Define Steady State

    • Establish what normal looks like for your system
    • Document key metrics and behaviors
    • Set clear baseline measurements 
  2. Define Framework 
    • Choose appropriate tools
    • Focus on Kubernetes capabilities
    • Ensure self-service possibilities 
  3. What Chaos to Introduce 
    • Start with basic pod chaos
    • Consider introducing HTTP and JVM faults
    • Plan timed disruptions 
  4. Choose Environment 
    • Begin with test environments to capture the low-hanging fruit
    • Move to production-like environments when ready
    • Always use dry-runs for validation 
  5. Test Deviation from Steady State 
    • Monitor system responses
    • Track recovery patterns
    • Document unexpected behaviors 
  6. Automate 
    • Implement schedules within the cluster
    • Create defined workflows (see the sketch after this list)
    • Consider CI/CD integration
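
To make step 6 concrete, here is a minimal sketch of an automated chaos run expressed as a Chaos Mesh Workflow resource. The names, namespace, target and durations below are placeholders we invented for illustration, not part of any real setup:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: nightly-resilience-check    # hypothetical name
  namespace: chaos-testing          # hypothetical namespace
spec:
  entry: entry
  templates:
    # The entry template runs its children one after another
    - name: entry
      templateType: Serial
      deadline: 5m
      children:
        - kill-one-pod
    # A single PodChaos step that kills one pod matching the selector
    - name: kill-one-pod
      templateType: PodChaos
      deadline: 1m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces:
            - demo                  # hypothetical target namespace
```

Because a Workflow is just another Kubernetes manifest, it can be applied from a CI/CD pipeline like any other resource, which is what makes the integration mentioned above straightforward.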

Choosing Our Weapons: Chaos Mesh

After evaluating several options, we landed on Chaos Mesh, a CNCF incubator project. Here’s why:

  • Developer-friendly interface with clear experiment definitions
  • Native Kubernetes integration through Custom Resource Definitions (CRDs)
  • Support for both simple and complex chaos scenarios
  • Built-in scheduling and workflow capabilities
  • Possibility to target resources based on labels and annotations (see the sketch after this list)

Hands-On Demo: Chaos engineering in action

To put theory into practice, we set up a demo environment using a microservices-based online boutique application. Here’s a walkthrough of our chaos experiments:

Basic Pod Chaos

  1. Pod Kill Experiment 
    • Used label selectors to target specific services
    • Started with the ad service as our test subject
    • Observed pod termination and restart behavior (both experiments are sketched as manifests after this list)
  2. Pod Failure Testing 
    • Implemented temporary failures lasting 10 seconds
    • Tested service resilience without complete termination
    • Monitored automatic recovery processes
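
Below are hedged sketches of what these two experiments can look like as Chaos Mesh resources. We are assuming the ad service pods carry an `app: adservice` label in a `boutique` namespace; adjust the selector to match your own deployment:

```yaml
# Pod kill: terminate one ad service pod and watch Kubernetes restart it
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: adservice-pod-kill
  namespace: chaos-testing          # hypothetical namespace for chaos resources
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - boutique                    # assumed application namespace
    labelSelectors:
      app: adservice                # assumed pod label
---
# Pod failure: make one ad service pod unavailable for 10 seconds without killing it
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: adservice-pod-failure
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  duration: "10s"
  selector:
    namespaces:
      - boutique
    labelSelectors:
      app: adservice
```

Applying the first manifest kills one matching pod immediately; the second keeps one pod in a failed state for the configured duration and then lets it recover on its own.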

Namespace-Wide Testing

We then expanded our experiments:

  1. Scheduled Chaos 
    • Set up experiments using cron scheduling
    • Randomly selected pods within a namespace
    • Observed how multiple services handled disruption (a Schedule sketch follows this list)
  2. Recovery Monitoring 
    • Watched how services recovered from failures
    • Documented restart behaviors
    • Identified areas needing monitoring improvements
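
For the scheduled part, a Chaos Mesh Schedule resource wraps an experiment in a cron expression. The sketch below is illustrative only: the cron expression, namespace and limits are values we picked for the example, not a recommendation:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: boutique-random-pod-kill    # hypothetical name
  namespace: chaos-testing          # hypothetical namespace
spec:
  schedule: "*/30 9-17 * * 1-5"     # every 30 minutes during business hours, weekdays only
  concurrencyPolicy: Forbid         # don't start a new run while one is still active
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one                       # pick one random matching pod each run
    selector:
      namespaces:
        - boutique                  # assumed application namespace
```
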

Implementation Learnings

What Worked Well

  1. Self-service Model: Developers could create and run experiments independently, since experiments are defined as standard custom resources
  2. Timed Disruptions: Ability to schedule chaos during business hours (because nobody wants 3 AM chaos)
  3. Dry-run Capability: Test experiment configurations without actual impact

Challenges We Faced

  1. OpenShift Integration: Initial setup required some tweaking for CRI-O support; this should work out of the box on Docker-based clusters (see the configuration sketch after this list)
  2. Monitoring Setup: Realizing the importance of having proper monitoring in place before starting chaos experiments
  3. Team Communication: Ensuring everyone understands what’s happening when chaos is introduced
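
On the CRI-O point, one common tweak is to tell the chaos daemon which container runtime and socket to use at install time. A hedged sketch of the relevant Helm values (double-check the Chaos Mesh installation docs for your version) looks like this:

```yaml
# values.yaml fragment for installing Chaos Mesh on a CRI-O based cluster (e.g. OpenShift)
chaosDaemon:
  runtime: crio
  socketPath: /var/run/crio/crio.sock
```
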

Moving Forward

Remember: chaos engineering isn't about creating problems; it's about uncovering and fixing them proactively. Start small, be methodical, and gradually build your chaos confidence.
