Getting started with Kubernetes chaos engineering: Our mini chaos engineering test
8 January 2025
Natan Depauw
Key Takeaways
- Chaos engineering doesn’t have to start big – begin with simple pod termination tests in a controlled environment.
- Having a structured framework is essential before introducing any chaos, starting with defining your steady state.
- Chaos Mesh provides a developer-friendly way to experiment with chaos in Kubernetes through its dashboard and custom resource definitions.
- Proper monitoring needs to be in place before starting chaos experiments to understand system behavior and recovery patterns.
- Team communication is crucial – everyone needs to understand what’s happening when chaos is introduced to the system.
Ever had that sinking feeling when a production incident hits and your team scrambles to understand what went wrong? We’ve all been there. At FlowFactor, we believe in tackling these challenges head-on, which is why our engineers recently experimented with chaos engineering in a Kubernetes environment. Here’s what we learned about taking those first steps into controlled chaos.
What is Chaos Engineering (And Why Should You Care)?
Chaos engineering is the discipline of experimenting on distributed systems to build confidence in their capability to withstand unexpected conditions. Think of it as a fire drill for your infrastructure – but instead of evacuating a building, you’re purposefully introducing controlled failures to understand how your system responds.
But let’s be real: while Netflix’s Chaos Monkey might grab headlines, most teams need to start smaller and more focused. That’s exactly what we set out to explore.
Our framework: A structured approach to chaos engineering
Before diving into experiments, we developed a structure to ensure our chaos engineering efforts would be methodical and valuable. Here’s our step-by-step approach:
- Define Steady State
- Establish what normal looks like for your system
- Document key metrics and behaviors
- Set clear baseline measurements
- Define Framework
- Choose appropriate tools
- Focus on Kubernetes capabilities
- Ensure self-service possibilities
- What Chaos to Introduce
- Start with basic pod chaos
- Consider HTTP and JVM fault injection
- Plan timed disruptions
- Choose Environment
- Begin with test environments to pick the low-hanging fruit
- Move to production-like environments when ready
- Always use dry-runs for validation
- Test Deviation from Steady State
- Monitor system responses
- Track recovery patterns
- Document unexpected behaviors
- Automate
- Implement schedules within the cluster
- Create defined workflows
- Consider CI/CD integration
Choosing Our Weapons: Chaos Mesh
After evaluating several options, we landed on Chaos Mesh, a CNCF incubator project. Here’s why:
- Developer-friendly interface with clear experiment definitions
- Native Kubernetes integration through Custom Resource Definitions (CRDs)
- Support for both simple and complex chaos scenarios
- Built-in scheduling and workflow capabilities (see the workflow sketch after this list)
- Ability to target resources based on labels and annotations
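As a taste of what these CRDs look like, here is a minimal sketch of a Chaos Mesh Workflow that chains two pod-level faults in sequence. Every name, namespace, and label below is a placeholder rather than our actual configuration; the individual experiments are walked through in the next section.

```yaml
# Hypothetical Workflow: run a pod-kill step, then a pod-failure step, in order.
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: example-chaos-workflow
  namespace: chaos-testing
spec:
  entry: entry                    # name of the template the workflow starts with
  templates:
    - name: entry
      templateType: Serial        # run the children one after another
      deadline: 240s              # upper bound for the whole workflow
      children:
        - kill-one-pod
        - fail-one-pod
    - name: kill-one-pod
      templateType: PodChaos
      deadline: 30s               # how long this step may run
      podChaos:
        action: pod-kill
        mode: one                 # one random pod from the selection
        selector:
          labelSelectors:
            app: example-service  # placeholder label
    - name: fail-one-pod
      templateType: PodChaos
      deadline: 30s
      podChaos:
        action: pod-failure       # make the pod unavailable instead of deleting it
        mode: one
        selector:
          labelSelectors:
            app: example-service
```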
Hands-On Demo: Chaos engineering in action
To put theory into practice, we set up a demo environment using a microservices-based online boutique application. Here’s a walkthrough of our chaos experiments:
Basic Pod Chaos
- Pod Kill Experiment
- Used label selectors to target specific services
- Started with the ad service as our test subject
- Observed pod termination and restart behavior
- Pod Failure Testing
- Implemented temporary failures lasting 10 seconds (see the manifest sketches after this list)
- Tested service resilience without complete termination
- Monitored automatic recovery processes
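Here is a sketch of what these two experiments can look like as Chaos Mesh manifests. The namespace and the app: adservice label are assumptions based on the online boutique demo; adjust the selector to match your deployment.

```yaml
# Pod-kill sketch: terminate one ad service pod and watch it come back.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: adservice-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill              # delete the selected pod outright
  mode: one                     # one random matching pod
  selector:
    namespaces:
      - boutique                # assumed namespace of the demo application
    labelSelectors:
      app: adservice
---
# Pod-failure sketch: make the pod unavailable for 10 seconds, then let it recover.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: adservice-pod-failure
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  duration: "10s"               # failure window before automatic recovery
  selector:
    namespaces:
      - boutique
    labelSelectors:
      app: adservice
```

Applying a manifest (for example with kubectl apply -f) starts the experiment; deleting the resource, or letting the duration elapse, ends it.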
Namespace-Wide Testing
We then expanded our experiments:
- Scheduled Chaos
- Set up experiments using cron scheduling (see the Schedule sketch after this list)
- Randomly selected pods within a namespace
- Observed how multiple services handled disruption
- Recovery Monitoring
- Watched how services recovered from failures
- Documented restart behaviors
- Identified areas needing monitoring improvements
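Scheduled chaos like this can be expressed with a Chaos Mesh Schedule resource. A minimal sketch, assuming a business-hours cron expression and a boutique namespace (both placeholders):

```yaml
# Hypothetical Schedule: once per working hour on weekdays,
# kill one randomly selected pod anywhere in the target namespace.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: business-hours-pod-kill
  namespace: chaos-testing
spec:
  schedule: "0 9-16 * * 1-5"    # cron: hourly between 09:00 and 16:00, Mon-Fri
  type: PodChaos                # which chaos kind each run creates
  historyLimit: 5               # keep only the last few finished experiments
  concurrencyPolicy: Forbid     # never start a new run while one is still active
  podChaos:
    action: pod-kill
    mode: one                   # one random pod per run
    selector:
      namespaces:
        - boutique              # any pod in this namespace is fair game
```

The Forbid concurrency policy keeps runs from stacking up, which keeps the blast radius of each scheduled disruption predictable.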
Implementation Learnings
What Worked Well
- Self-service Model: Developers could create and run experiments independently, since each type of chaos is defined as a Kubernetes custom resource (a minimal RBAC sketch follows this list)
- Timed Disruptions: Ability to schedule chaos during business hours (because nobody wants 3 AM chaos)
- Dry-run Capability: Test experiment configurations without actual impact
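One way to keep that self-service model safe is plain Kubernetes RBAC. Below is a sketch that grants a developer group the right to manage pod chaos experiments and schedules in a single namespace; the group name, namespace, and resource list are assumptions, not our production configuration.

```yaml
# Hypothetical Role/RoleBinding for self-service chaos in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-experimenter
  namespace: chaos-testing
rules:
  - apiGroups: ["chaos-mesh.org"]
    resources: ["podchaos", "schedules"]   # extend with other chaos kinds as needed
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-experimenter-binding
  namespace: chaos-testing
subjects:
  - kind: Group
    name: developers            # placeholder group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: chaos-experimenter
  apiGroup: rbac.authorization.k8s.io
```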
Challenges We Faced
- OpenShift Integration: Initial setup required some tweaking for CRI-O support (see the values sketch after this list); this should work out of the box for Docker.
- Monitoring Setup: Realizing the importance of having proper monitoring in place before starting chaos experiments
- Team Communication: Ensuring everyone understands what’s happening when chaos is introduced
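For reference, the CRI-O tweak boils down to pointing the chaos daemon at the right container runtime socket. A sketch of the Helm values involved, as we understand the Chaos Mesh chart options; double-check the socket path for your distribution:

```yaml
# Hypothetical Helm values for a CRI-O based cluster such as OpenShift.
# Docker-based clusters can usually keep the chart defaults.
chaosDaemon:
  runtime: crio                          # container runtime the daemon talks to
  socketPath: /var/run/crio/crio.sock    # CRI-O socket on each node
```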
Moving Forward
Remember: chaos engineering isn't about creating problems, it's about surfacing and fixing them proactively. Start small, be methodical, and gradually build your chaos confidence.