by @diptanu (@diptanu) on Thursday, 25 February 2016
- Full talk
- Technical level
The talk introduces Chaos Engineering and explains how complex distributed systems fail in large-scale internet services. It also discusses design patterns for building highly resilient distributed systems that can heal from transient failures.
Complex distributed systems are hard to operate and have very complex failure modes. In this talk, we discuss how to build confidence in large-scale distributed systems by introducing random but controlled failures into them in production, observing how services degrade, and working towards healing and recovering from failures automatically. We also discuss patterns and techniques for designing highly available and resilient distributed systems.
Diptanu is a Senior Engineer at HashiCorp, where he works on large-scale distributed systems, cluster schedulers, service discovery, and highly available, high-throughput systems on the public cloud. He is a core committer to the Nomad cluster manager, which has a parallel and distributed scheduler and supports heterogeneous virtualized workloads.
Prior to HashiCorp, Diptanu was in the Cloud Platform group at Netflix, where he worked on the core platform infrastructure that powered Netflix's microservices. He worked on Apache Mesos and wrote a cluster scheduler for running clusters of Docker containers on AWS, and also contributed to various reactive IPC and service discovery infrastructure projects.