How AppDynamics runs a multi-tenant Kubernetes+Helm cluster with continuous deployment & monitoring
Submitted by Prateek Agarwal (@prat0318) on Sunday, 11 March 2018
AppDynamics develops application performance management (APM) solutions that deliver problem resolution for highly distributed applications. Our platform dynamically collects millions of performance data points across users’ applications and infrastructure. Scaling our data platform architecture and making it reliable and fault-resilient is therefore crucial to the company’s success.
The talk starts with the scale at which our data platform operates and the pain points teams began to face as we onboarded more and more customers. It then covers the container orchestration frameworks we evaluated and why Kubernetes was chosen. Next, it discusses the design requirements that gave each platform subteam enough freedom and isolation to develop a service from scratch and deploy it to production on their own, significantly increasing team velocity. It also describes the workflow we developed as a result: a PaaS framework that lets different teams run their services as just another tenant on a multi-tenant Kubernetes cluster.
The talk then goes through the capabilities given to a newly onboarded team in the cluster. The common CI/CD pipeline and the alerting, monitoring and logging framework designed for the cluster can be used by every team independently of the others. The talk then showcases how we manage a canary-like Kubernetes setup for production deployments. It concludes with the lessons learned while building this workflow and while fighting a few production fires.
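To make the canary idea concrete, here is a minimal sketch of splitting traffic between a primary and a canary cluster, with the canary doubling as a standby. It is an illustration only, not the mechanism AppDynamics uses: the function name `pick_cluster`, the percentage-based split and the hash-based bucketing are all assumptions.

```python
import hashlib


def pick_cluster(request_id: str, canary_percent: int,
                 primary_healthy: bool = True) -> str:
    """Choose which cluster serves a request (hypothetical routing rule)."""
    # If the primary is down, the canary cluster acts as the standby.
    if not primary_healthy:
        return "canary"
    # Hash the request id into a stable bucket in the range 0..99 so the
    # same id is always routed to the same cluster.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "primary"
```

With `canary_percent=0` every request goes to the primary, and with `primary_healthy=False` every request falls back to the canary regardless of the configured percentage.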
- About the speaker and AppDynamics
- Scale at which the platform team operates
  - Millions of metrics uploaded per minute
  - Reliability and resiliency guarantees
- Pain points with the earlier architecture
  - Typical limitations of a monolith: team collisions, scalability
  - Zero-downtime upgrades were not possible
  - Frequent outages
- Why Kubernetes?
  - Comparison points with other orchestration frameworks
  - Why Kubernetes stood out among them
- Initial design requirements
  - Reduce boilerplate code when running a new service
  - Team resource isolation
  - Ability to run on AWS and on-premise
  - Provide CI/CD, logging, alerting and monitoring capabilities to the teams
- Workflow from a code commit to the final deployment
  - A pull request to the Helm chart repository
  - The TeamCity build pipeline runs a minikube cluster to verify cluster health after the PR
  - The chart artifact is uploaded to AWS S3 on a successful PR merge
  - An SQS event is triggered for each new S3 insert, which kicks off a Jenkins pipeline
  - Jenkins then automatically deploys the Helm chart to the staging environments
  - Production deployments still remain manual (discussed later)
- Alerting, logging and monitoring
  - How all container logs are collected and sent to Splunk
  - How we use Prometheus as well as AppDynamics monitoring to collect all application- and cluster-level metrics
  - How we use Alertmanager to route alerts to Slack, email and PagerDuty
- Production canary setup
  - How we route traffic between two Kubernetes clusters, one acting as canary and the other as primary
  - How it helps in testing production deployments
  - How it also acts as a standby to fall back on in case of an outage on the primary
- Lessons learned
  - Initial hurdles faced
  - Challenges in onboarding new teams
  - A few insights into the production issues seen
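The S3-to-SQS-to-Jenkins step in the workflow above can be sketched roughly as follows. This is a hypothetical illustration, not the pipeline's actual code: it assumes the SQS message body carries a standard S3 event notification, and the key layout `charts/<team>/<chart>-<version>.tgz`, the release naming and the one-namespace-per-team convention are invented for the example. The functions only build a `helm upgrade --install` command; they do not run it.

```python
import json
from urllib.parse import unquote_plus


def chart_objects(sqs_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) for each newly created .tgz object in an S3 event."""
    event = json.loads(sqs_body)
    found = []
    for record in event.get("Records", []):
        # Only react to object-creation events (Put, Post, Copy, ...).
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(".tgz"):
            found.append((bucket, key))
    return found


def helm_command(key: str) -> list[str]:
    """Build a helm invocation for a chart key (hypothetical layout)."""
    # Assumed key layout: charts/<team>/<chart>-<version>.tgz
    _, team, filename = key.split("/")
    chart_name = filename[: -len(".tgz")].rsplit("-", 1)[0]
    return [
        "helm", "upgrade", "--install",
        f"{team}-{chart_name}",  # release name (assumed convention)
        filename,                # chart archive, assumed already downloaded
        "--namespace", team,     # one namespace per tenant team (assumption)
    ]
```

A staging job could call `chart_objects` on each message it receives and pass every returned key to `helm_command`, while production would keep its manual approval step.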
- Basic knowledge of Kubernetes terms
Prateek is a Senior Software Engineer at AppDynamics and is part of the Platform Infrastructure team. His primary responsibilities include designing systems to help teams run their services smoothly on the Kubernetes cluster. He also takes care of automating cluster setup on both AWS and on-premise.
Prior to AppDynamics, Prateek completed his bachelor's degree at IIT Kharagpur and his master's degree at UT Austin. He has worked at IBM, Flipkart and Yelp.com as an infrastructure engineer.
His interests lie in distributed systems like Cassandra, Kafka, ZooKeeper and Elasticsearch, and in distributed tracing systems like Zipkin.