Rootconf 2018

On scaling infrastructure and operations

How we scaled devops while we scaled 20x at SumoLogic

Submitted by Sarika Mohapatra (@sarikamm) on Saturday, 10 March 2018

Technical level

Beginner

Section

Full talk

Status

Submitted

Abstract

SumoLogic is a unified logs and metrics platform for monitoring, troubleshooting, and operational and security analytics. Ours is a cloud-based, multitenant microservices architecture that analyzes 100+ petabytes, ingests 100+ terabytes and serves 20+ million queries per day.

As devops engineers at SumoLogic, we design, build, deploy and monitor our services, all at the same time. Building for cloud scale in a multitenant microservices architecture that is constantly changing and evolving, and managing an ever-growing infrastructure while staying agile, has its unique challenges. Since our customers use SumoLogic for business-critical needs, it is paramount that our service is always up and running.

This talk is about the journey of scaling more than 20x, the engineering and operational challenges we faced, and how we run Sumo Search today successfully, cost-effectively and reliably while still being agile.

We will cover real case studies of what worked and what didn’t, go over our solutions and key lessons to build for scale. We will also cover how we do continuous testing and how we set up monitoring and alerting to run our services reliably.

Outline

  • Sumo - What we do and architecture overview
  • Biggest challenges in running Sumo Search reliably in the face of failing systems and unpredictability, with concrete examples
    • Failures and unpredictability
    • Multitenancy
    • Scale
    • Being reliable while being agile
    • Upkeep of existing services with minimal manual effort
    • Operational knowledge transfer (KT)
  • Solutions and lessons around the various dimensions of scale
    • Auto monitoring
    • Auto alerting
    • Auto remediation
    • Handling spikes
    • Blast radius control
    • Resource management
    • Configuration management
  • Continuous Testing
    • Continuous integration tests (ITs)
    • Performance & Reliability Testing
    • Shadow Testing
    • Dogfooding
  • How to set up your monitoring and troubleshooting system to meet your uptime goals and reduce your MTTR
    • Logs and metrics collection
    • Setting up monitoring and observability for everything!
    • Alerting, troubleshooting and remediation
    • Feedback loop: outages, postmortems and how they influence our infrastructure and system design
  • Key takeaways:
    • Lessons for designing resilient and scalable services and running them reliably in production

Requirements

N/A

Speaker bio

Sarika is a Senior Software Engineer at SumoLogic, where she is part of the search team that builds and runs a petabyte-scale, multitenant, cloud-based log search and analytics service.

Prior to Sumo, Sarika graduated from IIT Kanpur and worked in the Microsoft Apps Experience team, Microsoft Bing Search and Microsoft Research. In her spare time, she works on her IoT projects and Android apps.

Her passion lies in building high quality enterprise products that are indispensable to customers’ business. Besides building products, Sarika’s interests include sports, teaching, travel and pets!

Slides

https://docs.google.com/presentation/d/1ZsG_B14_5fCO7Qi9CYbKgNlDhbKs3MOBfICapQOkVnw/

Preview video

https://youtu.be/jt1Us8LLjao

Comments

  • Zainab Bawa (@zainabbawa) Reviewer 8 months ago

    Sarika, the slides aren’t accessible.

  • Sarika Mohapatra (@sarikamm) Proposer 8 months ago

    Gave permissions. Should be accessible now. Thanks!

  • Sarika Mohapatra (@sarikamm) Proposer 8 months ago

    Hi Zainab, is there any other information that you need? Please let me know if it would help to have more detailed slides. Thanks!
