by Cyrus Dasadia (@extremeunix) on Monday, February 24, 2014

Status: Confirmed & Scheduled
Section
Crisp talk

Technical level
Beginner

Media

Objective

Get rid of pesky duplicate alerts and of fiddling with runbooks. Try the new CitoEngine! It eats up all your alerts, takes actions based on smart rules you define, and helps the environment[citation_needed].

Description

When you were young and roaming around at night, like any decent kid, your parents would probably call asking when you were coming home. Now even though you said you would be back in 10 minutes, your parents would nag you by calling every hour, again and again! Totally redundant, right?

Now that you are an adult, you feel much the same frustration when your systems are alerting. You keep getting paged and get calls from NOC, OpsGenie, Satan, PagerDuty, etc. to remind you of the impending doom because your '/var partition is at 100%'. If that's not enough, now your boss wants to know what's going on! You have to mute the alarms in a gazillion places, tell NOC that you are working on the issue, run some command from your server's bash_history and go back to sleep, until you get paged again for something else.

What if you had a system that would accept such alarms, invoke the tools and scripts to mitigate the problem, and clean up afterwards? What if there was a tool which, at the very least, would know when to page you and when to inform NOC, based on the number of times the alert came in? What if you didn't have to work out your notification policies in Nagios (eww!) but could let a genetically superior system take care of it? What if you could do all this with an open source application?
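The idea of choosing an action based on how many times an alert has come in can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CitoEngine's actual code or API: the class names (`EscalationRule`, `AlertAggregator`), the alert key `var_partition_full` and the action labels are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EscalationRule:
    """Hypothetical rule: what to do once an alert has repeated N times."""
    threshold: int  # fire when the occurrence count reaches exactly this value
    action: str     # illustrative label only, e.g. "inform_noc"


@dataclass
class AlertAggregator:
    """Counts repeated alerts and reports which actions each occurrence triggers."""
    rules: dict[str, list[EscalationRule]]
    counts: Counter = field(default_factory=Counter)

    def ingest(self, alert_key: str) -> list[str]:
        """Record one occurrence of an alert and return the actions it triggers."""
        self.counts[alert_key] += 1
        seen = self.counts[alert_key]
        return [r.action for r in self.rules.get(alert_key, []) if r.threshold == seen]

    def clear(self, alert_key: str) -> None:
        """Forget an alert once the underlying problem is resolved."""
        self.counts.pop(alert_key, None)


# Assumed policy: try a cleanup script first, tell NOC on the third
# occurrence, and only page a human on the fifth.
rules = {
    "var_partition_full": [
        EscalationRule(threshold=1, action="run_cleanup_script"),
        EscalationRule(threshold=3, action="inform_noc"),
        EscalationRule(threshold=5, action="page_oncall"),
    ]
}

engine = AlertAggregator(rules)
for _ in range(5):
    triggered = engine.ingest("var_partition_full")
    print(engine.counts["var_partition_full"], triggered)
```

Duplicate occurrences that match no threshold trigger nothing, which is what suppresses the repeated pages. A real system would also need time windows, persistence and actual script execution, all of which this sketch leaves out.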

If you have read this far, then you definitely need help! I had the same set of problems, and that led me to create CitoEngine. Let me show you the path to a hassle-free alert management system: CitoEngine. In this talk, I will explain my approach to solving this problem and (if time permits) give a quick demo of the tool.

Speaker bio

I have been cleaning /var partitions since '96. This was the time when Squid proxies were life savers (still are!), RAM was Rs. 1000 per MB, internet was 33.6 kbps and setting up sendmail servers got you Rs. 25,000. I have seen technology evolve from dreamy theories to actual mainstream products but, at the same time, a lot of the practices remain the same. Having spent the better part of my career as a System Administrator, I know the pains of managing infrastructure.

I work for InMobi, solving operational problems.

Comments

  • 1
    [-] Sreekandh Balakrishnan (@gnuyoga) 3 years ago

    This would definitely be an interesting topic to hear. Did you release the tool as open source? If not, are you sharing just the approach to how you solved it? Is there any other alternative that you have seen evolving in the market?

    • 1
      [-] Cyrus Dasadia (@extremeunix) 3 years ago

      Hey Sreekanth, the code will be released by April 1st at the GitHub link pasted above. The talk will consist of my approach as well as a demo of the tool (I've updated the description). Actually, there is one tool out there, http://riemann.io/, which provides a huge list of features, one of which could be used to solve the above problem.

  • 1
    [-] Sreekandh Balakrishnan (@gnuyoga) 3 years ago

    Awesome !!!

    Looking forward to an early draft of the proposal. SlideShare or YouTube link?

  • 1
    [-] Sarath Raman (@sarather) 3 years ago

    Awesome idea, Looking forward to it!

  • 1
    [-] Sreekandh Balakrishnan (@gnuyoga) 2 years ago

    Did you release the code on April 1st?

    • 1
      [-] Cyrus Dasadia (@extremeunix) 2 years ago

      Not yet, Sreekanth. I am targeting the last week of April, as it would give me enough time to finish off the unit tests, documentation, sample plugins, etc.

  • 1
    [-] Sreekandh Balakrishnan (@gnuyoga) 2 years ago

    Summary of our call
    1. What's our problem statement? How did you arrive at wanting to build this? (While managing 60K servers at AOL?)
    2. What are the different kinds of alerts? (Can you state some examples?)
    3. What tools are available to manage different alerts?
    4. Is it close to CloudWatch, M/Monit or Sensu in any way?
    5. Talk more about your experience of dealing with servers than about tooling.
    6. Demo live usage of your tool at InMobi (take care of internal NDA ;) )
    7. What value has this brought to the customer (in this case InMobi)?
    8. Do we have a dashboard of sorts to see what's happening across the servers?
    9. How scalable is your code right now?
    10. Open Source or source open ;)

  • 1
    [-] Cyrus Dasadia (@extremeunix) 2 years ago

    Uploaded the presentation to SlideShare.
