by Jabir Ahmed (@jabirahmed) on Wednesday, 29 January 2014

Status: Confirmed & Scheduled
Section
Full talk

Technical level
Intermediate

Objective

To understand how to

  1. Monitor systems
    a. Nagios
    b. Ganglia
  2. Analyse Root cause
  3. Automate the fix
  4. Log / Record Incidents

Description

Production systems are always P1, and keeping them up and scaling them is what keeps everyone on their toes.

However, we have cracked some important automation that could make a DevOps engineer's life drastically easier.

We let our 1000+ servers across 4 regions heal by themselves, and let the Operations team focus on bigger tasks that could add more impact to the organisation.

This ensures that we are not doing the same task over and over again, and it increases productivity and scalability across the application stacks.

A simple example would be something like logrotate: instead of someone cleaning up logs by hand every day, it repeats that task on your behalf so that old logs get purged daily.
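The idea of handing a repetitive clean-up task to a script can be sketched as follows. This is only an illustration of the principle, not logrotate itself and not the speaker's tooling; the function name `purge_old_logs`, the `*.log` pattern and the 7-day retention are assumptions made for the example.

```python
import os
import time
from pathlib import Path


def purge_old_logs(directory, max_age_days=7, pattern="*.log", now=None):
    """Delete files matching `pattern` in `directory` older than `max_age_days`.

    Returns the sorted names of the files that were removed.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # age limit as a Unix timestamp
    removed = []
    for path in Path(directory).glob(pattern):
        # Only plain files whose last modification is before the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Run from cron once a day, a script like this does the purge without anyone logging in to the box; in practice logrotate already covers this case and also handles compression and rotation.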

Question: I have a use case that does not have a solution in the open source community.

Answer: Customise it. You can plug in scripts and hooks to fix the problem.

We will discuss how we do it!
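As one illustration of what a plugged-in script could look like, here is a minimal sketch of a Nagios-style event handler that attempts a fix and records the incident. It is not the speaker's implementation. It assumes the handler receives the standard Nagios macros `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` as arguments; the function names, the fix command and the incident log format are hypothetical.

```python
import json
import subprocess
import sys
import time


def should_fix(state, state_type, attempt, soft_attempt=3):
    """Decide whether to attempt an automated fix.

    Follows the common event-handler pattern: act on a HARD CRITICAL
    state, or on one chosen SOFT retry shortly before it turns HARD.
    The choice of the 3rd soft attempt is an assumption.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and int(attempt) == soft_attempt


def handle_event(state, state_type, attempt, fix_cmd, incident_log):
    """Run `fix_cmd` when warranted and append a JSON incident record."""
    if not should_fix(state, state_type, attempt):
        return None
    # fix_cmd is a list such as ["/path/to/restart-script"] (hypothetical).
    result = subprocess.run(fix_cmd, capture_output=True, text=True)
    record = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "fix_cmd": fix_cmd,
        "fix_exit_code": result.returncode,
    }
    # One JSON object per line keeps the incident log easy to grep and parse.
    with open(incident_log, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

A wrapper script would read the three macro values from `sys.argv` and call `handle_event` with a fix command suited to the failing service. Keeping the decision in `should_fix` separate from the action makes it possible to test the policy without restarting anything.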

Requirements

Knowledge

  1. Nagios
  2. Any flavor of Linux
  3. Bash/Shell scripting
  4. Scripting in Perl / Python
  5. Programming / Automation
  6. ActiveMQ (an added advantage)

Speaker bio

Jabir Ahmed.
Hadoop Big Data Platform Team @ Inmobi

Tech Lead
Hadoop System Engineer, Yahoo, Bangalore

http://in.linkedin.com/in/jabirahmed/

Comments

  • Sreekandh Balakrishnan (@gnuyoga) 3 years ago

    Can you expand on points 2 and 3, please?
