Production is Priority - Self Fix / Heal Techniques
Submitted by Jabir Ahmed (@jabirahmed) on Wednesday, 29 January 2014
To understand how to:
- Monitor systems
  a. Nagios
  b. Ganglia
- Analyse the root cause
- Automate the fix
- Log / record incidents
Production systems are always P1, and keeping them up and scaling them is what keeps everyone on their toes.
However, we have cracked some important automation that can make a DevOps engineer's life drastically easier.
We let our 1000+ servers across 4 regions heal by themselves, and let the Operations team focus on bigger tasks that could add more impact to the organisation.
This ensures that we are not doing the same task over and over again, and it increases productivity and scalability across the application stacks.
A simple example is logrotate: instead of someone cleaning up logs by hand every day, it repeats that task on your behalf so that old logs get purged daily.
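logrotate itself is driven by configuration rather than code, so the following is only a rough Python sketch of the kind of repetitive clean-up being automated. The function name `purge_old_logs`, the `.log` suffix and the age threshold are assumptions made for illustration; this is not how logrotate is implemented.

```python
import os
import time
from typing import List, Optional


def purge_old_logs(directory: str, max_age_days: float,
                   now: Optional[float] = None) -> List[str]:
    """Delete *.log files in `directory` older than `max_age_days`.

    Hypothetical helper: returns the sorted list of paths that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    removed = []
    # Snapshot the directory listing first so deleting does not disturb iteration.
    for entry in list(os.scandir(directory)):
        if not entry.is_file() or not entry.name.endswith(".log"):
            continue
        if entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.path)
    return sorted(removed)
```

Scheduled once a day (for example from cron), a function like this does the clean-up that would otherwise be a manual chore.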
Question: I have a use case that does not have a solution in the open source community.
Answer: Customise it. You can plug in scripts and hooks to fix the problem.
We will discuss how we do it!
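As a hedged illustration of what such a hook might look like, here is a minimal Python sketch of a check, fix, re-check and record loop. The name `self_heal`, the event labels and the in-memory incident list are all assumptions made for this example. It does not use the Nagios or Ganglia APIs and is not the speaker's actual implementation.

```python
import json
import time
from typing import Callable, List


def self_heal(name: str,
              check: Callable[[], bool],
              fix: Callable[[], None],
              incidents: List[str],
              max_attempts: int = 3) -> bool:
    """Run check(); on failure run fix() and re-check, recording every step.

    Hypothetical hook: `check` returns True when healthy, `fix` attempts a repair.
    """

    def record(event: str, attempt: int) -> None:
        # One JSON line per event, so incidents are easy to grep or ship elsewhere.
        incidents.append(json.dumps(
            {"ts": time.time(), "check": name, "event": event, "attempt": attempt}
        ))

    if check():
        return True  # healthy, nothing to fix or record

    record("failure_detected", 0)
    for attempt in range(1, max_attempts + 1):
        try:
            fix()
        except Exception:
            record("fix_raised", attempt)
            continue
        if check():
            record("healed", attempt)
            return True
        record("fix_ineffective", attempt)

    # Automation gave up: this is the point at which a human should be paged.
    record("escalate_to_human", max_attempts)
    return False
```

In practice `check` could wrap a monitoring probe and `fix` a restart script, with the incident lines written to a file or a message queue instead of a list. Capping the attempts keeps a broken fix from looping forever and hands the problem to the Operations team.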
- Any flavor of Linux
- Bash/Shell Scripting
- Scripting in Perl / Python
- Programming / Automation
- ActiveMQ (an added advantage)
Hadoop Big Data Platform Team @ Inmobi
Hadoop System Engineer, Yahoo, Bangalore