Rootconf 2018

On scaling infrastructure and operations

Prevent Human Errors for 99.99% Availability

Submitted by Thripthy Antony (@thripthy) on Monday, 5 March 2018

Technical level

Beginner

Section

Crisp Talk

Status

Submitted

Abstract

Outages caused by human error are often brushed under the carpet as rare occurrences: one overworked engineer, in the middle of the seventh activity of the day, went ahead and deleted the most crucial virtual IP configuration in your landscape. But this view is often far from the truth. Reliability engineers are frequently hit from multiple sides by multiple monitoring tools and availability metrics. Their judgement goes wrong, but usually only in hindsight; at that point in time, with the information available, it was the best choice. In this session, participants will learn how human errors should be analyzed and controlled. Human errors, or handling errors, give enterprises a chance to examine systemic issues and correct them for an always-available service.

Outline

I will start the session with some famous human errors that cost the respective organizations considerable money and reputation. From there I will move on to strategies for understanding handling errors and effective methods to prevent them.

We will discuss some strategies that help organizations and teams analyze and prevent human errors.
• Accident-prone area - Go slow. Automation fails. No matter how robust your scripts are, there is a chance they will fail, and your organization should be equipped to carry out the task manually when required, without error. While working on disaster recovery or high availability setups, there is a big chance of human mistakes because of similar system names and multiple datacenters. So add visual cues in your manuals marking the procedure as accident-prone, and schedule ample time.

Identifying such procedures also helps you plan for no parallel activities and a comparatively free shift for the person executing them.

• Checklists. I cannot emphasize enough how thoughtful checklists save your systems. Most often it’s the mundane tasks that get missed and result in serious outages. Because everybody knows them and the steps are comparatively simple, they don’t get documented. This omission comes back later in the form of handling-error outages. Simple but important steps must be part of a checklist at each stage.

• Fix the past, fix the present and have monitoring in place. So what do you do when you identify a handling error? How do you manage it so that it doesn’t occur again?
Most often people say someone missed something and move on. But that is not enough if you are targeting 99.999% availability for your services. A method that works is for the person responsible for problem management in your organization to hold an in-depth interview with the person who executed the activity. A common mistake is that the information comes from the team's manager, along with an assurance that it will not happen again. But that is just scratching the surface of the symptom. The root cause of the error is much deeper.
Maybe there were confusing messages on the screen, maybe the engineer was handling three other activities in parallel, or maybe the tool did not throw an error message where it should have stopped the process.

• Relook at your shift handover plans. Spend time on key aspects like:
Is there a shift lead available? Are there other activities planned for the shift lead? Have you defined a process for shift handover? Is there a checklist available?
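
The checklist and accident-prone ideas above could be sketched in code roughly as follows. This is only a minimal Python illustration under stated assumptions, not the tooling used in the initiative described here: the names `Step`, `Checklist`, `accident_prone`, and the system name `db-prod-dc1` are all hypothetical. The sketch records completed steps and, for steps flagged as accident-prone, requires the operator to retype the exact target system name before the step is accepted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    description: str
    # Hypothetical flag: marks a step that needs extra care (the "visual cue").
    accident_prone: bool = False


@dataclass
class Checklist:
    target_system: str  # hypothetical system name, e.g. "db-prod-dc1"
    steps: List[Step]
    completed: List[str] = field(default_factory=list)

    def confirm(self, step: Step, typed_system: str = "") -> bool:
        """Mark a step as done.

        Accident-prone steps are refused unless the operator has retyped
        the exact target system name, which guards against confusing
        similarly named systems in different datacenters.
        """
        if step.accident_prone and typed_system != self.target_system:
            return False
        self.completed.append(step.description)
        return True

    def missing(self) -> List[str]:
        """Return the steps not yet confirmed, in checklist order."""
        return [s.description for s in self.steps
                if s.description not in self.completed]


# Example: a two-step checklist with one accident-prone step.
check_incidents = Step("Check open incidents")
remove_vip = Step("Remove virtual IP configuration", accident_prone=True)
checklist = Checklist("db-prod-dc1", [check_incidents, remove_vip])
```

In this sketch, confirming `remove_vip` with the name of a sibling system such as `db-prod-dc2` is rejected, and `missing()` can be reviewed at shift handover to see what is still open.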

The key takeaway for participants will be effective methods to handle human errors in the workplace. I will also add some real-life examples for each scenario during the presentation.

Requirements

None

Speaker bio

I work in Problem Management and Change Management in one of the Cloud Units at SAP as a Process Manager. I have 13 years of industry experience with a strong background in operations. We have been running a unique initiative at my organization to control and prevent human errors and minimize production outages. I am heading the initiative now, and the insights we have gained while running this project are worth sharing with other teams and organizations looking for a zero-outage service portfolio.

Links

Preview video

https://youtu.be/qKoQ3egBa08

Comments

  • Ramanan Balakrishnan (@ramananbalakrishnan) 8 months ago

    This seems like an interesting proposal.

    You highlight a lot of valid points - most issues do start with someone fat-fingering a config file somewhere.

    However, it seems like a lot of ground to cover, can you add draft slides on how everything would fit into one cohesive (actionable) story?

  • Thripthy Antony (@thripthy) Proposer 8 months ago

    Hi Ramanan, I have updated the draft slides and preview video.
    Regards, Thripthy
