by Srihari Sriraman (@ssrihari) on Sunday, 17 January 2016

Status: Confirmed & Scheduled
Section
Full talk

Technical level
Intermediate

Objective

Learn about the problems you will encounter while building or using Postgres clusters for high availability, and how to solve them.

Description

What this talk is about

We engineered a Postgres database cluster last year. It was a lot of learning and a lot of fun! This talk is about the failure scenarios we designed for, the times when the designed system failed, and what we learnt from them.

A brief introduction to the talk:

  • Database clusters are built for one purpose – dealing with failure. Thinking about what can go wrong, designing for failure scenarios, and building multiple lines of defence was most of the work involved.

  • Building, instrumenting, monitoring and automating the setup of a database cluster isn’t easy. It involves many moving parts, each of which is subject to a certain amount of failure. We had to do this ourselves because there isn’t an existing solution out there.

  • Obviously, we ran into issues: the failover wasn’t quick enough, there were network issues, we had multiple masters, we had to recover from filesystem snapshots, we had to wait days for standbys to catch up, etc. Each of these circumstances helped us understand and refine our cluster setup.

As an aside

  • Given the theme “learning from failure”, and given that database systems is the first category mentioned, it feels like this talk would fit like a hand in a glove.

Skeleton of the talk:

  1. Introduction to Postgres clusters
    • Introduce the cluster setup, its purpose, how it is expected to work, and the moving parts in the system.
    • [5 minutes]
  2. Postgres replication
    • Briefly explain “streaming replication”, then explain what can go wrong here: hardware constraints, WAL config, long-running queries on standbys, and timeouts. This will broadly cover the cases involving two databases.
    • [10 minutes]
  3. Failover setup
    • Briefly explain what repmgr does, then explain what can go wrong: multiple masters, no masters, automatic failover not working, a node being unreachable or only partially reachable, etc. This will cover the cases involving at least 3 databases.
    • [10 minutes]
  4. Application <=> Database communication
    • Explain what can go wrong here, and then the Push/Pull mechanisms we built to deal with it.
    • [5 minutes]
  5. Disaster scenarios
    • What to do when the cluster is down, what to do to save your data, which backup/restore mechanism will work best for you, how to use filesystem backups, and when not to rely on them.
    • [10 minutes]

Speaker bio

Srihari is a FOSS enthusiast. He has contributed to GIMP, Eclipse, and Diaspora, and is excited about opportunities to give back. Over the last couple of years, he has worked on building an experimentation platform, delving into a particularly dense domain, meeting tight latency SLAs, and engineering assembly lines in software using Clojure.

He loves Postgres – he has worked on implementing a high availability solution using repmgr and Postgres’ streaming replication, and has spent an inordinate amount of time optimizing queries.

He is a partner at nilenso, a hippie tree hugging bicycle riding software cooperative based in Bangalore. He blogs, plays basketball, and performs carnatic music occasionally.

Comments

  • Zainab Bawa (@zainabbawa) a year ago

    Excellent job with submitting the outline. Thanks Srihari.

  • Rahul Jain (@rahulj51) a year ago (edited a year ago)

    One of the best talks from yesterday’s lineup because the speaker didn’t talk about any fancy tools or state the obvious. Also, great slides. Could have been shorter though.
