by Imran Basha (@syedimranbasha) on Monday, 6 February 2017

Status: Submitted
Section
Full talk of 40 mins duration

Technical level
Intermediate

Abstract

This talk presents Amazon Simple Workflow Service (SWF) as infrastructure for building distributed, scalable, scheduled background jobs. We will show how to replace a Cron-based workflow with Amazon SWF.

Key takeaways
What is a Cron-based workflow?
Issues with Cron-based workflows
What is SWF?
How to replace a Cron-based workflow with SWF
How to monitor SWF-based workflows using AWS CloudWatch

Outline

Problem statement

We need background jobs that run on multiple clusters on a daily schedule. These jobs process millions of data records every day. To balance the load, we divide the data across all the clusters and then trigger the jobs so that each one runs on its own cluster's portion of the data. This is explained in Slide #3 of the video. Previously, we spawned worker processes on the cluster machines from a master machine where the crontab files were set up. This is a typical requirement for running background jobs in a distributed setup.
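The data split described above can be sketched roughly as follows. This is only an illustration under simple assumptions: records are identified by integer IDs, every cluster has equal capacity, and the names `partition_records`, `cluster-a`, and so on are hypothetical rather than taken from our system.

```python
from typing import Dict, List, Sequence


def partition_records(record_ids: Sequence[int],
                      clusters: Sequence[str]) -> Dict[str, List[int]]:
    """Split record_ids into contiguous, near-equal chunks, one per cluster."""
    if not clusters:
        raise ValueError("at least one cluster is required")
    base, extra = divmod(len(record_ids), len(clusters))
    assignment: Dict[str, List[int]] = {}
    start = 0
    for index, cluster in enumerate(clusters):
        # The first `extra` clusters take one additional record each.
        size = base + (1 if index < extra else 0)
        assignment[cluster] = list(record_ids[start:start + size])
        start += size
    return assignment


if __name__ == "__main__":
    plan = partition_records(range(10), ["cluster-a", "cluster-b", "cluster-c"])
    for cluster, ids in plan.items():
        print(cluster, ids)
```

Each cluster's job then only needs to be told which chunk it owns, which is the piece of information the scheduler has to deliver reliably.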

Issues with Cron-based scheduling

  1. Lack of failure handling
  2. Lost tasks
  3. Difficult to scale
  4. Not an option on shared hosting setups
  5. Single point of failure

How SWF helped in our particular use case

Cron is good for running a job on a single machine on a schedule. Distributed execution across a cluster, however, needs coordination, failure handling, and scalability, none of which come out of the box with Cron-based solutions. SWF helps us create a distributed workflow that runs at scheduled intervals and submits tasks to worker processes running on individual machines, which pick up the tasks and execute them.
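A minimal sketch of the worker side of that arrangement is shown below. The three method names follow the boto3 SWF client (`poll_for_activity_task`, `respond_activity_task_completed`, `respond_activity_task_failed`), but the domain, task list, identity, and handler are hypothetical, and `FakeSwfClient` is an in-memory stand-in so the example runs without AWS. It is not the code from our migration, which used the Flow Framework.

```python
import json
from typing import Any, Callable, Dict, List, Tuple


def run_worker_once(swf_client: Any, domain: str, task_list: str, identity: str,
                    handler: Callable[[Dict[str, Any]], Dict[str, Any]]) -> bool:
    """Poll for one activity task, run it, and report the outcome.

    Returns True if a task was handled, False if the poll returned no work.
    """
    task = swf_client.poll_for_activity_task(
        domain=domain, taskList={"name": task_list}, identity=identity)
    token = task.get("taskToken")
    if not token:
        return False  # the poll ended without a task being available
    try:
        result = handler(json.loads(task.get("input") or "{}"))
    except Exception as exc:
        # Report the failure so the workflow can decide whether to retry.
        swf_client.respond_activity_task_failed(
            taskToken=token, reason=type(exc).__name__, details=str(exc))
        return True
    swf_client.respond_activity_task_completed(
        taskToken=token, result=json.dumps(result))
    return True


class FakeSwfClient:
    """Hypothetical in-memory stand-in for the real client, for illustration."""

    def __init__(self, tasks: List[Dict[str, Any]]) -> None:
        self._tasks = list(tasks)
        self.completed: List[Tuple[str, str]] = []
        self.failed: List[Tuple[str, str]] = []

    def poll_for_activity_task(self, domain, taskList, identity):
        return self._tasks.pop(0) if self._tasks else {}

    def respond_activity_task_completed(self, taskToken, result):
        self.completed.append((taskToken, result))

    def respond_activity_task_failed(self, taskToken, reason, details):
        self.failed.append((taskToken, reason))


def count_records(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical activity: pretend to process a chunk of record IDs."""
    return {"processed": len(payload["record_ids"])}


if __name__ == "__main__":
    client = FakeSwfClient(
        [{"taskToken": "token-1", "input": json.dumps({"record_ids": [1, 2, 3]})}])
    while run_worker_once(client, "batch-jobs", "cluster-a-tasks", "worker-1",
                          count_records):
        pass
    print(client.completed)
```

Because the worker only talks to the service and never to the master machine, adding capacity amounts to starting more processes that poll the same task list.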

Benefits of using SWF

  1. SWF takes care of coordination between the workflow and the worker processes.
  2. The architecture becomes scalable because state management is owned by SWF and the workflow and worker processes are loosely coupled.
  3. SWF is a better way to handle distributed execution because it provides the Flow Framework for managing distributed-application issues such as failures and retries.
  4. The solution has a clear separation of concerns.
  5. If a machine goes down, its load is automatically transferred to another machine.
  6. SWF provides an end-to-end solution, including monitoring metrics.
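The failover behaviour in point 5 can be pictured with a toy model. The sketch below is a plain in-memory simulation, not SWF itself: it assumes a task handed to a dead worker simply times out and returns to a shared queue, and the function and worker names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Set


def drain_with_failover(tasks: List[str], workers: List[str],
                        down: Set[str]) -> Dict[str, List[str]]:
    """Hand tasks to workers in turn; tasks that reach a down worker are re-queued."""
    if all(worker in down for worker in workers):
        raise RuntimeError("no healthy workers left")
    pending: Deque[str] = deque(tasks)
    completed: Dict[str, List[str]] = {worker: [] for worker in workers}
    turn = 0
    while pending:
        task = pending.popleft()
        worker = workers[turn % len(workers)]
        turn += 1
        if worker in down:
            # Simulated timeout: the task goes back on the shared queue.
            pending.append(task)
            continue
        completed[worker].append(task)
    return completed


if __name__ == "__main__":
    result = drain_with_failover(
        ["chunk-1", "chunk-2", "chunk-3", "chunk-4"],
        ["machine-1", "machine-2"],
        down={"machine-2"},
    )
    print(result)
```

With a crontab on a master machine, the chunks assigned to the dead machine would simply be lost for that day; here every chunk still ends up completed by the surviving worker.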

This way we end up with a loosely coupled, highly scalable, distributed solution, with coordination and state management taken care of by Amazon SWF.

SWF is not a job scheduler. A better description is that it provides the framework and services needed to create a distributed workflow that can be executed on multiple machines in a loosely coupled and highly scalable manner.

I have tried to explain the above in the submitted video, and I would be happy to answer any follow-up questions.

Speaker bio

I am Imran. I have been working at Intuit for 6 years and have around 13 years of experience overall. I have been fortunate to explore and contribute to a broad range of technologies, some of them in depth. My work is primarily full-stack web application development, in both .NET (using Web API) and Java (using Jersey), along with SPA development based on Backbone, Marionette, and React + Relay + GraphQL. I am a technology enthusiast. I was the architect involved in migrating our Cron-based workflow to AWS SWF. I learned a lot on the journey from an unreliable Cron-based infrastructure to a reliable, distributed, and highly scalable architecture based on AWS SWF, and I want to share those learnings so that others can benefit.

Comments

  • Sankalp Verma (@sankalpv), 2 months ago

    This would be a great topic to learn about.

  • Philip Paeps (@trouble), 2 months ago

    This sounds a lot like “how to replace cron with something a lot more complicated”. I’m obviously completely ignorant here, but I can’t figure out from your presentation why you would want to.

    Could you explain what category of problems SWF is trying to solve and how it solved your particular problem? Could you elaborate on some of the features SWF has and why one would want them? How is SWF different from any other distributed job scheduler? Why is it better? Why is it better for your particular use case? What other options have you evaluated?

  • Imran Basha (@syedimranbasha), 2 months ago

    I have updated the slides with more details.
