Outgrowing cron: what's the next scheduler? [closed]

Condor, OGE, and Torque can all get you there, but only Condor has built-in dependency management, with its DAGMan tool. DAGMan lets you set up a directed acyclic graph that describes your workflow, and the manager takes care of moving through the jobs in that workflow and evaluating pass/fail results at each step. Condor is relatively platform agnostic, which means DAGMan is too, and you can certainly have a child step run on AIX when the parent ran on Linux or Windows. DAGMan isn't concerned with where jobs run, just whether their exit codes indicate pass or fail.
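To make the idea concrete, here is a minimal Python sketch of DAG-driven execution: run each job only after its parents, and treat exit code 0 as pass and anything else as fail. This is not DAGMan or any Condor API; `run_workflow`, the job names, and the "skipped" status are made up for illustration, and it assumes Python 3.9+ for `graphlib`.

```python
import subprocess
import sys
from graphlib import TopologicalSorter


def run_workflow(commands, parents):
    """Run each job after its parents; skip jobs whose parents did not pass.

    commands: job name -> argv list to execute
    parents:  job name -> set of parent job names (all must be in commands)
    Returns:  job name -> "pass", "fail", or "skipped"
    """
    graph = {name: set(parents.get(name, ())) for name in commands}
    status = {}
    # static_order() yields every parent before its children and
    # raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        if any(status[p] != "pass" for p in graph[name]):
            status[name] = "skipped"  # a parent failed or was skipped
            continue
        exit_code = subprocess.run(commands[name]).returncode
        status[name] = "pass" if exit_code == 0 else "fail"
    return status


if __name__ == "__main__":
    py = sys.executable  # hypothetical jobs: tiny Python one-liners
    commands = {
        "prepare": [py, "-c", "print('prepare')"],
        "analyze": [py, "-c", "import sys; sys.exit(1)"],
        "report": [py, "-c", "print('report')"],
    }
    parents = {"analyze": {"prepare"}, "report": {"analyze"}}
    print(run_workflow(commands, parents))
```

In this example "analyze" exits non-zero, so "report" is never started. A real workflow manager layers much more on top of this core loop, and it hands jobs to a scheduler instead of running them locally.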

Any tips for choosing the software or whether it is better to go open source or commercial?

With some caveats I think the free communities in this space are well worth looking at.

OGE is in a weird space now. It's no longer free to run the Oracle-produced GE variant, and Oracle is no longer contributing the code it writes back to the GE SCC, but there are several forks of the code that are trying to soldier on as free, open source projects. Univa in particular has led the charge, hiring ex-Sun GE devs to continue work on an open source, freely available GE variant. Grid Engine has two things going for it: it's easy to set up, and it can handle short-running (<2 minute) jobs without imposing a lot of scheduling overhead that slows down throughput. Its big downside is that Windows support is not very good. Some of us put some effort into porting it to run on Cygwin many years ago, but it's not as good as native, that's for sure.

Now Condor is my favourite of the three technologies you mentioned. There's a strong community around Condor and the software is very mature (>20 years old now). Native Windows and POSIX OS support means it runs all over the place very well. The aforementioned DAGMan, an incredibly powerful tool for workflow management, is just one of the many great pieces that come with Condor. It can be a touch complicated to set up, but once it's up and running it's rock solid. It has an incredibly flexible language for doing job <-> machine matching and building the usage rules for your resources. It also supports dynamic provisioning on machines, letting jobs select how much of a machine's resources they need and then re-advertising the difference as still available. It supports global resource counters so you can constrain against things like software licenses. The downside to Condor is that the scheduling overhead for short-running jobs can be burdensome. Ideally you want jobs that run longer than 2 minutes; otherwise scheduling starts to become a big part of the job's time in the system.
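As a rough illustration of dynamic provisioning and global resource counters, here is a toy Python model. It is not Condor's matching language or API: `Machine`, `Job`, `place`, and the "matlab" license are all hypothetical names, and first-fit placement is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    cpus: int        # CPUs currently advertised as free
    memory_mb: int   # memory currently advertised as free


@dataclass
class Job:
    name: str
    cpus: int
    memory_mb: int
    licenses: dict = field(default_factory=dict)  # license name -> count


def place(job, machines, license_pool):
    """Return the name of the machine the job lands on, or None.

    On a match, the machine keeps advertising only what is left over,
    and the pool-wide license counters are decremented.
    """
    # Global counters: refuse the job if any license is exhausted.
    if any(license_pool.get(lic, 0) < n for lic, n in job.licenses.items()):
        return None
    for m in machines:
        if m.cpus >= job.cpus and m.memory_mb >= job.memory_mb:
            m.cpus -= job.cpus            # re-advertise the difference
            m.memory_mb -= job.memory_mb
            for lic, n in job.licenses.items():
                license_pool[lic] -= n
            return m.name
    return None


if __name__ == "__main__":
    machines = [Machine("node1", cpus=8, memory_mb=16000)]
    license_pool = {"matlab": 1}
    print(place(Job("sim", 2, 4000, {"matlab": 1}), machines, license_pool))
    print(machines[0])  # node1 now advertises the remaining CPUs and memory
```

A second job asking for the same license would be held back even though node1 still has free CPUs, which is the point of a pool-wide counter.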

Torque is a little more niche, and I know less about it, I'm afraid. It compares more to Grid Engine than to Condor. There are paid add-ons, which @warren mentioned, that can expand what the basic, free Torque can do.

If you want to try out the three technologies and see how they work with your specific workloads, CycleCloud can spin up secure, virtualized pools that are pre-configured with Condor, GridEngine or Torque -- so you spend no time figuring that stuff out yourself. It'd cost a few dollars to spin up small pools of each technology and try them with representative workloads. (Disclaimer: I work for Cycle Computing, we make CycleCloud)


Chronos looks very promising.

Chronos is Airbnb's replacement for cron. It is a distributed and fault-tolerant scheduler that runs on top of Apache Mesos. You can use it to orchestrate jobs. It supports custom Mesos executors as well as the default command executor, so by default Chronos executes sh (on most systems bash) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the Mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transferring files to a remote machine, executing them there in the background, and using asynchronous callbacks to notify Chronos of job completion or failure.
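For a feel of what a Chronos job looks like, here is a small Python sketch that builds a job definition as JSON. Chronos describes schedules with ISO 8601 repeating-interval notation (`R[count]/start/period`). The field names below (`name`, `command`, `schedule`, `owner`) follow the examples in the Chronos README as best I recall, so treat them as an assumption and check them against the version you deploy. The helper `chronos_job` and the sample values are hypothetical.

```python
import json
from datetime import datetime, timezone


def chronos_job(name, command, start, period, owner, repeats=None):
    """Build a job definition dict with an ISO 8601 repeating schedule.

    start:   timezone-aware datetime of the first run
    period:  ISO 8601 duration string, e.g. "PT24H" for daily
    repeats: number of repetitions, or None to repeat forever
    """
    count = "" if repeats is None else str(repeats)
    start_utc = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "name": name,
        "command": command,  # run by the default command executor (sh)
        "schedule": f"R{count}/{start_utc}/{period}",
        "owner": owner,
    }


if __name__ == "__main__":
    job = chronos_job(
        name="nightly-report",                    # hypothetical job
        command="/usr/local/bin/build_report.sh",  # hypothetical script
        start=datetime(2014, 5, 2, 3, 0, tzinfo=timezone.utc),
        period="PT24H",
        owner="ops@example.com",
    )
    print(json.dumps(job, indent=2))
```

The resulting JSON is what you would submit to the scheduler's REST API. The sketch stops short of the HTTP call because the endpoint path differs between Chronos releases.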

I've also had great personal success using Jenkins as a cron replacement. It handles executing jobs on remote servers quite nicely. Here's a writeup on it: http://www.22ideastreet.com/blog/2014/05/02/replace-local-cron-with-jenkins/