Any ideas on how to run maintenance on a site that is always in use?

Solution 1:

There are a lot of things you could be doing to improve your deployment process. A few of them are:

  • Ensure your code is well tested.

    Ideally you should have 100% unit test coverage, as well as integration testing for every conceivable scenario.

    If you haven't got this, you should probably drop everything and get this taken care of.

    Look into behavior-driven development.

    Having a complete test suite will allow you to...

  • Run continuous integration.

    Whenever someone commits a change, CI can then automatically run the test suite on it. If the test suite passes, it can then deploy immediately (or schedule a deployment). For changes that don't require any significant change to your databases, this alone will save you a lot of time and headache.

    In case of a problem, CI can also give you a one-click rollback.

    CI is much less useful if your test suite isn't complete and correct, as the entire premise rests on being able to validate your code in an automated way.

  • Make atomic updates.

    Ideally you should not just be copying new files over the old ones on the production server. Instead, use a tool such as Capistrano, which copies every file to a new location and then uses a symbolic link to point to the desired deployment. Rolling back is nearly instantaneous, as it simply involves changing the symlink to point to the previous deployment. (Though this doesn't necessarily cover your database migration.)

    Also look into whether containers such as Docker can help you.

  • Make smaller, more frequent changes.

    Whether you have tests, CI, or nothing, this alone can help you significantly. Every change should have its own git branch, and a deployment should have as few changes as possible. Because changes are smaller, there is less to potentially go wrong during a deployment.

    On that note, make changes more isolated whenever possible. If you've made a change to the Omaha game, and it doesn't affect Texas Hold'em, 5-card stud or anything else, then Omaha is the only game that needs to be suspended for maintenance.

  • Analyze anything long-running.

    You mentioned some parts of your deployments take a long time. This is probably database schema changes. It's well worth having a DBA look at your database, along with each schema change, to see what could perform better.

    Have a subject matter expert look at any other part of a deployment which takes up large blocks of time.

  • Work odd hours.

    You may already be doing this, but it bears mentioning. Developers (and sysadmins!) should not be expected to work "9 to 5" anymore, especially for a 24x7 operation. If someone is expected to spend the overnight hours babysitting a deployment, fixing any problems, and then keep a daytime schedule, your expectations are unrealistic, and you are setting that person up for burnout.
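
To illustrate the "Make atomic updates" point above, here is a minimal Python sketch of the symlink switch, assuming a POSIX system. The layout (a `releases` directory with one subdirectory per deployment, and a `current` symlink that the web server serves from) and the function name `activate_release` are assumptions made for the example, not something a particular tool mandates.

```python
import os
from pathlib import Path


def activate_release(releases_dir: Path, release_name: str, current_link: Path) -> None:
    """Point `current_link` at releases_dir/release_name in one atomic step.

    Hypothetical helper: the directory layout is an assumption for this sketch.
    """
    target = releases_dir / release_name
    if not target.is_dir():
        raise FileNotFoundError(f"no such release: {target}")

    # Build the new symlink under a temporary name next to the real one.
    tmp_link = current_link.with_name(current_link.name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(target, tmp_link)

    # Rename the temporary link over the old one. On POSIX, rename replaces
    # the destination atomically, so there is no moment without a valid link.
    os.replace(tmp_link, current_link)
```

Rolling back is the same call with the previous release name, for example `activate_release(releases, "v1", current)`. As noted above, this only covers the files; database migrations need their own rollback plan.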

Solution 2:

From what you say, it seems that you already have a maintenance window from 1 am to 7 am every day, so the issue is not time but convenience. This is normal, and many people just deal with it as part of doing business.

You could have two (or more) backend systems with a front end that directs traffic to whichever is currently live. Once you are happy that a release is going to work, you tell the front end to switch to the new system. This should be easy to script and take only a short time.

Now you have a choice: either leave the old system as is, so you can back out, or bring it up to date so it can be used as a spare for the live system until it's time to build and test the next updates.

Solution 3:

To add to the other answers: you should follow the blue-green deployment model. When you want to release a new version, you deploy it to an internal staging website. Then you can run automated tests against that next version of the production site. Once the tests pass, you point the load balancer at the new website.

This helps in the following ways:

  1. Severe problems are found on the staging site before the switch, so finding them costs no downtime.
  2. Switching to a new version has exactly zero downtime because the new version is already started and warmed up.
  3. You can switch back to the old version at any time because it is still physically running.

All the other problems that you and others have mentioned become less severe when you can deploy at any time in a stress-free manner. The blue-green deployment model is a fairly complete solution for deployment problems.
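
The switching logic described above can be sketched in a few lines of Python. This is only a model of the decision, under the assumption that something else (a load balancer API, a config reload) actually moves the traffic. The class name `BlueGreenRouter` and the shape of the checks (callables that take an environment name and return `True` on success) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class BlueGreenRouter:
    """Tracks which of two environments receives live traffic (illustrative only)."""

    live: str = "blue"
    idle: str = "green"

    def promote_idle(self, checks: Iterable[Callable[[str], bool]]) -> bool:
        """Switch traffic to the idle environment only if every check passes."""
        if not all(check(self.idle) for check in checks):
            # The live environment was never touched, so there is nothing to undo.
            return False
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self) -> None:
        """Send traffic back to the previous environment, which is still running."""
        self.live, self.idle = self.idle, self.live
```

A deployment script would deploy the new version to `router.idle`, call `promote_idle` with its smoke tests, and keep `rollback` available for as long as the old environment is left running.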

Solution 4:

What will you do if your main data centre suffers an outage, which happens at all data centres from time to time? You might accept the downtime, you might fail over to another data centre, you might be running in active-active mode in multiple data centres all the time, or you might have some other plan. Whichever one of those it is, do it when you do releases, and then you can take your main data centre down during a release. If you're prepared to have downtime when your data centre has an outage, then you're prepared to have downtime, so it shouldn't be a problem during a release.