MMORPG server maintenance

It seems that most MMORPGs have regular server maintenance, some every day, some once a week. What is it that they actually have to do, and why is it necessary?

If you were starting such a project, what could you do to avoid this?


I suspect that they're deploying the latest version of their code, which requires that they restart the application (and, hopefully, run some tests before re-enabling access). From that point of view, it's more of a StackOverflow problem and less of a ServerFault one.

I think it's possible to create a hot-patching system, but it would necessarily be incredibly complicated. From what I understand, an MMO server "application" consists of several different components --

  • Login server -- Handles authentication and acts as a "hub" between gameplay servers. Once a client is in-game they no longer interact with the login server. In such a system you could apply patches and restart the login server without interfering with gameplay (though you'll have a period of time where people won't be able to log in).

  • Gameplay servers -- Clusters of machines grouped into logical independent units ("worlds", etc). It's assumed that each gameplay cluster uses some kind of internal communication protocol to share state with one another; you're probably going to have to patch each cluster all at once. One possible way to do this is to patch a warm failover (see the sketch after this list). You'd then need to be able to both

    1. Signal the client to connect to the warm failover and disconnect from the old cluster.
    2. Keep the state synched between the failover and the out-of-date application server while all of the clients transfer.

  • Database servers -- Some kind of persistent datastore, like an RDBMS. Hopefully you're not making changes to the datastore that often. Presumably each gameplay server/cluster has an independent datastore. You might be able to use the same trick with a warm failover (and tell the gameplay servers to disconnect, wait for the old and failover databases to sync, then reconnect to the failover) but that seems pretty risky to me.
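
To make the warm-failover handoff idea a little more concrete, here's a minimal Python sketch of those two steps. Every class and method name here is invented for illustration; a real engine would do this over its own wire protocol.

```python
# Minimal sketch of a warm-failover handoff. All names (GameCluster,
# send_redirect, drain_state_changes, ...) are hypothetical.

import time


class GameCluster:
    """Stand-in for one gameplay cluster ('world')."""

    def __init__(self, address):
        self.address = address
        self.clients = []          # connected client sessions
        self.pending_events = []   # state changes not yet replicated

    def connected_clients(self):
        return list(self.clients)

    def drain_state_changes(self):
        events, self.pending_events = self.pending_events, []
        return events

    def apply_state_change(self, event):
        pass  # the failover applies replicated state here


def hand_off(old: GameCluster, new: GameCluster, poll_interval=1.0):
    # Step 1: signal every client to reconnect to the warm failover.
    for client in old.connected_clients():
        client.send_redirect(new.address)

    # Step 2: keep the failover synced with the out-of-date cluster
    # until the last client has transferred.
    while old.connected_clients():
        for event in old.drain_state_changes():
            new.apply_state_change(event)
        time.sleep(poll_interval)
    # The old cluster is now idle and can be patched offline.
```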

All of the above cases add an incredible amount of complexity to an already complex system and introduce a bunch of places where a code failure can cause data loss or corruption.

Another solution is to use a language which is designed for 100% uptime and has built-in capabilities for hot-patching running code. Erlang is a good choice (it supports hot code swapping natively), and Java has similar functionality.
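
Erlang's hot code swapping is a language-level feature and works quite differently, but as a toy illustration of the general idea -- swapping in new code without restarting the process -- here's a short Python sketch. The game_logic module and its handle_tick() function are hypothetical.

```python
# Toy hot-reload loop: pick up new code for a handler module without
# restarting the process. "game_logic" is a hypothetical module.

import importlib
import os
import time

import game_logic  # hypothetical module containing handle_tick()

last_mtime = os.path.getmtime(game_logic.__file__)

while True:
    mtime = os.path.getmtime(game_logic.__file__)
    if mtime != last_mtime:            # new code was deployed on disk
        game_logic = importlib.reload(game_logic)
        last_mtime = mtime
    game_logic.handle_tick()
    time.sleep(0.1)
```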


No one else has experience actually running something like this? Huh.

There are several reasons that bridge both code and systems. First, remember that most of the current 'big' MMO engines were programmed several years ago, and despite graphics and technology upgrades since, they still depend on the way many of these systems were written in 2000 or so. Eve-Online, for instance, still runs on one huge Microsoft SQL Server instance, which is why they're always trying to pull more out of it by upgrading hardware.

An example of an improvement since WoW and EVE got started is the work done in distributed data processing and key/value storage, like Google's MapReduce (and its open-source implementation, Hadoop), extremely fast message queue services with acknowledgement semantics (Amazon SQS), and other "cloud"-oriented technologies.

I have the most experience with EVE (I'm more of a lasers guy than a battleaxes guy), so some of these examples are more EVE-oriented.

As far as Systems reasons go:

  • Physical nodes fail on a consistent basis. When a node fails, typically its activity is migrated elsewhere using any number of means. However, the node needs to be put back into service as quickly as possible. In EVE's case, they use both a stackless language (Stackless Python) and virtual servers; I'm not sure what Blizzard's architecture is like.
  • Database consistency needs to get checked, logs need to get flushed, and indexes and data caches need to get rebuilt (a rough sketch of what that pass might look like follows this list). This is especially important in a system like EVE with only one "live" database instance.
  • Operating system patches need to be applied at a time when nodes can be rebooted without forcing too much activity to migrate elsewhere. Migration takes up a lot of network resources that could otherwise be dedicated to online processing.
  • RDBMS-based MMOs have huge issues with data locking and referential integrity. Downtime is used to clean up stale locks and integrity breaks from activity logs.
  • Most of the games implement geographically sited data caches for static or semi-static information (see caching summary data below) in heavy-use areas, e.g. east coast vs. west coast USA. These caches are updated manually during the downtime.
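
To give a rough idea of what that offline consistency/index pass might look like on a SQL Server backend (the kind EVE runs on), here's a sketch. The connection string and table names are invented; DBCC CHECKDB and ALTER INDEX ... REBUILD are the standard SQL Server maintenance commands.

```python
# Sketch of an offline maintenance pass against a SQL Server backend.
# Connection details and table names are invented for illustration.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=db01;DATABASE=game;"
    "UID=maint;PWD=example",
    autocommit=True,
)
cur = conn.cursor()

# Verify the logical and physical integrity of the whole database.
cur.execute("DBCC CHECKDB (game) WITH NO_INFOMSGS")

# Rebuild the most heavily churned indexes while no players are online.
for table in ("dbo.inventory", "dbo.market_orders", "dbo.kill_log"):
    cur.execute(f"ALTER INDEX ALL ON {table} REBUILD")

conn.close()
```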

As far as Software reasons go:

  • Games, when operating, use a lot of OLTP -- that's On Line Transaction Processing -- type reads/writes to databases. However, sometimes you want a summary report... like how many of a particular beast you've killed in the past 3 years of grinding. That's best handled by an OLAP report -- that's On Line Analytical Processing -- which contains summary info based on a lot of rows in a giant dataset. In reality, games implement systems that use OLAP to build a cache that limits the number of rows that need to be read -- i.e., they build a total as of a certain date, and then when you ask the question they just read the rows from the OLTP store covering the time period since that date. Merge the two, and you can actually quantify how worthless your life has become (see the sketch after this list).
  • The aforementioned hot-patching, which I see as a software problem but software developers see as a systems problem. ;)
  • Replenishing stores of items -- in EVE, the asteroid belts are refreshed every night and certain complexes are recycled as well. This can be done to an extent while on-line, but some of the algorithms are too complex and need to be done in an off-line mode because they briefly bring the database to its knees while they summarize the previous day's economic activity.
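
Here's the "summary cache plus recent delta" trick from the first bullet, sketched with sqlite3 standing in for the real datastore. Table and column names are made up for the example.

```python
# OLAP summary cache + OLTP delta: roll up old rows during downtime,
# then answer online queries as cached total + rows since the snapshot.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE kills (player_id INT, beast TEXT, killed_at TEXT);
    CREATE TABLE kill_summary (player_id INT, beast TEXT,
                               total INT, as_of TEXT);
""")


def nightly_rollup(as_of):
    """Run during downtime: summarize everything before `as_of`."""
    db.execute("DELETE FROM kill_summary")
    db.execute("""
        INSERT INTO kill_summary
        SELECT player_id, beast, COUNT(*), ?
        FROM kills WHERE killed_at < ?
        GROUP BY player_id, beast
    """, (as_of, as_of))


def kills_so_far(player_id, beast):
    """Answer the player's question online: cached total + recent delta."""
    row = db.execute("""
        SELECT total, as_of FROM kill_summary
        WHERE player_id = ? AND beast = ?
    """, (player_id, beast)).fetchone()
    total, as_of = row if row else (0, "1970-01-01")
    delta = db.execute("""
        SELECT COUNT(*) FROM kills
        WHERE player_id = ? AND beast = ? AND killed_at >= ?
    """, (player_id, beast, as_of)).fetchone()[0]
    return total + delta
```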

Running an economy with both closed and open loops is one problem for MMO operators -- if you don't believe me, read some of the academic papers that have been written about game economies, and some of the studies of older games like Ultima Online that had relatively primitive economies. The analysis required to replenish the open loops and to identify cheating and other negative economic activity has to happen offline against a snapshot of the data, which can sometimes only be taken while the database is entirely locked.

If you'll note, EVE's maintenance happens when it's noon in England, where the primary datacenter is.


I suspect that the total time Blizzard quotes for maintenance (I'm inferring Blizzard, given that you're posting your question on a Tuesday morning) is for the entire cluster; not every server takes that long to work on.

While it might be possible to bring individual servers back up more quickly, that would elicit cries of favouritism towards players whose realms happened to fall earlier in the schedule. As such, they keep everything down until all the work is done; with hundreds of realms to work on, they probably do much of the work in parallel, but still serialize a final check before bringing things back online. If they're doing a hardware upgrade, this is probably serialized across as many data centres as they have.

As to why they perform the maintenance, some of it might just be a performance reboot. While it would be great if such reboots weren't required, the cost of doing so vs the impact of not doing so may be directing their choice here.

As to why they can't cluster the processes and perform rolling maintenance: what little is publicly known of the WoW infrastructure suggests that while multiple machines provide service for each realm (e.g. one for the world, one for instances and raids, one for battlegrounds, etc.), they don't use a state-shared active-active process setup. There is no sharing of live state, only of persistent data via a database.

In the end, the mechanics of providing a stateful online service to that large a subscriber base challenge some of the best practices that we might espouse when talking about a website or other traditional internet-based service.


Some of the more recent extended downtimes in EVE Online have been about installing new hardware, like a faster SAN. While they could technically have moved the bulk of the data by creating a new filegroup on the new drives and then emptying the main one, that would have resulted in an extended period of reduced performance due to the constant I/O. So they opted to detach the 1.1 TB database and move it in one go.
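
The detach-and-move approach looks roughly like this on SQL Server. The database name, file paths, and connection string below are invented, but sp_detach_db and CREATE DATABASE ... FOR ATTACH are the standard commands; the file copy assumes the script runs on the database host.

```python
# Sketch of detach / move / attach for relocating a database onto new
# storage in one pass. All names and paths are hypothetical.

import shutil

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# 1. Kick everyone off and detach the database from the instance.
cur.execute("ALTER DATABASE game_db SET SINGLE_USER WITH ROLLBACK IMMEDIATE")
cur.execute("EXEC sp_detach_db 'game_db'")

# 2. Copy the data and log files to the new, faster SAN in one go.
for name in ("game_db.mdf", "game_db_log.ldf"):
    shutil.copy2(rf"E:\old_san\{name}", rf"F:\new_san\{name}")

# 3. Re-attach the database from its new location.
cur.execute(r"""
    CREATE DATABASE game_db
    ON (FILENAME = 'F:\new_san\game_db.mdf'),
       (FILENAME = 'F:\new_san\game_db_log.ldf')
    FOR ATTACH
""")
```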

The answer to this question also depends on the specific application. For example, a server handling a specific star system cannot be hot-swapped without disrupting gameplay, so downtime is used to reassign more powerful servers to potential hotspots. In addition, the ownership calculations (sovereignty) of star systems are run. These depend on tens of different variables, all of which can change depending on player actions. Needless to say, doing that live can cause excessive locking and/or other concurrency issues. But addressing those is best left to stackoverflow.