Oracle: 1 Large Server vs. 2 Smaller Servers?
We are in the planning stages of setting up our production Oracle 10gR2 environment. Our budget allows us to buy 2 processor licenses of Oracle DB Standard Edition. We have minimal experience with Oracle, so I'll defer to anyone who has used it. We are trying to decide between a single box with two quad-core processors and two boxes with one quad-core processor each in a RAC configuration.
Our DB right now is about 60 GB, and at our peak, we'll have up to 150 concurrent users. Most of the big stuff is done via batch processing at night.
My gut tells me that having 2 boxes in a RAC configuration can't be a bad thing, because it provides true hardware failover. The DB would be stored on a shared LUN on a SAN, accessed via iSCSI. Plus, if we ever need to add capacity, we would already have boxes in place that could be upgraded with extra processors (with zero downtime, I assume, since they'd be in a RAC configuration) if we buy extra licenses, or with more RAM.
Does RAC carry any performance penalty? Will it add extra latency? Is there any real advantage to having dual-processor boxes running these systems? If we build out the Oracle boxes with special hardware (hardware iSCSI cards, TOE NICs), will they be solid? We are deploying on 64-bit Windows.
So what would you do? One box or two?
Two things I would consider carefully:
1) 10gR2 support is already starting to fall off the maintenance cliff. In about a year it will become VERY expensive to get patches; Oracle has already stated that the last public patches will be released in summer 2010. Is there any reason you are not using 11g when building out a new server?
2) RAC / Data Guard is actually quite painful to set up and maintain. Not only does Data Guard require Enterprise Edition licenses (far more expensive than Standard), your server OS must also be the Enterprise edition and be configured as a Windows cluster.
Personally, if you can tolerate the potential for a short downtime window / potential data loss, you are much better off with the single box, ideally with a "standby" server that could be spun up. These are just my opinions, and I am sure they can be disputed, but really a 60 GB DB with 150 peak users is not that large anymore. The dual quad-core box will probably scale better than the RAC configuration as well, certainly if you put double the RAM in the single box.
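To make the "standby" suggestion concrete: Standard Edition has no Data Guard, so the usual approach is a manually maintained standby (the classic "poor man's standby"). A minimal sketch from SQL*Plus, with hypothetical paths:

    -- On the primary: generate a controlfile the standby can mount
    ALTER DATABASE CREATE STANDBY CONTROLFILE AS 'C:\stage\standby.ctl';
    -- Restore a recent backup of the datafiles plus that controlfile
    -- on the standby host, then there:
    STARTUP NOMOUNT;
    ALTER DATABASE MOUNT STANDBY DATABASE;
    -- The standby now sits mounted, waiting for archived redo logs to apply.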
Keep it simple if possible. One box is better than many for most database applications, unless you have specific reasons such as performance.
Hardware failure is much less likely to drive downtime than something breaking due to a configuration error. Simpler configuration will generally pay off by having fewer failure modes. You might find that you actually get a more reliable system in practice with just one server and a simple replicated database on a standby machine.
Don't go for a tighter SLA than you really need, and don't build a system with additional complexity to achieve an SLA that is not supported by all aspects of the software, operations and hardware.
A slightly different view of 'five nines' reliability.
Doubtless your vendor will claim some impressive reliability statistics for their product. As a counter, here is a take on 'nines' relating various SLA levels to what is actually needed to achieve them in practice:
Two nines (99%, roughly 3.7 days of downtime a year) sort of translates to a 24hr DR window, and may be relevant to applications such as data warehouse systems where the system is not directly supporting operational processes. This service level can be achieved by restoring the system from backups.
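At this service level, recovery really is little more than a scripted restore. A minimal RMAN sketch, assuming usable backups exist and the restore time for your database fits inside the window:

    RMAN> STARTUP MOUNT;
    RMAN> RESTORE DATABASE;
    RMAN> RECOVER DATABASE;
    RMAN> ALTER DATABASE OPEN;

The elapsed time of RESTORE DATABASE against your backup media is what actually sets the DR window, so time it rather than guess.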
Three nines (99.9%, about 8.8 hours a year) sort of equates to a 4hr DR window, and is really all that is needed for most line-of-business applications. This type of DR strategy can be achieved with simple log-shipping replication and a standby server. Often, the simplicity of this type of architecture means that it has relatively few configuration-based failure modes and achieves considerably better reliability in practice. There are plenty of instances of two-tier 4GL applications achieving continuous uptimes of months or years with this type of configuration.
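The "log shipping" half of that is equally unglamorous: a scheduled task copies archived redo logs from the primary to the standby, and the standby rolls forward by applying them. A sketch of the apply side in SQL*Plus, assuming the logs land where the standby expects to find them:

    -- Apply any newly arrived archived redo logs
    RECOVER AUTOMATIC STANDBY DATABASE;
    -- In a disaster, apply what you have and promote the standby:
    -- ALTER DATABASE ACTIVATE STANDBY DATABASE;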
Four nines (99.99%, just under an hour a year) can be viewed as a DR window of a few minutes, and needs a hot-standby or hot-failover architecture. Achieving this type of SLA in practice is quite difficult and typically requires software designed for it. Ironically, the additional complexity of clustered N-tier architectures opens a much wider surface of failure modes due to misconfiguration or slips in change management.
It should be noted that configuration errors and poor change management are the largest causes of unscheduled downtime in data centre operations, and are much more likely to cause an unscheduled outage than hardware failure on a server.

Five nines (99.999%) allows no more than about five minutes of unscheduled downtime every year. This type of SLA puts you in the realm of specialised hardware and software built for fault tolerance. Implementing this type of SLA is expensive, requires a purpose-built platform such as a mainframe, and people with specialised skills that most companies don't have in-house.
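The downtime budgets quoted above are just arithmetic against the roughly 8,766 hours in an average year; you can sanity-check them from any Oracle prompt:

    SELECT '99%' AS sla, ROUND(8766*0.01*60) AS minutes_per_year FROM dual
    UNION ALL SELECT '99.9%',   ROUND(8766*0.001*60)   FROM dual
    UNION ALL SELECT '99.99%',  ROUND(8766*0.0001*60)  FROM dual
    UNION ALL SELECT '99.999%', ROUND(8766*0.00001*60) FROM dual;

That works out to about 5,260, 526, 53 and 5 minutes a year respectively.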
Most people claiming 99.999% SLAs are full of shit, q.v. Microsoft, Accenture and London Stock Exchange.