Diagnosing MongoDB major faulting and erratic behavior

I'd have to get a better look at the trend over time to be sure here (MMS would help), but you may have hit the maximum resident memory available to MongoDB on that instance - the page faults aren't that high, but I do see a small drop in resident memory. If there is memory pressure from elsewhere (another process), MongoDB's pages may be getting evicted and/or the database may be paging to disk more often than it should (a page to disk on EBS is quite slow).
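
If you want to watch that trend yourself in the meantime, a rough sketch (assuming the mongostat and mongo tools that ship with MongoDB are on your path and can reach the mongod) would be:

# print a stats line every 5 seconds - watch the "res" (resident memory) and "faults" columns
mongostat 5

# or spot-check the same counters from the shell
mongo --eval 'printjson(db.serverStatus().mem); printjson(db.serverStatus().extra_info)'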

There are a couple of things you can do to make your RAM usage more efficient here:

  1. Remove unnecessary indexes - they'll just take up valuable RAM even when rarely used - good candidates for removal are single-field indexes whose key is the leftmost element of a compound index elsewhere (the compound index can serve those queries on its own). It will really depend on your usage and schema as to what can be removed, so all I can give are general recommendations - see the sketch after this list for how to inspect what you have.
  2. Tune down the readahead on the EBS volume - this is counter to what you will usually read about tuning EBS volumes, but readahead set too high is a drag on memory efficiency when your access pattern is random rather than sequential.
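
As a rough illustration of the index cleanup in point 1 (the database, collection, and field names here are made up), you can list a collection's indexes and drop a redundant single-field one from the command line:

# list the indexes (and their key patterns) on a collection
mongo mydb --eval 'printjson(db.mycollection.getIndexes())'

# if a single-field index duplicates the leading field of a compound index,
# e.g. { userId: 1 } alongside { userId: 1, createdAt: -1 }, the single-field
# index is usually redundant and can be dropped:
mongo mydb --eval 'printjson(db.mycollection.dropIndex({ userId: 1 }))'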

To take a look at the readahead settings for a volume, run this command (requires root/sudo privileges):

sudo blockdev --report

The output will list something like this:

RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     10737418240   /dev/xvda1

The RA column (at 256, which I believe is the default on Amazon) is what we want to tweak here. You do that by running something like this:

sudo blockdev --setra <value> <device name>

For the example above, I would start by halving the value:

sudo blockdev --setra 128 /dev/xvda1
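
To confirm the new value took effect, you can query the device directly. Also bear in mind that blockdev settings do not survive a reboot, so you will need to reapply them at boot time - a crude but common approach is sketched below (assuming your distribution still executes /etc/rc.local at startup; adjust to whatever boot mechanism you actually use):

# verify the new readahead value for the device
sudo blockdev --getra /dev/xvda1

# reapply the setting at boot (only if /etc/rc.local is run on your system)
echo 'blockdev --setra 128 /dev/xvda1' | sudo tee -a /etc/rc.local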

I go into far more detail about how low you should set this value and the reasoning behind it in this answer if you would like to know more. Note that the change requires a mongod process restart to take effect.

After you have done both of those things, you may be able to squeeze more performance out of the RAM on that xlarge instance. If not, or if the memory pressure is coming from elsewhere and being more efficient is not enough, then it is time to add more RAM.

Upgrading the EBS storage to a RAID volume as you mentioned, or moving to the new Provisioned IOPS and EBS-optimized instances (or the SSD Cluster Compute nodes if you have money to burn), will help the "slow" part of the operations (paging from disk), but nothing beats the benefits of in-memory operations - they are still an order of magnitude faster, even with those disk subsystem improvements.