Best way to index this very large table

Solution 1:

I will assume you have good reasons to use guids as ids.

Fragmentation is mostly an issue for scans, much less so for seeks. Fragmentation has a big impact on read-aheads, and seeks neither use nor need read-aheads. An unfragmented index with a poor column choice will always perform worse than a 99% fragmented index with good, usable columns. If you had described DW-style reporting queries that scan the tables, I would have recommended focusing on eliminating fragmentation, but for the load you describe it makes more sense to focus on efficient (covering) seeks and (small) range scans.
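If you want to actually measure how fragmented the index is before deciding, SQL Server exposes that through a DMV (the table name here is assumed from the question):

```sql
-- Check fragmentation per index on the table; 'LIMITED' mode is the
-- cheapest scan level and is enough for avg_fragmentation_in_percent.
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID(N'dbo.DiaryEntries'), NULL, NULL, 'LIMITED');
```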

Given that your access pattern is always driven by @userId, it must be the leftmost column in the clustered index. I would also add setOn as the second column in the clustered index because it adds some marginal value on most queries (marginal because @userId is already so selective, at worst 90 records out of 90 million, that the extra filtering on setOn is not critical). I wouldn't add any non-clustered index; from the queries you describe, there is no need for any.
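A minimal sketch of that clustered index, assuming the table and column names from the question:

```sql
-- Cluster on userId first (the always-present filter), setOn second.
-- Index name is illustrative.
CREATE CLUSTERED INDEX CIX_DiaryEntries_userId_setOn
    ON dbo.DiaryEntries (userId, setOn);
```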

The only problem is the deletion of old records (the 30-day retention). I would recommend against using a secondary non-clustered index to satisfy this. Instead, I would deploy a weekly partitioning scheme with a sliding window; see How to Implement an Automatic Sliding Window in a Partitioned Table on SQL Server 2005. With this solution, old records are removed by a partition switch, which is the most efficient way possible. A daily partitioning scheme would satisfy the 30-day retention requirement more exactly and is perhaps worth trying and testing. I hesitate to recommend 30 partitions outright because you describe some queries that could seek a particular @userId's records in each partition, and 31 partitions may create performance problems under heavy load. Test and measure both.
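The sliding-window deletion could look roughly like this. The partition function, scheme, staging table, and boundary dates are all illustrative; note that the staging table must have an identical structure to the source table and live on the same filegroup for the switch to work:

```sql
-- Switch the oldest partition out to an empty staging table (instant,
-- metadata-only), then discard the switched-out rows.
ALTER TABLE dbo.DiaryEntries
    SWITCH PARTITION 1 TO dbo.DiaryEntries_Staging;
TRUNCATE TABLE dbo.DiaryEntries_Staging;

-- Remove the now-empty oldest boundary and open a new one at the front.
ALTER PARTITION FUNCTION pf_Weekly() MERGE RANGE ('20090101');
ALTER PARTITION SCHEME ps_Weekly NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_Weekly() SPLIT RANGE ('20090301');
```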

Solution 2:

First, add a default constraint on your table. Second, add a partitioning scheme. Third, rewrite your most common query.

The clustered index should be set to (setOn, userId). This removes the possibility of the index becoming fragmented. You should use table partitioning to split the table so that each month is stored in a separate file; this will reduce maintenance. You can look online for a partition sliding-window script to run every month that creates a new table for the next month, deletes the oldest month, and adjusts the partitioning scheme. You can also move the really old months to an archive table if storage is no concern to you.

Your queries' WHERE clauses should be in the form:

WHERE setOn > @setOn AND userId = @userId

Or when you are returning the whole month:

WHERE setOn BETWEEN @setOnBegin AND @setOnEnd AND userId = @userId
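Put together, the whole-month query might look like this (the SELECT list assumes the DiaryEntries columns shown in the schema below):

```sql
-- Leading on setOn matches the (setOn, userId) clustered index,
-- so this resolves as a range scan on the clustered key.
SELECT setOn, entry
FROM dbo.DiaryEntries
WHERE setOn BETWEEN @setOnBegin AND @setOnEnd
  AND userId = @userId;
```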

Your new schema design, without the partitioning would look like this:

-- Stub table for foreign key
CREATE TABLE Users
(
 [userId] [uniqueidentifier] NOT NULL
  CONSTRAINT PK_Users PRIMARY KEY NONCLUSTERED
  CONSTRAINT DF_Users_userId DEFAULT NEWID(),
 [userName] VARCHAR(50) NOT NULL
)
GO

CREATE TABLE DiaryEntries
(
 [userId] [uniqueidentifier] NOT NULL
  CONSTRAINT FK_DiaryEntries_Users FOREIGN KEY REFERENCES Users,
 [setOn] [datetime] NOT NULL
  CONSTRAINT DF_DiaryEntries_setOn DEFAULT GETDATE(),
 [entry] [nvarchar](255) NULL,
 CONSTRAINT PK_DiaryEntries PRIMARY KEY CLUSTERED (setOn, userId)
)
GO

After you get that working, you have to add partitioning. For that, start with this blog post for some theory, then read this MSDN whitepaper. The whitepaper is written for SQL Server 2005, and there were partitioning improvements in 2008 that I have not investigated, so the solution might be simpler in 2008.
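As a rough sketch of what the monthly partitioning could look like on top of the schema above (boundary dates, object names, and the single-filegroup placement are all illustrative):

```sql
-- One partition per month; RANGE RIGHT means each boundary date
-- belongs to the partition on its right (the new month).
CREATE PARTITION FUNCTION pf_Monthly (datetime)
    AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');

CREATE PARTITION SCHEME ps_Monthly
    AS PARTITION pf_Monthly ALL TO ([PRIMARY]);

-- The clustered primary key would then be created on the scheme:
--   CONSTRAINT PK_DiaryEntries PRIMARY KEY CLUSTERED (setOn, userId)
--       ON ps_Monthly (setOn)
```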

Solution 3:

I'm not here to critique your solution, nor am I in a position to do so, since I don't know enough about the problem. Here is my feedback:

  • If the only thing you're not happy about is the row size using too much disk space, check out sparse columns. That way all the NULLs don't take up so much space!
  • Having the foreign keys is going to slow down your inserts considerably. Have you tested this?
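The sparse-column idea above is just a column attribute at table-creation time (SQL Server 2008+). A minimal sketch, reusing the column names from the schema in Solution 2 (the table name here is hypothetical):

```sql
-- SPARSE columns use no storage when NULL, at a small extra cost
-- for non-NULL values, so they pay off only on mostly-NULL columns.
CREATE TABLE dbo.DiaryEntriesSparse
(
 [userId] [uniqueidentifier] NOT NULL,
 [setOn]  [datetime] NOT NULL,
 [entry]  [nvarchar](255) SPARSE NULL
)
```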