What is the purpose of PAD_INDEX in this SQL Server constraint?

I have the following constraint being applied to one of my tables, but I don't know what PAD_INDEX means.

Can someone enlighten me?

CONSTRAINT [PK_Employees] PRIMARY KEY CLUSTERED 
(
    [EmployeeId] ASC
) WITH (PAD_INDEX  = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
        ^--------------^
         this part here

Solution 1:

Basically, you set PAD_INDEX = ON if you expect a lot of random changes to the index regularly.

That helps avoiding index page splits.

I set it on when I expect 30%+ of random records included in the index to be deleted on a regular basis.

Solution 2:

An index in SQL Server is a B-Tree

  • FILLFACTOR applies to the bottom layer
    This is the leaf node/data layer in the picture below

  • PAD_INDEX ON means "Apply FILLFACTOR to all layers"
    This is the intermediate levels in the picture below (between root and data)

This means that PAD_INDEX is only useful if FILLFACTOR is set. FILLFACTOR determines how much free space in an data page (roughly)

A picture from MSDN:

B-Tree structure

Solution 3:

From MSDN:

PAD_INDEX = { ON | OFF }

Specifies index padding. The default is OFF.

ON: The percentage of free space that is specified by fillfactor is applied to the intermediate-level pages of the index.

OFF or fillfactor is not specified: The intermediate-level pages are filled to near capacity, leaving sufficient space for at least one row of the maximum size the index can have, considering the set of keys on the intermediate pages.

The PAD_INDEX option is useful only when FILLFACTOR is specified, because PAD_INDEX uses the percentage specified by FILLFACTOR. If the percentage specified for FILLFACTOR is not large enough to allow for one row, the Database Engine internally overrides the percentage to allow for the minimum. The number of rows on an intermediate index page is never less than two, regardless of how low the value of fillfactor.

In backward compatible syntax, WITH PAD_INDEX is equivalent to WITH PAD_INDEX = ON.

Solution 4:

This is actually a highly complex subject. Turning on PAD_INDEX can have dramatic effects on read performance and memory pressure in large tables. The larger the table the bigger the effect. As a rule I'd say you want to leave it off unless you fall into some NOT UNCOMMON categories. Then, follow this advice carefully. As I show in the example case below, adjusting FILLFACTOR when PAD_INDEX is ON can have an exponential effect that needs to be carefully balanced.

  1. PAD_INDEX ALWAYS has a detrimental effect on reads! The lower your FILLFACTOR the bigger the effect so you need to pay close attention to the value of FILLFACTOR when you turn it on. On large tables you essentially stop thinking about FILLFACTOR in terms of reducing leaf splits and start thinking about its effect on intermediate bloat vs intermediate splits.
  2. PAD_INDEX rarely has a useful effect on indexes with less than 100,000 rows and NEVER has a positive effect on indexes covering identity or insert-time type columns were inserts are always to the end of the table.
  3. From the above you should see that if you turn PAD_INDEX on you must carefully balance the negative effects with the positive.

Rules of thumb: PAD_INDEX is rarely useful on:

  1. Non-clustered indexes - unless they are quite wide.
  2. On clustered indexes of very narrow tables.
  3. On tables that have less than 100K rows - unless inserts are highly clustered and even then it can be questionable.

You MUST understand how it works: When you insert into an index the row must fit into the the leaf block that contains the appropriate range of keys. Clustered indexes typically have much wider rows than non-clustered indexes and so their leaf blocks hold fewer rows. FillFactor creates space for new rows in the leaf but in the case of very wide rows or a large volume of inserts that are clustered together rather than evenly distributed it's often impractical or impossible to create enough slack (1-pct fill) to prevent splits.

When a split occurs a new intermediate row is created to point to the new block and that row must fit into its appropriate block. If that intermediate block is full it must first be split. Splits can run all the way down to the root if you are particularly unlucky. When the root splits you end up creating a new index level.

The point of PAD_INDEX is to force a minimum amount of free space in your intermediate level blocks.

After a rebuild there may be little or no space at the lower levels. So you can have massive splitting of your intermediates all over the place if you have lots of leaf splits and PAD_INDEX isn't turned on!

Mostly though, splits can be managed with FILLFACTOR. The bigger split problems happen with insert patterns that virtually guarantee you won't have enough free space and turning PAD_INDEX on then helps alleviate this by providing space at deeper levels so when a split does occur you are less likely to incur lots of multilevel splits.

Example Case

I have a customer table with 100K rows. On any given day about 5% of my customers will be active. I have a table that records activity by customer by time. On average a customer performs 20 actions and the description takes, on average, 1K. So I collect 100MB of data and lets assume I've got a year already in the table - so 36GB.

The table has inserts of 1Kb rows with customer_number and insert_time (in that order) for key columns. Clearly the average customer will split an 8K leaf block several times while inserting their expected 20 rows because each row will insert immediately after the preceding row in the same block until it splits and splits and splits (makes one consider a heap with only non-clustered indexes...). If the intermediate block pointing to the appropriate leaf doesn't have enough room for at least 4 rows (in reality probably 8 but...) the intermediate will need to split. Given this example's key takes up 22 bytes, an intermediate block can hold 367 entries. This means I need 6% free space in my intermediate block or a fill of 94% to hold the 4 entries.

Notice that even a 1% FILLFACTOR won't stop leaf block splits since a block can only hold 8 rows. Setting FILLFACTOR TO 80% will only allow 1 row to be added before the leaf splits but will inject over 800 bytes of free space per intermediate block if PAD_INDEX is on. That's ~800 empty bytes for EVERY intermediate block when I only need 88.

This is really important!: So if I have 36M rows already in the table, using 80% means 294 rows per intermediate block, meaning 122K blocks, meaning I've injected 98MB into my intermediate block structure when 94% lets 345 rows fit per block so there are only 104K intermediate blocks (yes I'm leaving out the lower levels for simplicity). Adding 88 bytes to each of 104K blocks adds only 9.2MB as opposed to 98MB.

Now consider that only 5% of my customers did anything. Some did more than 20 things and some less so some blocks split anyway and since only 275KB were actually needed to hold the day's index rows (100k/8*22), the best case is that only 8.9MB of my 9.2MB were dead air. If split prevention is important it's well worth 9mb however I'd be thinking harder about 98mb.

So by turning PAD_INDEX on I should be giving up on controlling leaf splits entirely and turning to controlling intermediate splits.

DON'T bother worrying about anything but the first intermediate level! There is a butterfly effect induced by any clustering (in this case clustering of customer_number) that will throw any calculation you make out the window. Unless your inserts are perfectly uniform your margin of error in finding the right number to balance bloat with splits is typically far bigger than the effect of the lower level block space.