Index for multiple columns in ActiveRecord

In ActiveRecord there are two ways to declare indexes for multiple columns:

add_index :classifications, [:species, :family, :trivial_names]

add_index :classifications, :species
add_index :classifications, :family
add_index :classifications, :trivial_names

Is there any difference between the first approach and the second one? If so, when should I use the first and when the second?

You are comparing a composite index with a set of independent indices. They are just different.

Think of it this way: a compound index gives you rapid look-up of the first field in a nested set of fields followed by rapid look-up of the second field within ONLY the records already selected by the first field, followed by rapid look-up of the third field - again, only within the records selected by the previous two indices.

Lets take an example. Your database engine will take no more than 20 steps to locate a unique value within 1,000,000 records (if memory serves) if you are using an index. This is true whether you are using a composite or and independent index - but ONLY for the first field ("species" in your example although I'd think you'd want Family, Species, and then Common Name).

Now, let's say that there are 100,000 matching records for this first field value. If you have only single indices, then any lookup within these records will take 100,000 steps: one for each record retrieved by the first index. This is because the second index will not be used (in most databases - this is a bit of a simplification) and a brute force match must be used.

If you have a composite index then your search is much faster because your second field search will have an index within the first set of values. In this case you'll need no more than 17 steps to get to your first matching value on field 2 within the 100,000 matches on field 1 (log base 2 of 100,000).

So: steps needed to find a unique record out of a database of 1,000,000 records using a composite index on 3 nested fields where the first retrieves 100,000 and the second retrieves 10,000 = 20 + 17 + 14 = 51 steps.

Steps needed under the same conditions with just independent indices = 20 + 100,000 + 10,000 = 110,020 steps.

Big difference, eh?

Now, don't go nuts putting composite indices everywhere. First, they are expensive on inserts and updates. Second, they are only brought to bear if you are truly searching across nested data (for another example, I use them when pulling data for logins for a client over a given date range). Also, they are not worth it if you are working with relatively small data sets.

Finally, check your database documentation. Databases have grown extremely sophisticated in the ability to deploy indices these days and the Database 101 scenario I described above may not hold for some (although I always develop as if it does just so I know what I am getting).

The two approaches are different. The first creates a single index on three attributes, the second creates three single-attribute indices. Storage requirements will be different, although without distributions it's not possible to say which would be larger.

Indexing three columns [A, B, C] works well when you need to access for values of A, A+B and A+B+C. It won't be any good if your query (or find conditions or whatever) doesn't reference A.

When A, B and C are indexed separately, some DBMS query optimizers will consider combining two or more indices (subject to the optimizer's estimate of efficiency) to give a similar result to a single multi-column index.

Suppose you have some e-commerce system. You want to query orders by purchase_date, customer_id and sometimes both. I'd start by creating two indices: one for each attribute.

On the other hand, if you always specify purchase_date and customer_id, then a single index on both columns would probably be most efficient. The order is significant: if you also wanted to query orders for all dates for a customer, then make the customer_id the first column in the index.

From the docs:

When creating an index on multiple columns, the first column is used as a name for the index. For example, when you specify an index on two columns [:first, :last], the DBMS creates an index for both columns as well as an index for the first column :first. Using just the first name for this index makes sense, because you will never have to create a singular index with this name.

Use the first method when creating a compound index, and the second when creating indexes on single attributes.

There are some good points here on when to use compound indexes, but the gist is that they are good when utilizing a where on multiple attributes. Note that they should be used alongside other indexes (always index your foriegn keys) - not as a replacement.

Index for multiple columns in ActiveRecord

Related

Recent Posts