Fastest way to update 120 Million records
Solution 1:
The only sane way to update a table of 120M records is with a SELECT
statement that populates a second table. You have to take care when doing this. Instructions below.
Simple Case
For a table w/out a clustered index, during a time w/out concurrent DML:
SELECT *, new_col = 1 INTO clone.BaseTable FROM dbo.BaseTable
- recreate indexes, constraints, etc on new table
- switch old and new w/ ALTER SCHEMA ... TRANSFER.
- drop old table
If you can't create a clone schema, a different table name in the same schema will do. Remember to rename all your constraints and triggers (if applicable) after the switch.
Non-simple Case
First, recreate your BaseTable
with the same name under a different schema, eg clone.BaseTable
. Using a separate schema will simplify the rename process later.
- Include the clustered index, if applicable. Remember that primary keys and unique constraints may be clustered, but not necessarily so.
- Include identity columns and computed columns, if applicable.
- Include your new INT column, wherever it belongs.
-
Do not include any of the following:
- triggers
- foreign key constraints
- non-clustered indexes/primary keys/unique constraints
- check constraints or default constraints. Defaults don't make much of difference, but we're trying to keep things minimal.
Then, test your insert w/ 1000 rows:
-- assuming an IDENTITY column in BaseTable
SET IDENTITY_INSERT clone.BaseTable ON
GO
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT TOP 1000 Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF
Examine the results. If everything appears in order:
- truncate the clone table
- make sure the database in in bulk-logged or simple recovery model
- perform the full insert.
This will take a while, but not nearly as long as an update. Once it completes, check the data in the clone table to make sure it everything is correct.
Then, recreate all non-clustered primary keys/unique constraints/indexes and foreign key constraints (in that order). Recreate default and check constraints, if applicable. Recreate all triggers. Recreate each constraint, index or trigger in a separate batch. eg:
ALTER TABLE clone.BaseTable ADD CONSTRAINT UQ_BaseTable UNIQUE (Col2)
GO
-- next constraint/index/trigger definition here
Finally, move dbo.BaseTable
to a backup schema and clone.BaseTable
to the dbo schema (or wherever your table is supposed to live).
-- -- perform first true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
-- GO
-- -- create a backup schema, if necessary
-- CREATE SCHEMA backup_20100914
-- GO
BEGIN TRY
BEGIN TRANSACTION
ALTER SCHEMA backup_20100914 TRANSFER dbo.BaseTable
-- -- perform second true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
ALTER SCHEMA dbo TRANSFER clone.BaseTable
COMMIT TRANSACTION
END TRY
BEGIN CATCH
SELECT ERROR_MESSAGE() -- add more info here if necessary
ROLLBACK TRANSACTION
END CATCH
GO
If you need to free-up disk space, you may drop your original table at this time, though it may be prudent to keep it around a while longer.
Needless to say, this is ideally an offline operation. If you have people modifying data while you perform this operation, you will have to perform a true-up operation with the schema switch. I recommend creating a trigger on dbo.BaseTable
to log all DML to a separate table. Enable this trigger before you start the insert. Then in the same transaction that you perform the schema transfer, use the log table to perform a true-up. Test this first on a subset of the data! Deltas are easy to screw up.
Solution 2:
If you have the disk space, you could use SELECT INTO and create a new table. It's minimally logged, so it would go much faster
select t.*, int_field = CAST(-1 as int)
into mytable_new
from mytable t
-- create your indexes and constraints
GO
exec sp_rename mytable, mytable_old
exec sp_rename mytable_new, mytable
drop table mytable_old
Solution 3:
I break the task up into smaller units. Test with different batch size intervals for your table, until you find an interval that performs optimally. Here is a sample that I have used in the past.
declare @counter int
declare @numOfRecords int
declare @batchsize int
set @numOfRecords = (SELECT COUNT(*) AS NumberOfRecords FROM <TABLE> with(nolock))
set @counter = 0
set @batchsize = 2500
set rowcount @batchsize
while @counter < (@numOfRecords/@batchsize) +1
begin
set @counter = @counter + 1
Update table set int_field = -1 where int_field <> -1;
end
set rowcount 0
Solution 4:
If your int_field is indexed, remove the index before running the update. Then create your index again...
5 hours seem like a lot for 120 million recs.