Natural (human alpha-numeric) sort in Microsoft SQL 2005
We have a large database on which we have DB side pagination. This is quick, returning a page of 50 rows from millions of records in a small fraction of a second.
Users can define their own sort, basically choosing what column to sort by. Columns are dynamic - some have numeric values, some dates and some text.
While most sort as expected, text sorts in a dumb way. Well, I say dumb: it makes sense to computers, but it frustrates users.
For instance, sorting by a string record id gives something like:
rec1
rec10
rec14
rec2
rec20
rec3
rec4
...and so on.
I want this to take account of the number, so:
rec1
rec2
rec3
rec4
rec10
rec14
rec20
I can't control the input (otherwise I'd just format in leading 000s) and I can't rely on a single format - some are things like "{alpha code}-{dept code}-{rec id}".
I know a few ways to do this in C#, but I can't pull down all the records to sort them, as that would be too slow.
Does anyone know a way to quickly apply a natural sort in SQL Server?
We're using:
ROW_NUMBER() over (order by {field name} asc)
And then we're paging by that.
We can add triggers, although we wouldn't. All their input is parametrised and the like, but I can't change the format - if they put in "rec2" and "rec10" they expect them to be returned just like that, and in natural order.
We have valid user input that follows different formats for different clients.
One might go rec1, rec2, rec3, ... rec100, rec101
While another might go: grp1rec1, grp1rec2, ... grp20rec300, grp20rec301
When I say we can't control the input I mean that we can't force users to change these standards - they have a value like grp1rec1 and I can't reformat it as grp01rec001, as that would be changing something used for lookups and linking to external systems.
These formats vary a lot, but are often mixtures of letters and numbers.
Sorting these in C# is easy - just break it up into { "grp", 20, "rec", 301 }
and then compare sequence values in turn.
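To make the splitting idea concrete, here is a minimal sketch in Python rather than C#, purely as an illustration of the comparison logic. The function name natural_key and the tagging scheme are assumptions made for this example, not something from the original code.

```python
import re

def natural_key(value):
    """Split a string into text and integer chunks, e.g. grp20rec301 -> grp, 20, rec, 301."""
    # re.split with a capturing group keeps the digit runs; they land at odd indexes.
    parts = re.split(r"([0-9]+)", value)
    key = []
    for index, part in enumerate(parts):
        if not part:
            continue  # skip empty chunks at the start or end
        if index % 2:
            key.append((0, int(part)))  # numeric chunk, compared as a number
        else:
            key.append((1, part))       # text chunk, compared as a string
    return key

records = ["grp20rec301", "grp1rec10", "grp1rec2", "grp20rec300", "grp1rec1"]
print(sorted(records, key=natural_key))
```

The 0/1 tags keep a number from ever being compared directly with a string, and make numbers sort before text at the same position.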
However, there may be millions of records and the data is paged, so I need the sort to be done on the SQL server.
SQL Server sorts by value, not by comparison - in C# I can split the values out to compare, but in SQL I need some logic that (very quickly) gets a single value that consistently sorts.
@moebius - your answer might work, but it does feel like an ugly compromise to add a sort-key for all these text values.
order by LEN(value), value
Not perfect, but works well in a lot of cases.
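As a rough illustration of where this works and where it does not, here is the same ordering sketched in Python. The helper name length_then_value is made up for the example, and Python's len and string comparison only approximate LEN and the collation rules in SQL Server.

```python
def length_then_value(value):
    # Mirrors ORDER BY LEN(value), value: shorter strings first,
    # ties broken by plain string comparison.
    return (len(value), value)

same_prefix = ["rec1", "rec10", "rec14", "rec2", "rec20", "rec3", "rec4"]
print(sorted(same_prefix, key=length_then_value))

# With differing prefixes, or more than one number, the result is no longer natural.
mixed = ["grp1rec10", "grp10rec1", "b2", "a10"]
print(sorted(mixed, key=length_then_value))
```

The first list comes out in natural order because every value shares the same prefix. In the second list, b2 sorts ahead of a10 and grp10rec1 ahead of grp1rec10, which is the kind of case the "not perfect" caveat covers.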
Most of the SQL-based solutions I have seen break when the data gets complex enough (e.g. more than one or two numbers in it). Initially I tried implementing a NaturalSort function in T-SQL that met my requirements (among other things, handles an arbitrary number of numbers within the string), but the performance was way too slow.
Ultimately, I wrote a scalar CLR function in C# to allow for a natural sort, and even with unoptimized code the performance calling it from SQL Server is blindingly fast. It has the following characteristics:
- will sort the first 1,000 characters or so correctly (easily modified in code or made into a parameter)
- properly sorts decimals, so 123.333 comes before 123.45
- because of above, will likely NOT sort things like IP addresses correctly; if you wish different behaviour, modify the code
- supports sorting a string with an arbitrary number of numbers within it
- will correctly sort numbers up to 25 digits long (easily modified in code or made into a parameter)
The code is here:
using System;
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public class UDF
{
[SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic=true)]
public static SqlString Naturalize(string val)
{
if (String.IsNullOrEmpty(val))
return val;
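// Collapse runs of spaces into a single space. The search string must be
// two spaces: replacing one space with one space would loop forever.
while(val.Contains("  "))
val = val.Replace("  ", " ");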
const int maxLength = 1000;
const int padLength = 25;
bool inNumber = false;
bool isDecimal = false;
int numStart = 0;
int numLength = 0;
int length = val.Length < maxLength ? val.Length : maxLength;
//TODO: optimize this so that we exit for loop once sb.ToString() >= maxLength
var sb = new StringBuilder();
for (var i = 0; i < length; i++)
{
int charCode = (int)val[i];
if (charCode >= 48 && charCode <= 57)
{
if (!inNumber)
{
numStart = i;
numLength = 1;
inNumber = true;
continue;
}
numLength++;
continue;
}
if (inNumber)
{
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
inNumber = false;
}
isDecimal = (charCode == 46);
sb.Append(val[i]);
}
if (inNumber)
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
var ret = sb.ToString();
if (ret.Length > maxLength)
return ret.Substring(0, maxLength);
return ret;
}
static string PadNumber(string num, bool isDecimal, int padLength)
{
return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
}
}
To register this so that you can call it from SQL Server, run the following commands in Query Analyzer:
CREATE ASSEMBLY SqlServerClr FROM 'SqlServerClr.dll' --put the full path to DLL here
go
CREATE FUNCTION Naturalize(@val as nvarchar(max)) RETURNS nvarchar(1000)
EXTERNAL NAME SqlServerClr.UDF.Naturalize
go
Then, you can use it like so:
select *
from MyTable
order by dbo.Naturalize(MyTextField)
Note: If you get an error in SQL Server along the lines of Execution of user code in the .NET Framework is disabled. Enable "clr enabled" configuration option., follow the instructions here to enable it. Make sure you consider the security implications before doing so. If you are not the db admin, make sure you discuss this with your admin before making any changes to the server configuration.
Note2: This code does not properly support internationalization (e.g., it assumes the decimal marker is "."), is not optimized for speed, etc. Suggestions on improving it are welcome!
Edit: Renamed the function to Naturalize instead of NaturalSort, since it does not do any actual sorting.
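For readers who want to try the padding idea without setting up CLR, here is a minimal Python sketch of the same key-building logic. It is an approximation, not a port: it skips the space collapsing and the 1,000-character cap, and the names naturalize and PAD_LENGTH are chosen for this example.

```python
import re

PAD_LENGTH = 25  # same width as padLength in the C# code

def naturalize(value, pad_length=PAD_LENGTH):
    """Build a sort key by zero-padding every digit run in value."""
    def pad(match):
        digits = match.group(0)
        start = match.start()
        # A digit run directly after "." is treated as a decimal fraction
        # and padded on the right; every other run is padded on the left.
        if start > 0 and value[start - 1] == ".":
            return digits.ljust(pad_length, "0")
        return digits.rjust(pad_length, "0")
    return re.sub(r"[0-9]+", pad, value)

print(sorted(["rec10", "rec2", "rec1"], key=naturalize))
print(sorted(["123.45", "123.333"], key=naturalize))
print(sorted(["10.0.0.9", "10.0.0.10"], key=naturalize))
```

The last line shows the IP address caveat from the list above: because each octet after a dot is padded as a fraction, 10.0.0.10 sorts before 10.0.0.9.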
I know this is an old question, but I just came across it and it has no accepted answer.
I have always used ways similar to this:
SELECT [Column] FROM [Table]
ORDER BY RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))), 1000)
This mainly runs into trouble if your column won't cast to VARCHAR(MAX), or if LEN([Column]) > 1000 (you can change that 1000 to something else if you want), but you can use this rough idea for what you need.
Also, this performs much worse than a normal ORDER BY [Column], but it does give you the result asked for in the OP.
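To see what this key does, here is a small Python sketch of the same expression. The function name left_padded is invented for the example, and Python compares by code point rather than by a SQL collation, so treat the output as indicative only.

```python
WIDTH = 1000

def left_padded(value, width=WIDTH):
    # Mirrors RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(value)), 1000):
    # trim spaces, prepend zeros, keep the rightmost `width` characters.
    return ("0" * width + value.strip(" "))[-width:]

simple = ["rec1", "rec10", "rec14", "rec2", "rec20", "rec3", "rec4"]
print(sorted(simple, key=left_padded))

# Two values of equal length are compared as plain strings,
# so multi-number ids can still come out in non-natural order.
print(sorted(["grp1rec10", "grp10rec1"], key=left_padded))
```

Because the whole string is padded rather than each number, this behaves much like sorting by length and then by value: it handles text followed by a single number, but not ids such as grp1rec10 and grp10rec1.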
Edit: To clarify further, the above will not work for decimal values such as 1, 1.15 and 1.5 (they will sort as {1, 1.5, 1.15}), since that is not what the OP asked for, but it can easily be handled with:
SELECT [Column] FROM [Table]
ORDER BY REPLACE(RIGHT(REPLICATE('0', 1000) + LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX)))) + REPLICATE('0', 100 - CHARINDEX('.', REVERSE(LTRIM(RTRIM(CAST([Column] AS VARCHAR(MAX))))), 1)), 1000), '.', '0')
Result: {1, 1.15, 1.5}
And still all entirely within SQL. This will not sort IP addresses because you're now getting into very specific number combinations as opposed to simple text + number.
Here's a solution written for SQL 2000. It can probably be improved for newer SQL versions.
/**
* Returns a string formatted for natural sorting. This function is very useful when having to sort alpha-numeric strings.
*
* @author Alexandre Potvin Latreille (plalx)
* @param {nvarchar(4000)} string The string to format.
* @param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
* @param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
*
* @return {nvarchar(4000)} A string for natural sorting.
* Example of use:
*
* SELECT Name FROM TableA ORDER BY Name
* TableA (unordered)       TableA (ordered)
* ------------             ------------
* ID  Name                 ID  Name
* 1.  A1.                  1.  A1-1.
* 2.  A1-1.                2.  A1.
* 3.  R1          -->      3.  R1
* 4.  R11                  4.  R11
* 5.  R2                   5.  R2
*
*
* As we can see, humans would expect A1., A1-1., R1, R2, R11 but that's not how SQL is sorting it.
* We can use this function to fix this.
*
* SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
* TableA (unordered)       TableA (ordered)
* ------------             ------------
* ID  Name                 ID  Name
* 1.  A1.                  1.  A1.
* 2.  A1-1.                2.  A1-1.
* 3.  R1          -->      3.  R1
* 4.  R11                  4.  R2
* 5.  R2                   5.  R11
*/
ALTER FUNCTION [dbo].[udf_NaturalSortFormat](
@string nvarchar(4000),
@numberLength int = 10,
@sameOrderChars char(50) = ''
)
RETURNS varchar(4000)
AS
BEGIN
DECLARE @sortString varchar(4000),
@numStartIndex int,
@numEndIndex int,
@padLength int,
@totalPadLength int,
@i int,
@sameOrderCharsLen int;
SELECT
@totalPadLength = 0,
@string = RTRIM(LTRIM(@string)),
@sortString = @string,
@numStartIndex = PATINDEX('%[0-9]%', @string),
@numEndIndex = 0,
@i = 1,
@sameOrderCharsLen = LEN(@sameOrderChars);
-- Replace all char that have the same order by a space.
WHILE (@i <= @sameOrderCharsLen)
BEGIN
SET @sortString = REPLACE(@sortString, SUBSTRING(@sameOrderChars, @i, 1), ' ');
SET @i = @i + 1;
END
-- Pad numbers with zeros.
WHILE (@numStartIndex <> 0)
BEGIN
SET @numStartIndex = @numStartIndex + @numEndIndex;
SET @numEndIndex = @numStartIndex;
WHILE(PATINDEX('[0-9]', SUBSTRING(@string, @numEndIndex, 1)) = 1)
BEGIN
SET @numEndIndex = @numEndIndex + 1;
END
SET @numEndIndex = @numEndIndex - 1;
SET @padLength = @numberLength - (@numEndIndex + 1 - @numStartIndex);
IF @padLength < 0
BEGIN
SET @padLength = 0;
END
SET @sortString = STUFF(
@sortString,
@numStartIndex + @totalPadLength,
0,
REPLICATE('0', @padLength)
);
SET @totalPadLength = @totalPadLength + @padLength;
SET @numStartIndex = PATINDEX('%[0-9]%', RIGHT(@string, LEN(@string) - @numEndIndex));
END
RETURN @sortString;
END
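The function above can be hard to follow because of the index bookkeeping, so here is a short Python sketch of what it appears to compute. It assumes sameOrderChars contains no digits, and the name natural_sort_format is a hypothetical stand-in for udf_NaturalSortFormat.

```python
import re

def natural_sort_format(value, number_length=10, same_order_chars=""):
    """Approximate Python equivalent of the padding logic in the T-SQL function."""
    text = value.strip(" ")
    # Characters that should share one sort position all become a space.
    for ch in same_order_chars:
        text = text.replace(ch, " ")
    # Left-pad every digit run with zeros up to number_length;
    # runs that are already longer are left unchanged.
    return re.sub(r"[0-9]+", lambda m: m.group(0).rjust(number_length, "0"), text)

names = ["A1.", "A1-1.", "R1", "R11", "R2"]
print(sorted(names))
print(sorted(names, key=lambda n: natural_sort_format(n, 10, ".-")))
```

With the example data from the header comment, the plain sort puts A1-1. before A1. and R11 before R2, while the keyed sort gives A1., A1-1., R1, R2, R11.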