Natural sorting of strings
Sorts strings that contain numbers, not character-wise (lexical) but regarding the numeric value: “a1” < “a2” < “a10” < “a11” < “a20”
Characterwise order: |
Natural order: |
---|---|
Picture1.jpg | Picture1.jpg |
Picture10.jpg | Picture2.jpg |
Picture2.jpg | Picture5.jpg |
Picture251.jpg | Picture10.jpg |
Picture26.jpg | Picture26.jpg |
Picture5.jpg | Picture251.jpg |
String sorting algorithms used in computers usually compare the strings character by character and decide upon the first difference whether one string comes before or after another. This works well for words and names, but not for numbers (see table at the right side). Therefore one must consider the entire number and compare by its numeric value. While numbers are basically also sorted from left to right, digit by digit, this is only true if both numbers have the same digit count. For shorter numbers you need to “imagine the leading zeros”. Characterwise sorting puts “ab” before “c”, but also “12” before “3” which seems wrong to us.
Put it simple, natural sorting produces the following order:
a < a0 < a01 < a1 < a1a < a1b < a2 < a10 < a11 < a20 < a100 < b
Strings without digits are sorted as usual. Further features of my implementation are:
What this function cannot do:
Compatibility:
Usage notes
Sorting data can be reduced to pair comparison of items in the sorting algorithm. So this is not a sorting algorithm (like Bubble Sort or Quicksort), but merely a comparison function. It can also be used separately, where ever two strings need to be sorted. In order to sort entire lists, you can for example use the following code:
myList.Add("abc");
myList.Add("def");
// …
myList.Sort(NatCompare);
The source code contains preprocessor statements (“#define …
”) at the beginning of the file that can be used to configure the function processing. You can for instance deactivate the support for local special characters or negative numbers. These code sections are then no longer compiled in and don’t use any space or runtime. These statements must always be at the beginning of the file, also in front of the using
and namespace
specifications. Since these are only symbol definitions, they can also be defined in the project configuration (Visual Studio: Project properties; Build; Symbols for conditional compilation).
Download
NaturalSort.cs8.0 KiBSource code of the NatCompare function
Performance
Since this function does a lot more, it takes more time than a simple character-wise comparision on the same data. To test the comparison speed, I have run time measurements with a non-representative list of about 21 100 file names from my Windows directory, using the List<string>.Sort(Comparison<string>)
method.
Option | Compared to String.Compare | Compared to NATSORT_CMP_ CHAR | Compared to NATSORT_CMP_ STRING_NOCASE |
---|---|---|---|
Character comparison modes: | |||
NATSORT_CMP_CHAR Simple character comparison by code table |
−33 % | ||
NATSORT_CMP_CHAR_NOCASE Case-insensitive character comparison |
+150 % | +280 % | |
NATSORT_CMP_STRING Regarding local special characters |
+550 % | +875 % | |
NATSORT_CMP_STRING_NOCASE Regarding local special characters, case-insensitive |
+567 % | +900 % | |
Additional options: | |||
+ NATSORT_WITH_SPECIAL Skips certain special characters |
+130 % | +15 % | |
+ NATSORT_WITH_NEGATIVE Supports negative numbers (did not occur in my test data) |
+8 % | +3 % |
All in all you can say that using the function for natural sorting with enabled support for local special characters, negative numbers, case sensitive mode and ignoring certain special characters leads to a pure comparison work of
Changes
References
-
Numeric String Sort in C# (The Code Project)
I found some inspiration for my implementation in this article but eventually did not use its code as a template. In fact I haven’t even tried to understand it and instead started with my own thoughts right away. My function is now better configurable, natively supports negative numbers and local special characters, consists of only a single function and is about as fast (according to my measurement). Somewhere down in the site’s visitor comments there’s a patch to support local special characters, which I have used in my benchmark.
Licence and terms of use
Copying and distribution of this file, with or without modification, are permitted provided the copyright notice and this notice are preserved. This file is offered as-is, without any warranty. (GNU All-Permissive licence)
Statistic data
- Created on 2007-07-08, updated on 2009-02-12.
- Ca. 100 lines of code, estimated development costs: 100 - 400 €