Reading through the comp.lang.python
mailing list, I saw an interesting question concerning the behavior of the default sorting algorithm in python. This led to this post.
Python uses timsort, a clever hybrid sorting algorithm with ideas borrowing from merge sort and (binary) insertion sort. A major idea in timsort is to use the structure of naturally occuring runs (consecutive elements in the list that are either monotone increasing or monotone decreasing) when sorting.
Let's look at the following simple list.
10, 20, 5
A simple sorting algorithm is insertion sort, which just advances through the list and inserts each number into the correct spot. More explicitly, insertion sort would
- Start with the first element,
10
. As a list with one element, it is correctly sorted tautologically. - Now consider the second element,
20
. We insert this into the correct position in the already-sorted list of previous elements. Here, that just means that we verify that20 > 10
, and now we have the sorted sublist consisting of10, 20
. - Now consider the third element,
5
. We want to insert this into the correct position in the already-sorted list of previous elements. A naively easy way to do this is to either scan the list from the right or from the left, and insert into the correct place. For example, scanning from the right would mean that we compare5
to the last element of the sublist,20
. As5 < 20
, we shift left and compare5
to10
. As5 < 10
, we shift left again. As there is nothing left to compare against, we insert5
at the beginning, yielding the sorted list5, 10, 20
.
How many comparisons did this take? This took 20 > 10
, 5 < 20
, and 5 < 10
. This is three comparisons in total.
We can see this programmatically as well. Here is one implementation of insertion_sort, as described above.
def insertion_sort(lst):
'''
Sorts `lst` in-place. Note that this changes `lst`.
'''
for index in range(1, len(lst)):
current_value = lst[index]
position = index
while position > 0 and lst[position - 1] > current_value:
lst[position] = lst[position - 1]
position = position - 1
lst[position] = current_value
Let's also create a simple Number
class, which is just like a regular number, except that anytime a comparison is done it prints out the comparison. This will count the number of comparisons made for us.
class Number:
def __init__(self, value):
self.value = value
def __str__(self):
return str(self.value)
def __repr__(self):
return self.__str__()
def __lt__(self, other):
if self.value < other.value:
print("{} < {}".format(self, other))
return True
print("{} >= {}".format(self, other))
return False
def __eq__(self, other):
if self.value == other.value:
print("{} = {}".format(self, other))
return True
return False
def __gt__(self, other):
return not ((self == other) or (self < other))
def __le__(self, other):
return (self < other) or (self == other)
def __ge__(self, other):
return not (self < other)
def __nq__(self, other):
return not (self == other)
With this class and function, we can run
lst = [Number(10), Number(20), Number(5)]
insertion_sort(lst)
print(lst)
which will print
10 < 20
20 >= 5
10 >= 5
[5, 10, 20]
These are the three comparisons we were expecting to see.
Returning to python's timsort — what happens if we call python's default sorting method on this list? The code
lst = [Number(10), Number(20), Number(5)]
lst.sort()
print(lst)
prints
20 >= 10
5 < 20
5 < 20
5 < 10
[5, 10, 20]
There are four comparisons! And weirdly, the method checks that 5 < 20
twice in a row. What's going on there?1
1Perhaps this is a case where there are multiple cores, and the second core is like the younger brother. It looks up at its older, wiser, first core brother and says things like are you sure that 5 < 20, really? That's definitely how multiple cores works, right?
At its heart, this was at the core of the thread on comp.lang.python. Why are there extra comparisons in cases like this?
Poking around the implementation of timsort taught me a little bit more about timsort.2 2Such as the fact that the implementation is far more nontrivial than I thought it would be. It's a significant chunk of code that optimizes for many different setups. I understand now why the java and python implementations of timsort diverged.
Timsort approaches this sorting task in the following way.
- First, timsort tries to identify how large the first run within the sequence is. So it keeps comparing terms until it finds one that is out of order. In this case, it compares
20
to10
(finding that20 > 10
, and thus the run is increasing), and then compares5
to20
(finding that5 < 20
, and thus that5
is not part of the same run as10, 20
). Now the run is identified, and there is one element left to incorporate. - Next timsort tries to insert
5
into the already-sorted run. It is more correct to say that timsort attempts to do a binary insertion, since one knows already that the run is sorted.3 3Although I should also admit that this isn't a completely correct description either. Timsort does a sort of merge sort on consecutive runs as well, but I ignore this aspect by choosing a sufficiently small example. In this binary insertion, timsort will compare5
with the middle of the already-sorted run10, 20
. But this is a list of length 2, so what is its middle element? It turns out that timsort takes the latter element,20
, in this case. As5 < 20
, timsort concludes that5
should be inserted somewhere in the first half of the run10, 20
, and not in the second half. - Of course, the first half consists entirely of
10
. Thus the remaining comparison is to check that5 < 10
, and now the list is sorted.
We count4
4and account for
all four of the comparisons. The doubled comparison is due to the two tasks of checking whether 5
is in the same run as 10, 20
, and then of deciding through binary insertion where to place 5
in the smaller sublist of 10, 20
.
Now that we've identified a doubled comparison, we might ask Why is it this way? Is this something that should change?
The short answer is it doesn't really matter. A longer answer is that to apply this in general would cause additional comparisons to be made, since this applies in the case when the last element of the run agrees in value with the central value of the run (which may occur for longer lists if there are repeated values). Checking that this happens would probably either involve comparing the last element of the run with the central value (one extra comparison, so nothing is really saved anyway), or perhaps adding another data structure like a skip list (which seems sufficiently more complicated to not be worth the effort). Or it would only apply when sorting really short lists, in which case there isn't much to worry about.
Learning a bit more about timsort made me realize that I could probably learn a lot by really understanding an implementation of timsort, or even a slightly simplified implementation. It's a nice reminder that one can choose to optimize for certain situations or behaviors, and this might not cover all cases perfectly — and that's ok.
Info on how to comment
To make a comment, please send an email using the button below. Your email address won't be shared (unless you include it in the body of your comment). If you don't want your real name to be used next to your comment, please specify the name you would like to use. If you want your name to link to a particular url, include that as well.
bold, italics, and plain text are allowed in comments. A reasonable subset of markdown is supported, including lists, links, and fenced code blocks. In addition, math can be formatted using
$(inline math)$
or$$(your display equation)$$
.Please use plaintext email when commenting. See Plaintext Email and Comments on this site for more. Note also that comments are expected to be open, considerate, and respectful.