Tag Archives: timsort

Extra comparisons in Python’s timsort

Reading through the comp.lang.python mailing list, I saw an interesting question concerning the behavior of the default sorting algorithm in python. This led to this post.

Python uses timsort, a clever hybrid sorting algorithm with ideas borrowing from merge sort and (binary) insertion sort. A major idea in timsort is to use the structure of naturally occuring runs (consecutive elements in the list that are either monotone increasing or monotone decreasing) when sorting.

Let’s look at the following simple list.

10, 20, 5

A simple sorting algorithm is insertion sort, which just advances through the list and inserts each number into the correct spot. More explicitly, insertion sort would

  1. Start with the first element, 10. As a list with one element, it is correctly sorted tautologically.
  2. Now consider the second element, 20. We insert this into the correct position in the already-sorted list of previous elements. Here, that just means that we verify that 20 > 10, and now we have the sorted sublist consisting of 10, 20.
  3. Now consider the third element, 5. We want to insert this into the correct position in the already-sorted list of previous elements. A naively easy way to do this is to either scan the list from the right or from the left, and insert into the correct place. For example, scanning from the right would mean that we compare 5 to the last element of the sublist, 20. As 5 < 20, we shift left and compare 5 to 10. As 5 < 10, we shift left again. As there is nothing left to compare against, we insert 5 at the beginning, yielding the sorted list 5, 10, 20.

How many comparisons did this take? This took 20 > 10, 5 < 20, and 5 < 10. This is three comparisons in total.

We can see this programmatically as well. Here is one implementation of insertion_sort, as described above.

def insertion_sort(lst):
    '''
    Sorts `lst` in-place. Note that this changes `lst`.
    '''
    for index in range(1, len(lst)):
        current_value = lst[index]
        position = index
        while position > 0 and lst[position - 1] > current_value:
            lst[position] = lst[position - 1]
            position = position - 1
        lst[position] = current_value

Let’s also create a simple Number class, which is just like a regular number, except that anytime a comparison is done it prints out the comparison. This will count the number of comparisons made for us.

class Number:
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return str(self.value)

    def __repr__(self):
        return self.__str__()

    def __lt__(self, other):
        if self.value < other.value:
            print("{} < {}".format(self, other))
            return True
        print("{} >= {}".format(self, other))
        return False

    def __eq__(self, other):
        if self.value == other.value:
            print("{} = {}".format(self, other))
            return True
        return False

    def __gt__(self, other):
        return not ((self == other) or (self < other))

    def __le__(self, other):
        return (self < other) or (self == other)

    def __ge__(self, other):
        return not (self < other)

    def __nq__(self, other):
        return not (self == other)

With this class and function, we can run

lst = [Number(10), Number(20), Number(5)]
insertion_sort(lst)
print(lst)

which will print

10 < 20
20 >= 5
10 >= 5
[5, 10, 20]

These are the three comparisons we were expecting to see.

Returning to python’s timsort — what happens if we call python’s default sorting method on this list? The code

lst = [Number(10), Number(20), Number(5)]
lst.sort()
print(lst)

prints

20 >= 10
5 < 20
5 < 20
5 < 10
[5, 10, 20]

There are four comparisons! And weirdly, the method checks that 5 < 20 twice in a row. What’s going on there?1

At its heart, this was at the core of the thread on comp.lang.python. Why are there extra comparisons in cases like this?

Poking around the implementation of timsort taught me a little bit more about timsort.2

Timsort approaches this sorting task in the following way.

  1. First, timsort tries to identify how large the first run within the sequence is. So it keeps comparing terms until it finds one that is out of order. In this case, it compares 20 to 10 (finding that 20 > 10, and thus the run is increasing), and then compares 5 to 20 (finding that 5 < 20, and thus that 5 is not part of the same run as 10, 20). Now the run is identified, and there is one element left to incorporate.
  2. Next timsort tries to insert 5 into the already-sorted run. It is more correct to say that timsort attempts to do a binary insertion, since one knows already that the run is sorted.3 In this binary insertion, timsort will compare 5 with the middle of the already-sorted run 10, 20. But this is a list of length 2, so what is its middle element? It turns out that timsort takes the latter element, 20, in this case. As 5 < 20, timsort concludes that 5 should be inserted somewhere in the first half of the run 10, 20, and not in the second half.
  3. Of course, the first half consists entirely of 10. Thus the remaining comparison is to check that 5 < 10, and now the list is sorted.

We count4 all four of the comparisons. The doubled comparison is due to the two tasks of checking whether 5 is in the same run as 10, 20, and then of deciding through binary insertion where to place 5 in the smaller sublist of 10, 20.

Now that we’ve identified a doubled comparison, we might ask Why is it this way? Is this something that should change?

The short answer is it doesn’t really matter. A longer answer is that to apply this in general would cause additional comparisons to be made, since this applies in the case when the last element of the run agrees in value with the central value of the run (which may occur for longer lists if there are repeated values). Checking that this happens would probably either involve comparing the last element of the run with the central value (one extra comparison, so nothing is really saved anyway), or perhaps adding another data structure like a skip list (which seems sufficiently more complicated to not be worth the effort). Or it would only apply when sorting really short lists, in which case there isn’t much to worry about.

Learning a bit more about timsort made me realize that I could probably learn a lot by really understanding an implementation of timsort, or even a slightly simplified implementation. It’s a nice reminder that one can choose to optimize for certain situations or behaviors, and this might not cover all cases perfectly — and that’s ok.

Posted in Programming, Python | Tagged , | Leave a comment