mixedmath

Explorations in math and programming
David Lowry-Duda



Reading through the comp.lang.python mailing list, I saw an interesting question concerning the behavior of the default sorting algorithm in python. This led to this post.

Python uses timsort, a clever hybrid sorting algorithm with ideas borrowing from merge sort and (binary) insertion sort. A major idea in timsort is to use the structure of naturally occuring runs (consecutive elements in the list that are either monotone increasing or monotone decreasing) when sorting.

Let's look at the following simple list.

10, 20, 5

A simple sorting algorithm is insertion sort, which just advances through the list and inserts each number into the correct spot. More explicitly, insertion sort would

  1. Start with the first element, 10. As a list with one element, it is correctly sorted tautologically.
  2. Now consider the second element, 20. We insert this into the correct position in the already-sorted list of previous elements. Here, that just means that we verify that 20 > 10, and now we have the sorted sublist consisting of 10, 20.
  3. Now consider the third element, 5. We want to insert this into the correct position in the already-sorted list of previous elements. A naively easy way to do this is to either scan the list from the right or from the left, and insert into the correct place. For example, scanning from the right would mean that we compare 5 to the last element of the sublist, 20. As 5 < 20, we shift left and compare 5 to 10. As 5 < 10, we shift left again. As there is nothing left to compare against, we insert 5 at the beginning, yielding the sorted list 5, 10, 20.

How many comparisons did this take? This took 20 > 10, 5 < 20, and 5 < 10. This is three comparisons in total.

We can see this programmatically as well. Here is one implementation of insertion_sort, as described above.

def insertion_sort(lst):
    '''
    Sorts `lst` in-place. Note that this changes `lst`.
    '''
    for index in range(1, len(lst)):
        current_value = lst[index]
        position = index
        while position &gt; 0 and lst[position - 1] &gt; current_value:
            lst[position] = lst[position - 1]
            position = position - 1
        lst[position] = current_value

Let's also create a simple Number class, which is just like a regular number, except that anytime a comparison is done it prints out the comparison. This will count the number of comparisons made for us.

class Number:
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return str(self.value)

    def __repr__(self):
        return self.__str__()

    def __lt__(self, other):
        if self.value &lt; other.value:
            print("{} &lt; {}".format(self, other))
            return True
        print("{} &gt;= {}".format(self, other))
        return False

    def __eq__(self, other):
        if self.value == other.value:
            print("{} = {}".format(self, other))
            return True
        return False

    def __gt__(self, other):
        return not ((self == other) or (self &lt; other))

    def __le__(self, other):
        return (self &lt; other) or (self == other)

    def __ge__(self, other):
        return not (self &lt; other)

    def __nq__(self, other):
        return not (self == other)

With this class and function, we can run

lst = [Number(10), Number(20), Number(5)]
insertion_sort(lst)
print(lst)

which will print

10 &lt; 20
20 &gt;= 5
10 &gt;= 5
[5, 10, 20]

These are the three comparisons we were expecting to see.

Returning to python's timsort — what happens if we call python's default sorting method on this list? The code

lst = [Number(10), Number(20), Number(5)]
lst.sort()
print(lst)

prints

20 >= 10
5 < 20
5 < 20
5 < 10
[5, 10, 20]

There are four comparisons! And weirdly, the method checks that 5 < 20 twice in a row. What's going on there?1 1Perhaps this is a case where there are multiple cores, and the second core is like the younger brother. It looks up at its older, wiser, first core brother and says things like are you sure that 5 < 20, really? That's definitely how multiple cores works, right?

At its heart, this was at the core of the thread on comp.lang.python. Why are there extra comparisons in cases like this?

Poking around the implementation of timsort taught me a little bit more about timsort.2 2Such as the fact that the implementation is far more nontrivial than I thought it would be. It's a significant chunk of code that optimizes for many different setups. I understand now why the java and python implementations of timsort diverged.

Timsort approaches this sorting task in the following way.

  1. First, timsort tries to identify how large the first run within the sequence is. So it keeps comparing terms until it finds one that is out of order. In this case, it compares 20 to 10 (finding that 20 > 10, and thus the run is increasing), and then compares 5 to 20 (finding that 5 < 20, and thus that 5 is not part of the same run as 10, 20). Now the run is identified, and there is one element left to incorporate.
  2. Next timsort tries to insert 5 into the already-sorted run. It is more correct to say that timsort attempts to do a binary insertion, since one knows already that the run is sorted.3 3Although I should also admit that this isn't a completely correct description either. Timsort does a sort of merge sort on consecutive runs as well, but I ignore this aspect by choosing a sufficiently small example. In this binary insertion, timsort will compare 5 with the middle of the already-sorted run 10, 20. But this is a list of length 2, so what is its middle element? It turns out that timsort takes the latter element, 20, in this case. As 5 < 20, timsort concludes that 5 should be inserted somewhere in the first half of the run 10, 20, and not in the second half.
  3. Of course, the first half consists entirely of 10. Thus the remaining comparison is to check that 5 < 10, and now the list is sorted.

We count4 4and account for all four of the comparisons. The doubled comparison is due to the two tasks of checking whether 5 is in the same run as 10, 20, and then of deciding through binary insertion where to place 5 in the smaller sublist of 10, 20.

Now that we've identified a doubled comparison, we might ask Why is it this way? Is this something that should change?

The short answer is it doesn't really matter. A longer answer is that to apply this in general would cause additional comparisons to be made, since this applies in the case when the last element of the run agrees in value with the central value of the run (which may occur for longer lists if there are repeated values). Checking that this happens would probably either involve comparing the last element of the run with the central value (one extra comparison, so nothing is really saved anyway), or perhaps adding another data structure like a skip list (which seems sufficiently more complicated to not be worth the effort). Or it would only apply when sorting really short lists, in which case there isn't much to worry about.

Learning a bit more about timsort made me realize that I could probably learn a lot by really understanding an implementation of timsort, or even a slightly simplified implementation. It's a nice reminder that one can choose to optimize for certain situations or behaviors, and this might not cover all cases perfectly — and that's ok.


Leave a comment

Info on how to comment

To make a comment, please send an email using the button below. Your email address won't be shared (unless you include it in the body of your comment). If you don't want your real name to be used next to your comment, please specify the name you would like to use. If you want your name to link to a particular url, include that as well.

bold, italics, and plain text are allowed in comments. A reasonable subset of markdown is supported, including lists, links, and fenced code blocks. In addition, math can be formatted using $(inline math)$ or $$(your display equation)$$.

Please use plaintext email when commenting. See Plaintext Email and Comments on this site for more. Note also that comments are expected to be open, considerate, and respectful.

Comment via email