Engineer in Tokyo

Python Sets

I had an application with two lists of unique items that I wanted to take the intersection of. I had figured that using a python set would be faster but I didn’t realize that it would be faster than the simple list comprehension by this much.

~$  python -m timeit -n 1000 "range1500=range(500, 1500);[x for x in range(1000) if x in range1500]"
1000 loops, best of 3: 18.2 msec per loop

~$ python -m timeit -n 1000 "set(range(1000)).intersection(range(500, 1500))"
1000 loops, best of 3: 120 usec per loop

It’s about 150 times faster (notice the difference in order of magnitude, usec vs. msec). It’s not actually because of the list comprehension but rather because of the repeated use of in within the list comprehension. The check for x in range1500 is executed 1000 times which adds up.

To find out what actually takes so much time lets first check the range() function.

~$ python -m timeit -n 1000 "range(500, 1500)"
1000 loops, best of 3: 12.9 usec per loop

We only call this twice so this isn’t what’s taking several milliseconds.

Next lets try the in statement.

~$ python -m timeit -n 1000 "1 in range(1000)"
1000 loops, best of 3: 12.3 usec per loop

Here we go. The in statement only takes 12.3 microseconds but it’s run 1000 times. That adds up to 12.3 microseconds. We can safely say this is what’s taking a majority of the time. But there is still about 5 micro seconds unaccounted for.

What we did was check this in the best case where it finds a match at the beginning of the list. What if it’s at the end?

~$ python -m timeit -n 1000 "1000 in range(1000)"
1000 loops, best of 3: 35.2 usec per loop

Here we see it takes a lot more time to run the inclusion check. What’s happening in the example is that, since I was using a simple range statement, it slowly goes from a the best case scenario to the worst case scenario so we end of getting an average of somewhere in between.

In any case sets are a very useful tool to avoid having to do inclusion checks on lists when you need to get the intersection, union, or diff of a list.