itertools.groupby

Posted on November 05, 2012

A relatively unknown part of the Python standard library that I find myself using fairly regularly at work these days is the groupby function in the itertools module. In a nutshell, groupby takes an iterator and breaks it up into sub-iterators based on changes in the "key" of the main iterator. This is of course done without reading the entire source iterator into memory.

The "key" is almost always based on some part of the items returned by the iterator. It is defined by a "key function", much like the sorted builtin function. groupby probably works best when the data is grouped by the key but this isn't strictly necessary. It depends on the use case.

I've successfully used groupby for splitting up the results of large database queries or the contents of large data files. The resulting code ends up being clean and small.

Here's an example:

from itertools import groupby
from operator import itemgetter

things = [('2009-09-02', 11),
          ('2009-09-02', 3),
          ('2009-09-03', 10),
          ('2009-09-03', 4),
          ('2009-09-03', 22),
          ('2009-09-06', 33)]

for key, items in groupby(things, itemgetter(0)):
    print key
    for subitem in items:
        print subitem
    print '-' * 20

Here the dummy data in the "things" list is grouped by the first item of each element (that is, the key is the first element). For each key, the key is printed followed by the items returned by each sub-iterator.

The output looks like:

2009-09-02
('2009-09-02', 11)
('2009-09-02', 3)
--------------------
2009-09-03
('2009-09-03', 10)
('2009-09-03', 4)
('2009-09-03', 22)
--------------------
2009-09-06
('2009-09-06', 33)
-------------------

The "things" list is a contrived example. In a real world situation this could be a database cursor object or a CSV reader object. Any iterable object can be used.

Here's a closer look at what groupby is doing using the Python interactive shell:

>>> iterator = groupby(things, itemgetter(0))
>>> iterator
<itertools.groupby object at 0x95d3acc>
>>> iterator.next()
('2009-09-02', <itertools._grouper object at 0x95e0d0c>)
>>> iterator.next()
('2009-09-03', <itertools._grouper object at 0x95e0aec>)

You can see how a key and sub-iterator are returned for each pass through the groupby iterator.

groupby is a handy tool to have under your belt. Think of it whenever you need to split up a dataset by some criteria.