Python generators

Summary from reading: “Generator Tricks for Systems Programmers” and “A Curious Course on Coroutines and Concurrency” by David Beazley

Syntax

In Python, we have [list], {dictionary}, and (generator)

Iterators vs Generators

Iterator is an object with ˍˍiterˍˍ() and next() defined, and raise StopIteration upon next() when there is no more things to iterate over, e.g.

class countdown(object):
	def __init__(self,start):
		self.count = start
	def __iter__(self):
		return self
	def next(self):
		if self.count <= 0:
			raise StopIteration
		r = self.count
		self.count -= 1
		return r
c = countdown(5)
for i in c:
	print i

Generator is a function with yield, e.g.

def countdown(n)
	while n > 0:
		yield n
		n -= 1
for i in countdown(5):
	print i

When a generator returns, it raises StopIteration implicitly. Similar to iterators, we can call a generator with next(), e.g.

c = countdown(5)
print c.next()

Generators as a pipeline

An example to use a generator to analyse a web log:

An apache log file is lines of space-delimited fields, with the last field is either the byte count or a “-“ if the count if not available. The imperative programming model suggests this solution:

wwwlog = open("access-log")
total = 0
for line in wwwlog:
	bytestr = line.rsplit(None,1)[1]
	if bytestr != '-':
		total += int(bytestr)
print "Total", total

The generator expression solution is the following:

wwwlog     = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes      = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)

To extend this program, we can scan over a directory of log files, possibly compressed, and print the sum of bytes of all files that patches a pattern:

import os
import fnmatch
def gen_find(filepat,top):
	for path, dirlist, filelist in os.walk(top):
		# path,dirlist,filelist = Current dir, list of subdir, list of files
		for name in fnmatch.filter(filelist,filepat):
			yield os.path.join(path,name)  
import gzip, bz2
def gen_open(filenames):
	for name in filenames:
		if name.endswith(".gz"):
			yield gzip.open(name)
		elif name.endswith(".bz2"):
			yield bz2.BZ2File(name)
		else:
			yield open(name)
def gen_cat(sources):
	for s in sources:
		for item in s:
			yield item
import re
def gen_grep(pat, lines):
	patc = re.compile(pat)
	for line in lines:
		if patc.search(line): yield line

pat        = r"somepattern"
logdir     = "/some/dir/"

filenames  = gen_find("access-log*",logdir)
logfiles   = gen_open(filenames)
loglines   = gen_cat(logfiles)
patlines   = gen_grep(pat,loglines)
bytecolumn = (line.rsplit(None,1)[1] for line in patlines)
bytes      = (int(x) for x in bytecolumn if x != '-')

print "Total", sum(bytes)

Killing Generators

A generator object can be forced to close before StopIteration, by calling object.close(). In the meantime, that generator object receives a GeneratorExit event, so that clean up can be done internally, e.g.

import time
def follow(thefile):
	thefile.seek(0,2)	# go to the end of the file
	try:
		while True:
			line = thefile.readline()
			if not line:
				time.sleep(0.1)	# sleep briefly
				continue
			yield line
	except GeneratorExit:
		print "Shutting down"

But calling close() is supported only for generators in the same thread, otherwise a user-provided semaphore is needed.

Coroutines

New in Python 2.5, we have coroutines, which is same as generators, that uses yield command.

Instead of output by yield, coroutines are taking input from yield, example as follows:

def grep(pattern):
	print "Looking for %s" % pattern
	while True:
		line = (yield)
		if pattern in line:
			print line,
g = grep("python")
g.next()
g.send("python generators rock!")

The coroutine is a function that takes yield as an expression. The value that yield gives is the argument from the send() call from outside. Before one can do send(), we need to call the coroutine with next() so as to prime it, i.e. proceed the coroutine function to the position of yield expression.

Simply speaking, using generators is a “pull” of data but using coroutines is a “push” of data.

Similar to generators, we can kill a coroutine with close() call and the coroutine will receive a GeneratorExit event.