In theory, theory and practice are the same. In practice they are different.

Hey pips!

Last time we looked at the tricky topic of generators.

def find_flagged_users(users):
    for user_id in users:
        user = find_user(user_id)
        if user.expired or user.flags.size > 0:
            yield user

That there is a generator function since it uses yield. Let’s turn it into an equivalent generator expression – the look-up happens in an inner expression so the filter can actually see each user:

flagged_users = (
    user
    for user in (find_user(user_id) for user_id in users)
    if user.expired or user.flags.size > 0
)
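
If you peek at flagged_users in the REPL before looping, you’ll just see a generator object – no users have actually been looked up yet (the memory address will vary, obviously):

>>> flagged_users
<generator object <genexpr> at 0x...>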

Now flagged_users stores a generator object. We can find its values by iterating over it:

>>> for user in flagged_users:
...     print(user.name)
Bob
Jeff
Rick

Coolio. Now let’s suppose we wanted to iterate over it again, this time in a set comprehension:

>>> {flag for user in flagged_users for flag in user.flags}
set()

Huh – An empty set? What happened to our users?

This is the other quirk of generators. Remember that they’re lazy-loading, so they only compute each value on the fly as they’re iterated over. When iteration reaches the end, the generator is exhausted – consumed. At this point, further iteration won’t do anything.

>>> gen = (i for i in range(4))
>>> list(gen)  # constructing a list iterates over its input
[0, 1, 2, 3]
>>> set(gen)  # as does constructing a set – but gen is already spent
set()

Ask it directly for another value with next() and you get a StopIteration exception – which, under the hood, is exactly what tells a for loop when to stop!

>>> next(gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
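
In fact, that raise-and-catch dance is pretty much all a for loop does behind the scenes. Here’s a rough sketch of the equivalent while loop:

nums = (i for i in range(4))
iterator = iter(nums)   # a generator is already its own iterator
while True:
    try:
        value = next(iterator)
    except StopIteration:
        break           # this is how the loop knows to stop
    print(value)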

This is super important – generators can only be used once.[^once] That’s the main tradeoff between a list and a generator. If you plan on reusing, mutating and/or passing the collection around, stick to a list. If you only need the items one-by-one as they come in, like when handling a task queue, a generator can be more performant.

[^once]: *Once in their entirety, that is.*
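
As that footnote hints, nothing says you have to drain a generator in one sitting – you can pull a few values with next() and come back for the rest later:

>>> gen = (i for i in range(4))
>>> next(gen)
0
>>> next(gen)
1
>>> list(gen)  # picks up right where we left off
[2, 3]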

Here’s the tradeoff in miniature – one function that genuinely needs the whole list up front, and one that’s happy to take values as they arrive:

from typing import Iterator

def process_list(data: list) -> float:
    n = len(data)          # needs the whole collection to exist...
    highest = max(data)    # ...and walks it more than once
    return sum((each - highest) ** 2 for each in data) / n

def process_gen(data: Iterator[float]) -> float:
    score = 0.0
    for each in data:      # only ever touches one value at a time
        score += each ** 0.5
    return score

A good example of this is when using the built-in Python iterable functions, like sum(), max(), zip(), any() – yeah, those. (Not len() though – a generator doesn’t know its own length, so len() needs a proper sized collection.) They can take any kind of iterable, so a generator works just fine for them! Why expend overhead and memory constructing a list, and only then passing it to the function, when you can just let the function grab the values as it needs them?

# unnecessary list
>>> sum([player.hours for player in data])
1679

# cleaner and faster
>>> sum(player.hours for player in data)
1679

Notice the generator expression’s kinda ‘embedded’ inside the parentheses of the function call, so you don’t need an extra pair like sum((player.hours for player in data)). That would be, well, horrific.
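
One small catch: that shortcut only works when the generator expression is the sole argument. If the call takes anything else, Python makes you keep the genexp’s own parentheses – using the same player data from above:

# fine – the genexp is the only argument
total = sum(player.hours for player in data)

# any extra argument means the genexp needs its own parentheses
longest = max((player.hours for player in data), default=0)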

> [!TIP]
> For small collection sizes, this is totally a micro-optimisation. Negligible impact on performance, lmao. But hey, it does improve readability!

One more thing it’s good to just be aware of – how you can write nested generator functions. It’s unlikely you’ll ever need them unless you’re doing some, idk, strange tree exploration. But cool to have in your toolkit.

Regular functions return just one object, but a generator function returns a generator that yields multiple values. So, if we had a generator function which wanted to call another generator inside it…

import itertools

def pos_ints():
    for n in itertools.count(1):
        yield n

def naturals():
    yield 0
    yield pos_ints()  # <-- careful with this line!
    yield float("inf")

If we get the output of this, it’s not the entire sequence, but just three objects:

>>> list(naturals())
[0, <generator object pos_ints at 0x000001D142D6F420>, inf]

That’s because the line yield pos_ints() yields the generator returned by pos_ints() as one single object – not the individual values inside it. So, to sort of ‘unpack’ it, we need to manually iterate over it:

def naturals():
    yield 0
    for each in pos_ints():
        yield each
    yield float("inf")

This is a bit verbose. Maybe you’d wonder why you can’t unpack the values with *.

def naturals():
    yield 0
    yield *pos_ints()  # <-- not valid Python – SyntaxError!
    yield float("inf")

Well, * has to unpack the values to somewhere, and a yielded value isn’t exactly a valid context. Also, this would still return the entire sequence, not yield the values one-by-one – it’d be pretty peculiar if this were the case:

yield 1
yield 2
yield 3
...

# would be weird if this were the same:
yield *[1, 2, 3, ...]
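
That’s just how * behaves everywhere else, after all – it eagerly builds the whole collection up front in one go:

>>> nums = [0, *range(1, 4)]  # everything materialised immediately
>>> nums
[0, 1, 2, 3]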

Instead, Python provides an intuitive keyword combo for achieving this – it’s yield from!

def naturals():
    yield 0
    yield from pos_ints()
    yield float("inf")

It’s essentially ‘passing control’ over to the nested generator – everything it yields goes straight out to whoever’s iterating. When pos_ints() is exhausted, naturals() gets control back and proceeds to the next value, float("inf"). (PSA: infinity is famously not a natural number – it’s only in there for illustration :P)

>>> list(naturals())
# ...this call never returns – the sequence is infinite, lmao.

Ok, for a better example then:

>>> def inner(word: str):
...     for letter in word:
...         yield letter

>>> def outer(word: str):
...     yield from inner(word)
...     yield "!"

>>> "".join(outer("never"))
'never!'
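
And if you ever do find yourself on that strange tree exploration, yield from plays nicely with recursion too. Here’s a minimal sketch that flattens a nested list, streaming the leaves out one at a time:

def flatten(nested):
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)  # recurse into the sub-list
        else:
            yield item                # a leaf – hand it straight out

>>> list(flatten([1, [2, [3, 4]], 5]))
[1, 2, 3, 4, 5]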

Question? Bug needs fixing? Or just want to nerd out over programming?
Drop a message in the GitHub discussion for this issue.