Can you spot the bug in this Python code?

This blog has been translated into Chinese, and can be read here

I ran into an interesting quirk while parsing some text today. Before I dive
in, let’s establish some background. I’m parsing some comma-delimited data from
a text file that looks like this:

0x5d,3,0x10060,(0x00800513),x10,0x8
0x2d41854,3,0x80001a14,(0xbffd)
0x2d41855,3,0x80001a12,(0x0001)
0x2d41856,3,0x80001a14,(0xbffd)
0x2d41857,3,0x80001a12,(0x0001)

It has variable width hex values where each line has at least three fields, but
some have more. I only really care about the first and third fields. In my
mind, I broke down the parsing into three pieces:

For each line in the file…
split the line on the commas into a list, then…
select the first and third elements, and convert them into integers

It seemed simple enough that I could make my pandas
DataFrame
in a line or two.

Here’s some simplified code that I produced:

import pandas as pd
def load_file_as_df(filename: str) -> pd.DataFrame:
    with open(filename) as fp:
        return pd.DataFrame.from_records(
            (
                (int(tsc, 16), int(pc, 16))
                for line in fp
                for tsc, _, pc, *_ in line.strip().split(",")
            ),
            columns=["Timestamp", "PC"]
        )

Can you spot the bug? Neither could I. I’ll explain the code I wrote
piece-by-piece, and then go into why I was wrong.

Explaining my Code

CSV files are lists of lists

I simple view of CSV data is to look at it like lists of lists. So that meant
looking at elements in a nested list. My go-to pattern for looking at a
flattened-view of nested lists comes from this StackOverflow
post. With the main idea reproduced here
for posterity

nested_lists = [[1,2,3],[4,5,6],[7,8,9]]
flattened_list = [element for sublist in nested_lists for element in sublist]

As someone who came to Python after being exposed to C and C++, learning
about nested comprehensions really gave me the feeling that Python is
just machine-interpretable pseudocode, for better or worse. The nested
list comprehension that generates the following bytecode:

  2           0 RESUME                   0
              2 BUILD_LIST               0
              4 LOAD_FAST                0 (.0)
        >>    6 FOR_ITER                 9 (to 26)
              8 STORE_FAST               1 (sublist)
             10 LOAD_FAST                1 (sublist)
             12 GET_ITER
        >>   14 FOR_ITER                 4 (to 24)
             16 STORE_FAST               2 (element)
             18 LOAD_FAST                2 (element)
             20 LIST_APPEND              3
             22 JUMP_BACKWARD            5 (to 14)
        >>   24 JUMP_BACKWARD           10 (to 6)
        >>   26 RETURN_VALUE

Then we extend the idea by combining with some extended iterable
unpacking and you get the following
snippet from above:

pd.DataFrame.from_records(
    (
        (int(tsc, 16), int(pc, 16))
        for line in fp
        for tsc, _, pc, *_ in line.strip().split(",")
    ),
    columns=["Timestamp", "PC"]
)

The Mistake

It turns out, the python interpreter can’t quite combine iterable
decomposition and comprehensions the way that I thought, you have
to wrap the .split(",") call in what looks like another list:

pd.DataFrame.from_records(
    (
        (int(tsc, 16), int(pc, 16))
        for line in fp
        for tsc, _, pc, *_ in [line.strip().split(",")]
    ),  #            this guy⤴          and this guy⤴
    columns=["Timestamp", "PC"]
)

This broke my brain a little bit, because the .split(",") is already a list.
Doesn’t wrapping it in the square list mean it’s now a double-nested list? I didn’t
understand. I went to Compiler Explorer looking for answers.
Here’s the diff of the bytecode for the correct generator expression and the
the one I wrote:

Not the most intuitive, but we have a better picture of what’s going on here. Can you
see it yet? The problem in the old code is that it silently converted the.split()
to an iterator BEFORE unpacking. I’m not certain, but I believe this is inherently
tied to implementation details established when the PEP for list comprehensions was
originally proposed.

I actually figured this out with the help of CPython contributor
Crowthebird demonstrated the
problem with the code rewritten without using comprehensions.

The incorrect approach does this:

splitted = ["0x2d41854", "3", "0x80001a14", "(0xbffd)"]
for tsc, _, pc, *_ in iter(splitted):
    data.append((int(tsc, 16), int(pc, 16)))

The correct approach does this:

tsc, _, pc, *_ = splitted
data.append((int(tsc, 16), int(pc, 16)))

Could this be something that you fix?

This is a fault in my understanding of the Python interperter, not a fault
with the interpreter itself. I don’t think changing the language to fit my
intuitive syntax would be for the better because it makes it difficult to
differentiate when the container should be unpacked and re-used in the nested
case, versus the erroneous formation of a list comprehension returning tuples,
mentioned by Guido as the disallowed syntax in PEP
202.

Source link