Why does Pandas iterate over DataFrame columns by default?

A DataFrame is primarily a column-based data structure. Under the hood, the data inside the DataFrame is stored in blocks, with roughly one block per dtype. Each column has exactly one dtype, so accessing a column only requires selecting the appropriate column from a single block. In contrast, selecting a single row requires picking the appropriate row out of every block, forming a new Series, and copying the data from each block's row into that Series. Thus, iterating through the rows of a DataFrame is (under the hood) not as natural a process as iterating through its columns.
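A small sketch of the visible consequence, using a made-up three-column frame (the column names and values are only for illustration): a column keeps its own dtype, while a row has to be assembled into a new Series whose dtype must accommodate every column.

```python
import pandas as pd

# Hypothetical frame with three different dtypes.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": ["x", "y", "z"]})

col = df["a"]     # one column: a single, homogeneous dtype
row = df.iloc[0]  # one row: values gathered from every column into a new Series

print(col.dtype)  # an integer dtype
print(row.dtype)  # object, because the row mixes ints, floats and strings
```

The row Series is indexed by the column labels, which is another hint that it is built fresh from the columns rather than read straight out of storage.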

If you need to iterate through the rows, you still can, by calling df.iterrows(). You should avoid df.iterrows() where possible, though, for the same reason that row iteration is unnatural: each row has to be copied into a new Series, which makes the process slower than iterating through columns.
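As a rough illustration, here is the same total computed row by row with df.iterrows() and then column-wise. The frame and its column names are invented for the example, and the column-wise form is just one common alternative, not the only one.

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 4]})

# Row by row: every iteration builds a new Series for that row.
total_rows = 0.0
for _, row in df.iterrows():
    total_rows += row["price"] * row["qty"]

# Column-wise: operate on whole columns at once, no per-row Series.
total_cols = (df["price"] * df["qty"]).sum()

# The copy is visible in the dtypes: the integer qty comes back as a
# float inside the row Series, since the row needs one common dtype.
first_row = next(df.iterrows())[1]
print(first_row.dtype)
```

Both totals come out the same; the difference is in how much work pandas does to get there.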

There's a decent explanation in the docs: iteration over a Pandas DataFrame is meant to be "dict-like," so the iteration is over the keys (the column labels).

Arguably it's a little confusing that iteration over a Series is over the values, but as the docs note, that's because Series are more "array-like".
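A quick way to see both behaviours side by side (the frame here is a made-up two-column example):

```python
import pandas as pd

data = {"a": [1, 2], "b": [3, 4]}
df = pd.DataFrame(data)

dict_keys = list(data)          # iterating a dict yields its keys
df_labels = list(df)            # iterating a DataFrame yields column labels
series_values = list(df["a"])   # iterating a Series yields its values

print(dict_keys)      # ['a', 'b']
print(df_labels)      # ['a', 'b']
print(series_values)  # [1, 2]
```

So `for x in df` behaves like `for x in some_dict`, while `for x in df["a"]` behaves like looping over an array.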
