Why does Pandas iterate over DataFrame columns by default?

Question

I'm trying to understand the design rationale behind some of Pandas' features.

If I have a DataFrame with 3560 rows and 18 columns, then

len(frame)

is 3560, but

len([a for a in frame])

is 18.
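For reference, the behavior can be reproduced with a small sketch. The zero-filled frame and its column labels (`col0`, `col1`, ...) are assumptions made only for illustration; they are not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's frame: 3560 rows, 18 columns.
frame = pd.DataFrame(
    np.zeros((3560, 18)),
    columns=[f"col{i}" for i in range(18)],
)

n_rows = len(frame)            # len() counts rows
iterated = [a for a in frame]  # iteration yields column labels

print(n_rows, len(iterated))
print(iterated[:3])
```

Iterating the frame yields its column labels, much as iterating a dict yields its keys.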

Maybe this feels natural to someone coming from R, but to me it doesn't feel very "Pythonic". Is there an introduction to the underlying design rationale for Pandas somewhere?

Recommended answer

A DataFrame is primarily a column-based data structure. Under the hood, the data inside the DataFrame is stored in blocks. Roughly speaking, there is one block per dtype, and each column has exactly one dtype. So accessing a column can be done by selecting the appropriate column from a single block. In contrast, selecting a single row requires selecting the appropriate row from each block, forming a new Series, and copying the data from each block's row into that Series. Thus, iterating through the rows of a DataFrame is (under the hood) not as natural a process as iterating through its columns.
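The difference between column and row access can be observed from the outside, without touching pandas internals. The following is a minimal sketch; the column names `count` and `price` are hypothetical. Each column keeps its own dtype, while a single row has to be assembled into a new Series with one common dtype.

```python
import pandas as pd

# Hypothetical two-column frame with different dtypes per column.
df = pd.DataFrame({"count": [1, 2, 3], "price": [1.5, 2.5, 3.5]})

# A column keeps its own dtype.
count_is_int = pd.api.types.is_integer_dtype(df["count"].dtype)
price_is_float = pd.api.types.is_float_dtype(df["price"].dtype)

# A row spans both columns, so its values are copied into a new Series
# whose single dtype must hold both an integer and a float.
row = df.iloc[0]
row_dtype = str(row.dtype)

print(df.dtypes)
print(row_dtype)
```

Here the integer from `count` is upcast to a float when the row is built, which is one visible consequence of rows being reconstructed rather than stored.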

If you need to iterate through the rows, you still can by calling df.iterrows(). Avoid df.iterrows() where possible, though, for the same reason that row iteration is unnatural: each row has to be copied into a new Series, which makes it slower than iterating through columns.
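As an illustration, a row-wise computation can often be rewritten as an operation on whole columns. This is a sketch under assumed data: the `qty` and `price` columns are hypothetical. Using `itertuples()` is an alternative that is not part of the answer above, shown here only for comparison.

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({"qty": [1, 2, 3], "price": [2.0, 3.0, 4.0]})

# Row by row: iterrows() yields (index, Series) pairs, building a Series per row.
total_iterrows = sum(row["qty"] * row["price"] for _, row in df.iterrows())

# Column-wise: multiply two columns and sum the result.
total_columns = (df["qty"] * df["price"]).sum()

# itertuples() yields namedtuples instead of Series.
total_itertuples = sum(t.qty * t.price for t in df.itertuples(index=False))

print(total_iterrows, total_columns, total_itertuples)
```

All three produce the same total; the column-wise form works with the DataFrame's layout rather than against it.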
