How to iterate over consecutive chunks of Pandas dataframe efficiently


Question

I have a large dataframe (several million rows).

I want to be able to do a groupby operation on it, but just grouping by arbitrary consecutive (preferably equal-sized) subsets of rows, rather than using any particular property of the individual rows to decide which group they go to.

The use case: I want to apply a function to each row via a parallel map in IPython. It doesn't matter which rows go to which back-end engine, as the function calculates a result based on one row at a time. (Conceptually at least; in reality it's vectorized.)

I've come up with something like this:

import numpy as np

# Generate a number from 0-9 for each row, indicating which tenth of the DF it belongs to
max_idx = dataframe.index.max()
tenths = ((10 * dataframe.index) / (1 + max_idx)).astype(np.uint32)

# Use this value to perform a groupby, yielding 10 consecutive chunks
groups = [g[1] for g in dataframe.groupby(tenths)]

# Process chunks in parallel (dview is an IPython parallel DirectView)
results = dview.map_sync(my_function, groups)

But this seems very long-winded, and doesn't guarantee equal sized chunks. Especially if the index is sparse or non-integer or whatever.

Any better suggestions?

Thanks!

Answer

In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case you could only get equal-sized chunks at 1 or N. Because of this, real-world chunking typically uses a fixed size and allows for a smaller chunk at the end. I tend to pass an array to groupby. Starting from:

>>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
>>> df[0] = range(15)
>>> df
    0         1         2         3         4
0   0  0.746300  0.346277  0.220362  0.172680
0   1  0.657324  0.687169  0.384196  0.214118
0   2  0.016062  0.858784  0.236364  0.963389
[...]
0  13  0.510273  0.051608  0.230402  0.756921
0  14  0.950544  0.576539  0.642602  0.907850

[15 rows x 5 columns]

where I've deliberately made the index uninformative by setting it to 0, we simply decide on our size (here 10) and integer-divide an array by it:

>>> df.groupby(np.arange(len(df))//10)
<pandas.core.groupby.DataFrameGroupBy object at 0xb208492c>
>>> for k,g in df.groupby(np.arange(len(df))//10):
...     print(k,g)
...     
0    0         1         2         3         4
0  0  0.746300  0.346277  0.220362  0.172680
0  1  0.657324  0.687169  0.384196  0.214118
0  2  0.016062  0.858784  0.236364  0.963389
[...]
0  8  0.241049  0.246149  0.241935  0.563428
0  9  0.493819  0.918858  0.193236  0.266257

[10 rows x 5 columns]
1     0         1         2         3         4
0  10  0.037693  0.370789  0.369117  0.401041
0  11  0.721843  0.862295  0.671733  0.605006
[...]
0  14  0.950544  0.576539  0.642602  0.907850

[5 rows x 5 columns]

Methods based on slicing the DataFrame can fail when the index isn't compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
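As a concrete sketch of that position-based approach: stepping through the DataFrame with `.iloc` in fixed strides yields equal-sized chunks plus a smaller remainder, regardless of the index. (The helper name `chunk_df` is illustrative, not from the original answer.)

```python
import numpy as np
import pandas as pd

def chunk_df(df, size):
    """Yield consecutive chunks of at most `size` rows, by position."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

# Same setup as above: 15 rows, deliberately uninformative index
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)
chunks = list(chunk_df(df, 10))
print([len(c) for c in chunks])  # → [10, 5]
```

Because `.iloc` ignores index labels entirely, this works even with the duplicate `0` index, where label-based slicing would be ambiguous.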
