Split a dataframe

Views: 88 Published: 2020/10/17 0:30:01 Tags: pandas dataframe split threshold

Problem description

print(df)

Recommended answer

Here's a method that uses numba to speed up the for loop:

We check when the limit is reached; at that point we reset the running total and assign a new group:

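import numpy as np
from numba import njit

@njit
def cumsum_reset(array, limit):
    total = 0      # running sum within the current group
    counter = 0    # current group label
    groups = np.empty(array.shape[0])
    for idx, i in enumerate(array):
        total += i
        # Start a new group when the running total reaches the limit,
        # or when the previous value alone equals the limit.
        # The idx > 0 guard keeps array[idx-1] from wrapping to the last element.
        if total >= limit or (idx > 0 and array[idx-1] == limit):
            counter += 1
            groups[idx] = counter
            total = 0
        else:
            groups[idx] = counter

    return groups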

grps = cumsum_reset(df['B'].to_numpy(), 50)

for _, grp in df.groupby(grps):
    print(grp, '\n')
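
To see what labels the loop produces, here is a minimal plain-Python sketch of the same rule run on a short list. The sample values and the name `cumsum_reset_py` are made up for illustration (the question's original `df` is not shown here), and the `idx > 0` guard is an assumption about the intended behavior, so that the first row never compares against the last element.

```python
def cumsum_reset_py(values, limit):
    """Plain-Python illustration of the grouping rule (hypothetical helper)."""
    total = 0      # running sum within the current group
    counter = 0    # current group label
    groups = []
    for idx, value in enumerate(values):
        total += value
        # New group when the running total reaches the limit, or when the
        # previous value alone equals the limit (guarded for idx == 0).
        if total >= limit or (idx > 0 and values[idx - 1] == limit):
            counter += 1
            total = 0
        groups.append(counter)
    return groups

sample = [10, 20, 30, 5, 50, 7, 40, 15]  # hypothetical values, not the question's data
labels = cumsum_reset_py(sample, 50)
print(labels)  # [0, 0, 1, 1, 2, 3, 3, 4]
```

Note that the row that pushes the total to the limit opens the new group, and the running total restarts at zero after it.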

Output

Timings:

# create dataframe of 600k rows
import pandas as pd

dfbig = pd.concat([df]*100000, ignore_index=True)
dfbig.shape

(600000, 2)

# Erfan
%%timeit
cumsum_reset(dfbig['B'].to_numpy(), 50)

4.25 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Daniel Mesejo
def daniel_mesejo(th, column):
    cumsum = column.cumsum()
    bins = list(range(0, cumsum.max() + 1, th))
    groups = pd.cut(cumsum, bins)
    
    return groups

%%timeit
daniel_mesejo(50, dfbig['B'])

10.3 s ± 2.17 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion: the numba for loop is roughly 2,400x faster (4.25 ms vs. 10.3 s).
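
The `%%timeit` cells above only work in IPython or Jupyter. In a plain script, a comparable measurement can be sketched with the standard library's `timeit` module. The helper `time_call` and the stand-in function `running_total` below are hypothetical names used only for illustration; the helper reports the best run, whereas `%%timeit` reports mean and standard deviation. The warm-up call matters for an `@njit` function, since numba compiles it on the first call.

```python
import timeit

def running_total(values):
    """Hypothetical stand-in for the function being timed."""
    total = 0
    for v in values:
        total += v
    return total

def time_call(func, *args, repeat=7, number=100):
    """Return the best per-call time in seconds over `repeat` runs."""
    func(*args)  # warm-up call (for an @njit function this triggers compilation)
    runs = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(runs) / number

data = list(range(10_000))  # hypothetical input
best = time_call(running_total, data)
print(f"best of 7 runs: {best * 1e3:.3f} ms per call")
```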
