Split a dataframe
Question
print(df)
A B
0 10
1 30
2 50
3 20
4 10
5 30
Expected output, split into three separate dataframes:
A B
0 10
1 30

A B
2 50

A B
3 20
4 10
5 30
Recommended answer
Here's a method using numba to speed up our for loop:
We check when the limit is reached; at that point we reset the running total and assign a new group:
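import numpy as np
import pandas as pd
from numba import njit

# Sample frame, reconstructed from the printed output
df = pd.DataFrame({'A': range(6), 'B': [10, 30, 50, 20, 10, 30]})

@njit
def cumsum_reset(array, limit):
    total = 0      # running sum within the current group
    counter = 0    # current group label
    groups = np.empty(array.shape[0])
    for idx, i in enumerate(array):
        total += i
        # Start a new group when the running sum reaches the limit, or when
        # the previous value alone equals the limit. The idx > 0 guard stops
        # array[idx-1] from wrapping around to the last element at idx == 0.
        if total >= limit or (idx > 0 and array[idx-1] == limit):
            counter += 1
            groups[idx] = counter
            total = 0
        else:
            groups[idx] = counter
    return groups

grps = cumsum_reset(df['B'].to_numpy(), 50)

for _, grp in df.groupby(grps):
    print(grp, '\n')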
Output
A B
0 0 10
1 1 30
A B
2 2 50
A B
3 3 20
4 4 10
5 5 30
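The labelling rule can also be traced by hand without numba or pandas. The following is a minimal pure-Python sketch of the same rule, not a replacement for the answer's function. The names cumsum_reset_labels and split_by_labels are hypothetical helpers introduced here for illustration, and the idx > 0 guard keeps the first element from being compared against the last one through negative indexing.

```python
def cumsum_reset_labels(values, limit):
    """Pure-Python sketch of the group-labelling rule (hypothetical helper)."""
    total = 0      # running sum within the current group
    counter = 0    # current group label
    labels = []
    for idx, value in enumerate(values):
        total += value
        # New group when the running sum reaches the limit, or when the
        # previous value alone equals the limit.
        if total >= limit or (idx > 0 and values[idx - 1] == limit):
            counter += 1
            total = 0
        labels.append(counter)
    return labels


def split_by_labels(rows, labels):
    """Collect rows that share a label into separate lists (hypothetical helper)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return list(groups.values())


b_values = [10, 30, 50, 20, 10, 30]   # column B from the question
labels = cumsum_reset_labels(b_values, 50)
print(labels)                             # [0, 0, 1, 2, 2, 2]
print(split_by_labels(b_values, labels))  # [[10, 30], [50], [20, 10, 30]]
```

The labels only need to differ between groups; their actual values do not matter to groupby, which is why the groups here are numbered 0, 1, 2.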
Timings:
# create dataframe of 600k rows
dfbig = pd.concat([df]*100000, ignore_index=True)
dfbig.shape
(600000, 2)
# Erfan
%%timeit
cumsum_reset(dfbig['B'].to_numpy(), 50)
4.25 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Daniel Mesejo
def daniel_mesejo(th, column):
    cumsum = column.cumsum()
    bins = list(range(0, cumsum.max() + 1, th))
    groups = pd.cut(cumsum, bins)
    return groups
%%timeit
daniel_mesejo(50, dfbig['B'])
10.3 s ± 2.17 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion: on these timings, the numba for loop is roughly 2,400x faster (4.25 ms vs. 10.3 s).
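%%timeit is an IPython magic, so the timings above only run inside a notebook or IPython shell. Outside of one, a comparable measurement could be sketched with the standard-library timeit module. The helper best_time_per_call and the stand-in workload running_total are hypothetical names used for illustration. This sketch reports the best run instead of the mean ± std. dev. that %%timeit prints.

```python
import timeit


def best_time_per_call(func, *args, repeat=5, number=10):
    """Return the best observed time per call, in seconds (hypothetical helper)."""
    timer = timeit.Timer(lambda: func(*args))
    totals = timer.repeat(repeat=repeat, number=number)  # one total per repeat
    return min(totals) / number


def running_total(values):
    # Stand-in workload (assumption); swap in cumsum_reset or daniel_mesejo.
    total = 0
    for v in values:
        total += v
    return total


data = [10, 30, 50, 20, 10, 30] * 1000
seconds = best_time_per_call(running_total, data)
print(f"{seconds * 1e6:.1f} µs per call")
```

When timing an @njit function this way, it is worth calling it once beforehand so that the one-time compilation on first call is not counted in the measurement.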