Can I perform dynamic cumsum of rows in pandas?
Question
If I have the following dataframe, derived like so: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))
0
0 0
1 2
2 8
3 1
4 0
5 0
6 7
7 0
8 2
9 2
Is there an efficient way to cumsum rows with a limit, and each time this limit is reached, to start a new cumsum? After each limit is reached (however many rows), a row is created with the total cumsum.
Below I have created an example of a function that does this, but it's very slow, especially when the dataframe becomes very large. I don't like that my function is looping, and I am looking for a way to make it faster (I guess a way without a loop).
import numpy as np
import pandas as pd

def foo(df, max_value):
    last_value = 0
    storage = []
    for index, row in df.iterrows():
        this_value = np.nansum([row[0], last_value])
        if this_value >= max_value:
            storage.append((index, this_value))
            this_value = 0
        last_value = this_value
    return storage
If you run my function like so: foo(df, 5)
In the above context, it returns:
0
2 10
6 8
Recommended Answer
The loop cannot be avoided, but it can be compiled to fast native code with numba's njit:
from numba import njit, prange

@njit
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i]
    cumsum.append([index[-1], running])
    return cumsum
The index is required here, assuming your index is not numeric/monotonically increasing.
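To see why the labels matter, here is a plain-Python rendering of the same logic (no numba, purely for illustration) run against a non-monotonic index — the reported positions come from the index labels, not the row numbers:

```python
def dynamic_cumsum_py(seq, index, max_value):
    # Same logic as dynamic_cumsum above, without the JIT decorator.
    cumsum = []
    running = 0
    for i in range(len(seq)):
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i]
    cumsum.append([index[-1], running])
    return cumsum

seq   = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]   # same values as the example frame
index = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]   # non-monotonic labels
dynamic_cumsum_py(seq, index, 5)         # -> [[6, 10], [2, 8], [0, 4]]
```

The labels 6, 2 and 0 in the result are index values, not positions, which is exactly what the extra `index` argument buys you.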
%timeit foo(df, 5)
1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If the index is of Int64Index type, you can shorten this to:
@njit
def dynamic_cumsum2(seq, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([i, running])
            running = 0
        running += seq[i]
    cumsum.append([i, running])
    return cumsum
lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
pd.DataFrame(lst, columns=['A', 'B']).set_index('A')
B
A
3 10
7 8
9 4
%timeit foo(df, 5)
1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
njit Functions Performance
import perfplot

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
    kernels=[
        lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
        lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
    ],
    labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
    n_range=[2**k for k in range(0, 17)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=None  # TODO - update when @jpp adds in the final `yield`
)
The log-log plot shows that the generator function is faster for larger inputs:
A possible explanation is that, as N increases, the overhead of appending to a growing list in dynamic_cumsum2 becomes significant, while cumsum_limit_nb only has to yield.
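cumsum_limit_nb itself comes from another answer and is not reproduced here. As a rough idea of what such a generator-based variant might look like (the name, signature, and handling of the final partial sum are assumptions, not the original code):

```python
def cumsum_limit(seq, max_value):
    # Yield (index, running_total) each time the running total reaches
    # the limit, then reset the accumulator instead of growing a list.
    running = 0
    for i, value in enumerate(seq):
        running += value
        if running >= max_value:
            yield i, running
            running = 0
    if running > 0:                    # emit any leftover partial sum
        yield len(seq) - 1, running

seq = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]
list(cumsum_limit(seq, 5))             # -> [(2, 10), (6, 8), (9, 4)]
```

Because nothing is accumulated inside the function, each result is handed to the caller as soon as it is ready; numba can compile generator functions in nopython mode as well, which is what makes the benchmark comparison above possible.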