Fastest way to compare row and previous row in pandas dataframe with millions of rows


Problem description

I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.

As an example, this is a simplified version of my problem:

   User  Time                 Col1  newcol1  newcol2  newcol3  newcol4
0     1     6     [cat, dog, goat]        0        0        0        0
1     1     6         [cat, sheep]        0        0        0        0
2     1    12        [sheep, goat]        0        0        0        0
3     2     3          [cat, lion]        0        0        0        0
4     2     5  [fish, goat, lemur]        0        0        0        0
5     3     9           [cat, dog]        0        0        0        0
6     4     4          [dog, goat]        0        0        0        0
7     4    11                [cat]        0        0        0        0

At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.

Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this but it's pretty similar to what I'm actually doing in ipython notebook):

def myJFunc(df):
    # initialize jnum counter
    jnum = 0
    # loop through each row of the dataframe (not including the first/zeroth)
    for i in range(1, len(df)):
        # has the user changed?
        if df.User.loc[i] == df.User.loc[i - 1]:
            # has time increased by more than 1 (hour)?
            if abs(df.Time.loc[i] - df.Time.loc[i - 1]) > 1:
                # update new columns (df.loc[row, col] writes in place;
                # chained df['col'].loc[row] = ... may silently hit a copy)
                df.loc[i - 1, 'newcol2'] = 1
                df.loc[i, 'newcol1'] = 1
                # increase jnum
                jnum += 1
            # has the content changed?
            if df.Col1.loc[i][0] != df.Col1.loc[i - 1][0]:
                # record this change (.at can store a list in one object-dtype cell)
                df.at[i - 1, 'newcol4'] = [df.Col1.loc[i - 1][0], df.Col2.loc[i][0]]
        # different user?
        else:
            # update new columns
            df.loc[i, 'newcol1'] = 1
            df.loc[i - 1, 'newcol2'] = 1
            # store jnum elsewhere (code not included here) and reset jnum
            jnum = 1

I now need to apply this function to several million rows and it's impossibly slow so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!

(As regards using .loc, I compared .loc, .iloc and .ix and this one was marginally faster so that's the only reason I'm using that currently)

(Also, my User column in reality is unicode not int, which could be problematic for speedy comparisons)
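
One workaround for the unicode comparison cost, as a sketch (not part of the original question): convert `User` to a pandas categorical so that repeated strings become small integers, and compare those codes instead of the strings themselves.

```python
import pandas as pd

# Toy data standing in for the real unicode User column.
df = pd.DataFrame({'User': ['alice', 'alice', 'bob', 'alice']})

# Categorical codes replace each distinct string with a small integer,
# so the row-vs-previous-row comparison becomes an integer comparison.
codes = df['User'].astype('category').cat.codes
same_as_prev = codes.eq(codes.shift())
print(same_as_prev.tolist())
```

The first row compares unequal because `shift()` leaves it with no previous value.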

Recommended answer

I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby is just going to have the effect of putting a NaN in the first row whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)

df['time_diff'] = df.groupby('User')['Time'].diff()

df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )

df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()

   User  Time                 Col1  time_diff Col1_0 Col1_0_prev
0     1     6     [cat, dog, goat]        NaN    cat         NaN
1     1     6         [cat, sheep]          0    cat         cat
2     1    12        [sheep, goat]          6  sheep         cat
3     2     3          [cat, lion]        NaN    cat         NaN
4     2     5  [fish, goat, lemur]          2   fish         cat
5     3     9           [cat, dog]        NaN    cat         NaN
6     4     4          [dog, goat]        NaN    dog         NaN
7     4    11                [cat]          7    cat         dog
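
Carrying this one step further (my own sketch, not from the original answer): once `time_diff` exists, the question's `newcol1`/`newcol2` can be filled without a loop. Here `newcol1` marks rows that start a new block (a user's first row, or a gap greater than 1 within a user) and `newcol2` marks the row immediately before each start; note this reading also flags the very first row of each user, which the original loop skips for row 0.

```python
import pandas as pd

df = pd.DataFrame({
    'User': [1, 1, 1, 2, 2, 3, 4, 4],
    'Time': [6, 6, 12, 3, 5, 9, 4, 11],
})

# Per-user gap to the previous row; NaN on each user's first row.
df['time_diff'] = df.groupby('User')['Time'].diff()

# Block starts: first row of a user, or a gap of more than 1 (hour).
df['newcol1'] = (df['time_diff'].isna() | (df['time_diff'] > 1)).astype(int)
# The row just before each block start.
df['newcol2'] = df['newcol1'].shift(-1, fill_value=0)
```

Two vectorized passes replace the whole per-row loop for these columns.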

As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this you only have to do an expensive extraction once and after that can stick to standard pandas methods.
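
The change detection on the first element can then be vectorized the same way; a sketch using the columns built above (`.str[0]` is an alternative to the `apply`, and it also indexes into lists stored in an object column):

```python
import pandas as pd

df = pd.DataFrame({
    'User': [1, 1, 1, 2, 2],
    'Col1': [['cat', 'dog'], ['cat', 'sheep'], ['sheep', 'goat'],
             ['cat', 'lion'], ['fish', 'goat']],
})

# One pass over the object column, then ordinary vectorized comparisons.
df['Col1_0'] = df['Col1'].str[0]
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()

# True where the first element changed relative to the previous row of
# the same user (a NaN prev means "no previous row", so no change).
df['first_changed'] = df['Col1_0'].ne(df['Col1_0_prev']) & df['Col1_0_prev'].notna()
print(df['first_changed'].tolist())
```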
