Faster ways to sort and append large dataframe

Question
I'm trying to sort some sales data per product day of sale and product ID, and then I would like to compute some statistics with pandas. Is there an efficient way to do this? My dataset has millions of rows.
The dataset looks like this (df1, 3,000,000+ rows):
|productID |productCategory |expiryDate         |Price |Currency |quantitySold |daySold           |
|----------|----------------|-------------------|------|---------|-------------|------------------|
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |5675         |18feb2017 12:45:17|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |654          |31jan2017 12:45:17|
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |300          |18feb2017 09:17:15|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |200          |31jan2017 15:30:35|
|Rt4564    |fdgdf           |13jun2018 00:00:00 |45    |USD      |1544         |31feb2017 13:25:31|
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |4487         |18mar2017 09:17:15|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |7895         |31aug2017 15:30:35|
I would like to compute some simple statistics per productID, per day. So I think my code should first order the rows per day and then per product. Then it should compute the statistics and add them to the table.
The result in this example is (df2):
|productID |productCategory |expiryDate         |Price |Currency |quantitySold |daySold            |volSTD |totalVol |totalRevenue|
|----------|----------------|-------------------|------|---------|-------------|-------------------|-------|---------|------------|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |654          |31jan2017 12:45:17 |321.02 |854      |33306       |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |300          |31jan2017 15:30:35 |0      |300      |37500       |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |5675         |18feb2017 12:45:17 |840.04 |10162    |1270250     |
|Rt4564    |fdgdf           |13jun2018 00:00:00 |45    |USD      |1544         |31feb2017 13:25:31 |0      |544      |69480       |
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |7895         |31aug2017 15:30:35 |0      |7895     |307905      |
I use a nested for loop in pandas, which gives the expected result, but it takes way too long (a couple of hours). I am looking for a fast way to get this result.
My code (probably one of the worst you've ever seen):
```python
uniqueDays = df1.daySold.unique()
numberOfDays = uniqueDays.shape[0]
df_results = pd.DataFrame(columns=['productID', 'productCategory', 'expiryDate',
                                   'Price', 'Currency', 'quantitySold', 'daySold',
                                   'volSTD', 'totalVol', 'totalRevenue'])
for i in range(0, numberOfDays):
    temp1 = df1.loc[df1['daySold'] == uniqueDays[i]]
    uniqueID = temp1.productID.unique()
    numberOfUniqueID = uniqueID.shape[0]
    for j in range(0, numberOfUniqueID):
        temp2 = temp1.loc[temp1['productID'] == uniqueID[j]]
        volSTD = temp2.quantitySold.std()
        totalVol = temp2.quantitySold.sum()
        totalRevenue = temp2.quantitySold.dot(temp2.Price)
        temp3 = temp2.iloc[0]  # it does not matter which row I pick
        temp3['volSTD'] = volSTD
        temp3['totalVol'] = totalVol
        temp3['totalRevenue'] = totalRevenue
        df_results = df_results.append(temp3)
```
This gives me the results I want, but it is way too slow. In particular, adding the columns (volSTD, totalVol and totalRevenue) to temp3 and appending temp3 to df_results take a combined 81.3% of the processing time.
Does anyone have a faster way to do this? Using vectors? Or filling an existing dataframe instead of appending?
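As a side note on the appending bottleneck: `DataFrame.append` copies the whole frame on every call, so the loop above is quadratic in the number of result rows. Even before vectorizing anything, collecting plain dicts in a Python list and building the frame once at the end is already much cheaper. A minimal sketch (toy values, not the real dataset):

```python
import pandas as pd

# Instead of df_results = df_results.append(row) inside the loop...
rows = []
for pid, qty in [("Fdgd4", 5675), ("Sd23454", 654)]:
    rows.append({"productID": pid, "totalVol": qty})  # cheap list append

# ...build the frame once at the end: a single O(n) construction.
df_results = pd.DataFrame(rows)
```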
Thanks

Answer
How about `groupby`? It handles the iterations, so to speak, much more efficiently than loops, and in much shorter and more readable code. You would group on `daySold` and `productID`. This is obviously mock data, but you would want to turn your `daySold` into a `datetime` object first so you can easily group on it - I just kept the day, but you could keep the time if needed:
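For reference, the snippets below can be reproduced on a small mock frame built from the question's sample (only the columns the statistics need; two of the question's dates, `31feb2017` and `31aug2017`, are adjusted to the valid dates shown in the answer's output):

```python
import pandas as pd

# Hypothetical reconstruction of the question's sample data (df1).
df = pd.DataFrame({
    "productID":    ["Fdgd4", "Sd23454", "Fdgd4", "Sd23454", "Rt4564", "Fdgd4", "Sd23454"],
    "Price":        [125, 39, 125, 39, 45, 125, 39],
    "quantitySold": [5675, 654, 300, 200, 1544, 4487, 7895],
    "daySold":      ["18feb2017 12:45:17", "31jan2017 12:45:17", "18feb2017 09:17:15",
                     "31jan2017 15:30:35", "13feb2017 13:25:31", "18mar2017 09:17:15",
                     "30aug2017 15:30:35"],
})
```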
```python
# keep only the 9-character date part, e.g. '18feb2017', and parse it
df.daySold = pd.to_datetime(df.daySold.apply(lambda x: x[:9]), format="%d%b%Y")
```
Then it is just a one-liner. With the `groupby` object you can pass a number of different aggregation calls:
```python
df.groupby(['daySold', 'productID']).agg({'quantitySold': [sum, np.std], 'Price': [sum, np.std]})
```

```
                     quantitySold              Price
                              sum         std   sum  std
daySold    productID
2017-01-31 Sd23454            854  321.026479    78  0.0
2017-02-13 Rt4564            1544         NaN    45  NaN
2017-02-18 Fdgd4             5975  3800.698949  250  0.0
2017-03-18 Fdgd4             4487         NaN   125  NaN
2017-08-30 Sd23454           7895         NaN    39  NaN
```
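To get flat column names matching the desired df2 (`volSTD`, `totalVol`, `totalRevenue`) instead of the MultiIndex columns above, named aggregation (available since pandas 0.25) works in one call; a sketch, assuming a precomputed per-row revenue column:

```python
import pandas as pd

df = pd.DataFrame({
    "daySold":      ["2017-01-31", "2017-01-31", "2017-02-18"],
    "productID":    ["Sd23454", "Sd23454", "Fdgd4"],
    "Price":        [39, 39, 125],
    "quantitySold": [654, 200, 5675],
})
df["revenue"] = df.quantitySold * df.Price  # row-level revenue, vectorized

stats = df.groupby(["daySold", "productID"], as_index=False).agg(
    volSTD=("quantitySold", "std"),
    totalVol=("quantitySold", "sum"),
    totalRevenue=("revenue", "sum"),
)
```

This reproduces the question's figures for Sd23454 on 31jan2017: totalVol 854, totalRevenue 33306, volSTD about 321.03.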
You can use the groupby object to apply all manner of functions, both off-the-shelf ones and ones you define yourself.
So you could do a dot product, requiring two columns / arrays of a dataframe, like so:
```python
def dotter(df):
    return np.sum(df.quantitySold * df.Price)
    ## or if you want to use numpy -- may be faster for large datasets:
    # return np.dot(df.quantitySold, df.Price)
```
Call it using the apply method of the groupby object:
```python
df.groupby(['daySold', 'productID']).apply(dotter)
```

```
daySold     productID
2017-01-31  Sd23454       33306
2017-02-13  Rt4564        69480
2017-02-18  Fdgd4        746875
2017-03-18  Fdgd4        560875
2017-08-30  Sd23454      307905
dtype: int64
```
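One caveat worth knowing: `apply` still calls the Python function once per group, so on millions of rows it can be noticeably slower than precomputing the row-level product and letting `groupby(...).sum()` do everything in vectorized form. A self-contained sketch on a small mock frame, showing the two approaches agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "daySold":      ["2017-01-31", "2017-01-31", "2017-02-13"],
    "productID":    ["Sd23454", "Sd23454", "Rt4564"],
    "Price":        [39, 39, 45],
    "quantitySold": [654, 200, 1544],
})

def dotter(g):
    return np.dot(g.quantitySold, g.Price)

# Per-group Python call, as in the answer above.
by_apply = df.groupby(["daySold", "productID"]).apply(dotter)

# Same result without a per-group call: one vectorized multiply, one grouped sum.
by_vector = (df.quantitySold * df.Price).groupby(
    [df.daySold, df.productID]).sum()
```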