Faster ways to sort and append large dataframes


Problem description

I'm trying to sort some sales data by day of sale and product ID, and then compute some statistics with pandas. Is there an efficient way to do this? My dataset has millions of rows.

The dataset looks like this (df1, 3,000,000+ rows):

|productID |productCategory |expiryDate         |Price |Currency |quantitySold |daySold            |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |5675         |18feb2017 12:45:17 |
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |654          |31jan2017 12:45:17 |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |300          |18feb2017 09:17:15 |
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |200          |31jan2017 15:30:35 |
|Rt4564    |fdgdf           |13jun2018 00:00:00 |45    |USD      |1544         |31feb2017 13:25:31 |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |4487         |18mar2017 09:17:15 |
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |7895         |31aug2017 15:30:35 |

I would like to compute some simple statistics per productID, per day. So I think my code should first order the rows per day and then per product. Then it should compute the statistics and add them to the table.

The result in this example is (df2):

|productID |productCategory |expiryDate         |Price |Currency |quantitySold |daySold            |volSTD |totalVol |totalRevenue|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |654          |31jan2017 12:45:17 |321.02 |854      |33306       |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |300          |31jan2017 15:30:35 |0      |300      |37500       |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |5675         |18feb2017 12:45:17 |840.04 |10162    |1270250     |
|Rt4564    |fdgdf           |13jun2018 00:00:00 |45    |USD      |1544         |31feb2017 13:25:31 |0      |544      |69480       |
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |7895         |31aug2017 15:30:35 |0      |7895     |307905      |

I use a nested for loop in pandas which gives the expected result, but it takes far too long (a couple of hours). I am looking for a fast way to get this result.

My code (probably one of the worst you've ever seen):

import pandas as pd

uniqueDays = df1.daySold.unique()
numberOfDays = uniqueDays.shape[0]  # number of unique days, not rows of df1
df_results = pd.DataFrame(columns=['productID', 'productCategory', 'expiryDate',
                                   'Price', 'Currency', 'quantitySold', 'daySold',
                                   'volSTD', 'totalVol', 'totalRevenue'])
for i in range(numberOfDays):
    temp1 = df1.loc[df1['daySold'] == uniqueDays[i]]
    uniqueID = temp1.productID.unique()
    numberOfUniqueID = uniqueID.shape[0]
    for j in range(numberOfUniqueID):
        temp2 = temp1.loc[temp1['productID'] == uniqueID[j]]  # filter on productID here
        volSTD = temp2.quantitySold.std()
        totalVol = temp2.quantitySold.sum()
        totalRevenue = temp2.quantitySold.dot(temp2.Price)
        temp3 = temp2.iloc[0].copy()  # it does not matter which row I pick
        temp3['volSTD'] = volSTD
        temp3['totalVol'] = totalVol
        temp3['totalRevenue'] = totalRevenue
        df_results = df_results.append(temp3)

This gives me the results I want, but it is way too slow. In particular, adding the columns (volSTD, totalVol and totalRevenue) to temp3 and appending temp3 to df_results take a combined 81.3% of the processing time.

Does anyone have a faster way to do this? Using vectors? Or filling an existing dataframe instead of appending?

Thanks

Recommended answer

How about groupby? It handles the iteration much more efficiently than explicit loops, in much shorter and more readable code. You would group on daySold and productID. This is obviously mock data, but you would want to turn daySold into a datetime object first so you can easily group on it - I just kept the day, but you could keep the time if needed:

df.daySold = pd.to_datetime(df.daySold.apply(lambda x: x[:9]), format="%d%b%Y")
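A minimal, self-contained sketch of that parsing step (the two sample timestamps are taken from the question's table; the first 9 characters of each string are the date part):

```python
import pandas as pd

# Two of the question's timestamps as raw strings.
df = pd.DataFrame({"daySold": ["18feb2017 12:45:17", "31jan2017 12:45:17"]})

# Keep only the date portion (first 9 characters) and parse it.
df["daySold"] = pd.to_datetime(df["daySold"].str[:9], format="%d%b%Y")
# To keep the time as well, parse the full string instead:
#   pd.to_datetime(raw_strings, format="%d%b%Y %H:%M:%S")
```

`.str[:9]` is a vectorized alternative to the `apply(lambda x: x[:9])` above; both produce the same result.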

Then it is just a one-liner. With the groupby object you can pass a number of different aggregation calls:

import numpy as np

df.groupby(['daySold', 'productID']).agg({'quantitySold': [sum, np.std], 'Price': [sum, np.std]})

                     quantitySold              Price     
                              sum          std   sum  std
daySold    productID                                     
2017-01-31 Sd23454            854   321.026479    78  0.0
2017-02-13 Rt4564            1544          NaN    45  NaN
2017-02-18 Fdgd4             5975  3800.698949   250  0.0
2017-03-18 Fdgd4             4487          NaN   125  NaN
2017-08-30 Sd23454           7895          NaN    39  NaN

You can use the groupby object to apply all manner of functions, both off-the-shelf ones and ones you define yourself.

So you could do a dot product, which requires two columns / arrays of the dataframe, like so:

def dotter(df):
    return np.sum(df.quantitySold*df.Price)
    ## or if you want to use numpy--may be faster for large datasets:
    #return np.dot(df.quantitySold,df.Price)

Call it by using the apply method of the groupby object:

df.groupby(['daySold','productID']).apply(dotter)

daySold     productID
2017-01-31  Sd23454       33306
2017-02-13  Rt4564        69480
2017-02-18  Fdgd4        746875
2017-03-18  Fdgd4        560875
2017-08-30  Sd23454      307905
dtype: int64
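Putting the pieces together, one way to get all three of the question's columns (volSTD, totalVol, totalRevenue) in a single vectorized pass is to precompute a per-row revenue column, so the total revenue becomes a plain groupby sum instead of a per-group dot product. This is a sketch, not part of the original answer; the mini-dataset is hypothetical and mirrors a few rows from the question:

```python
import pandas as pd

# Hypothetical mini-dataset mirroring the 31jan2017 rows in the question.
df = pd.DataFrame({
    "productID":    ["Sd23454", "Sd23454", "Fdgd4"],
    "Price":        [39, 39, 125],
    "quantitySold": [654, 200, 300],
    "daySold":      pd.to_datetime(["2017-01-31", "2017-01-31", "2017-01-31"]),
})

# Revenue per row, so totalRevenue is just a groupby sum.
df["revenue"] = df["quantitySold"] * df["Price"]

# Named aggregation: one output column per (input column, function) pair.
stats = (
    df.groupby(["daySold", "productID"], as_index=False)
      .agg(volSTD=("quantitySold", "std"),
           totalVol=("quantitySold", "sum"),
           totalRevenue=("revenue", "sum"))
)
```

For the Sd23454 rows this reproduces the question's expected values: totalVol 854, totalRevenue 33306, and volSTD around 321.02 (the sample standard deviation of 654 and 200).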
