Efficient use of Numpy to process in blocks of rows

Question
I need to iterate over a set of unique accounts (AccountID in the example code below) and calculate a selection of features for each unique AccountID (currently just showing TargetCol as an example). In reality, I am reading a csv file into a Pandas dataframe (1M rows) and then converting it to a Numpy records array so that I can still refer to the header names inside the loops. The way I have approached this is to create a slice for each unique AccountID, calculate TargetCol for each slice, and then concatenate the slices back together.
The code I have below works OK, but I am pretty sure it can be done in a much more efficient way (by efficient I mean reduced processing time).
%%time
import pandas as pd
import numpy as np
from numpy.random import randn

x = 300  # make x higher to test more records
df = pd.DataFrame(randn(x, 3), columns=['AccountID', 'Bcol', 'Ccol'])
for m, row in df.iterrows():
    df.loc[m, 'AccountID'] = np.random.randint(int(x / 10))
    df.loc[m, 'Bcol'] = int(np.random.uniform(low=0.0, high=1000.0)) / 10000
    df.loc[m, 'Ccol'] = int(np.random.uniform(low=0.0, high=1000.0)) / 10000
df['TargetCol'] = np.nan

dfnum = df.to_records(index=False)
dfnum = np.sort(dfnum, order=['AccountID'])
pd.DataFrame(dfnum)

uniquelist = np.unique(dfnum['AccountID'])
for u in range(len(uniquelist)):
    dfslice = dfnum[dfnum['AccountID'] == uniquelist[u]]
    for i in range(len(dfslice)):
        if (len(dfslice) - i) >= 6:
            dfslice['TargetCol'][i] = np.nansum(dfslice['Bcol'][i:i + 6]) / dfslice['Ccol'][i]
        else:
            dfslice['TargetCol'][i] = np.nan
    if u == 0:
        dfconcat = dfslice
    else:
        dfconcat = np.concatenate((dfconcat, dfslice), axis=0)
pd.DataFrame(dfconcat)
Answer

IIUC I think you need:
import pandas as pd
import numpy as np

df = pd.DataFrame({'AccountID': [1, 1, 1, 2, 1, 2, 1, 2, 2],
                   'RefDay': [1, 2, 3, 1, 4, 2, 5, 3, 4],
                   'BCol': [1., 2., np.nan, 1., 3., 2., 1., np.nan, 2.],
                   'CCol': [3., 2., 3., 1., 3., 4., 5., 2., 1.]})
df = df.sort_values(by=['AccountID', 'RefDay']).reset_index(drop=True)

# Replace with 6 in real data
periods = 3
result = df.groupby('AccountID').apply(
    lambda g: g['BCol'].fillna(0).rolling(periods).sum().shift(-periods + 1) / g['CCol'])
df['TargetColumn'] = result.sort_index(level=1).values
print(df)
Output:

   AccountID  BCol  CCol  RefDay  TargetColumn
0          1   1.0   3.0       1      1.000000
1          1   2.0   2.0       2      2.500000
2          1   NaN   3.0       3      1.333333
3          1   3.0   3.0       4           NaN
4          1   1.0   5.0       5           NaN
5          2   1.0   1.0       1      3.000000
6          2   2.0   4.0       2      1.000000
7          2   NaN   2.0       3           NaN
8          2   2.0   1.0       4           NaN
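Since the question asks specifically about NumPy, the same forward-looking window sum can also be built from a cumulative sum over the pre-sorted array, avoiding the inner Python loop over rows entirely. This is a sketch rather than the answer's code; the function name `forward_window_ratio` is made up for illustration, and it assumes the rows are already sorted by account:

```python
import numpy as np

def forward_window_ratio(account, b, c, periods=3):
    """For each row, sum the next `periods` b-values within the same
    account (NaN treated as 0) and divide by c; NaN where the window
    would run past the end of the account's block.
    Assumes rows are sorted by account."""
    b = np.nan_to_num(b, nan=0.0)
    n = len(b)
    out = np.full(n, np.nan)
    # Cumulative sum with a leading zero, so each window sum is
    # two lookups: csum[i + periods] - csum[i].
    csum = np.concatenate(([0.0], np.cumsum(b)))
    # Row indices where each account's block starts and ends.
    starts = np.flatnonzero(np.r_[True, account[1:] != account[:-1]])
    ends = np.r_[starts[1:], n]
    for s, e in zip(starts, ends):
        last = e - periods  # last row with a full window in this block
        if last >= s:
            idx = np.arange(s, last + 1)
            out[idx] = (csum[idx + periods] - csum[idx]) / c[idx]
    return out

# Same data as the answer's sorted frame:
acct = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2])
b = np.array([1., 2., np.nan, 3., 1., 1., 2., np.nan, 2.])
c = np.array([3., 2., 3., 3., 5., 1., 4., 2., 1.])
print(forward_window_ratio(acct, b, c, periods=3))
```

The only remaining Python loop is over accounts, not rows, and each block is handled with vectorized slicing, so this should scale much better than the nested loops in the question.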