在Python Pandas Dataframe中动态添加列的数据处理 [英] Data processing with adding columns dynamically in Python Pandas Dataframe

查看:138
本文介绍了在Python Pandas Dataframe中动态添加列的数据处理的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下问题. 可以说这是我的CSV

I have the following problem. Lets say this is my CSV

id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
2  2  6  0
..........

因此,我有可以按ID分组的行. 我想创建如下的csv作为输出.

So, I have rows which can be grouped by id. I want to create a csv like below as an output.

f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4  5  5   3   1    0    7      4      4      1   4     6  

因此,我希望能够选择要转换成列的行数(始终从id的第一行开始).在这种情况下,我抓了3行. 然后,我还将跳过一个或多个行(在这种情况下,仅跳过一个),以从同一id组的最后一行获取最后一列.出于某种原因,我想使用一个数据框.

So, I want to be able to chose the number of rows I will grab to convert into columns (always starting from the first row of an id). In this case I grabbed 3 rows. I will also then skip one or more rows (in this case only one skip) to take the final columns from the last row of the same id group. And for reasons, I want to use a data frame.

经过3-4小时的奋斗.我找到了下面给出的解决方案. 但是我的解决方案很慢.我大约有700,000行,可能有大约70,000组ID.在我的4GB 4核心Lenovo上,model = 3处的上述代码将花费近一个小时.我需要进入模型=也许是10或15.我仍然是Python的新手,并且我相信可以进行一些更改来加快速度.有人可以深入解释我如何改进代码.

After struggling for a 3-4 hours. I found out a solution as given below. But my solution is very slow. I have about 700,000 rows and may be around 70,000 groups of ids. The code above at model=3 takes almost an hour on my 4GB 4 Core Lenovo. I need to go to model = maybe 10 or 15. I am still novice in Python and I am sure there can be several changes that will make this fast. Can some one explain deeply how I can improve on the code.

一吨.

model:要抓取的行数

model : number of rows to grab

# train data frame from reading the csv
train = pd.read_csv(filename)

# Get groups of rows with same id
csv_by_id = train.groupby('id')

modelTarget = { 'f1_t','f2_t','f3_t'}

# modelFeatures is a list of features I am interested in the csv. 
    # The csv actually has hundreds
modelFeatures = { 'f1, 'f2' , 'f3' }

coreFeatures = list(modelFeatures) # cloning 


selectedFeatures = list(modelFeatures) # cloning

newFeatures = list(selectedFeatures) # cloning

finalFeatures = list(selectedFeatures) # cloning

# Now create the column list depending on the number of rows I will grab from
for x in range(2,model+1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures

# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget) 

# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)

for id_group in csv_by_id:
    #id_group is a tuple with first element as the id itself and second one a dataframe with the rows of a group
    group_data = id_group[1] 

    #hmm - can this be better? I am picking up the rows which I need from first row on wards
    df = group_data[coreFeatures][0:model] 

    # initialize a list
    tmp = [] 

    # now keep adding the column values into the list
    for index, row in df.iterrows(): 
        tmp = tmp + list(row)


    # Wow, this one below surely should have something better. 
    # So i am picking up the feature column values from the last row of the group of rows for a particular id 
    targetValues = group_data[list({'f1','f2','f3'})][len(group_data.index)-1:len(group_data.index)].values 

    # Think this can be done easier too ? . Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten()) 

    # coverting the list to a dict.
    tmpDict = dict(zip(selectedFeatures,tmp))  

    # then the dict to a dataframe.
    tmpDf = pd.DataFrame(tmpDict,index={1}) 

    # I just could not find a better way of adding a dict or list directly into a dataframe. 
    # And I went through lots and lots of blogs on this topic, including some in StackOverflow.

    # finally I add the frame to my main frame
    model_data = model_data.append(tmpDf) 

# and write it
model_data.to_csv(wd+'model_data' + str(model) + '.csv',index=False) 

推荐答案

Groupby 是你的朋友.

这将很好地扩展;特征数量中只有很小的常数.大约是O(组数)

This will scale very well; only a small constant in the number of features. It will be roughly O(number of groups)

In [28]: features = ['f1','f2','f3']

创建一些测试数据,小组人数为7-12,每小组70k

Create some test data, group sizes are 7-12, 70k groups

In [29]: def create_df(i):
   ....:     l = np.random.randint(7,12)
   ....:     df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
   ....:     df['A'] = i
   ....:     return df
   ....: 

In [30]: df = concat([ create_df(i) for i in xrange(70000) ])

In [39]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629885 entries, 0 to 9
Data columns (total 4 columns):
f1    629885 non-null int64
f2    629885 non-null int64
f3    629885 non-null int64
A     629885 non-null int64
dtypes: int64(4)

创建一个框架,从每个组中选择前3行和最后一行(请注意,这将处理尺寸小于4的组,但是您的最后一行可能与另一行重叠,您可能希望执行groupby.filter对此进行补救)

Create a frame where you select the first 3 rows and the final row from each group (note that this WILL handle groups of size < 4, however your final row may overlap another, you may wish to do a groupby.filter to remedy this)

In [31]: groups = concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()

# This step is necesary in pandas < master/0.14 as the returned fields 
# will include the grouping field (the A), (is a bug/API issue)
In [33]: groups = groups[features]

In [34]: groups.head(20)
Out[34]: 
     f1  f2  f3
A              
0 0   0   0   0
  1   1   1   1
  2   2   2   2
  7   7   7   7
1 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9
2 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
3 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
4 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9

[20 rows x 3 columns]

In [38]: groups.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 280000 entries, (0, 0) to (69999, 9)
Data columns (total 3 columns):
f1    280000 non-null int64
f2    280000 non-null int64
f3    280000 non-null int64
dtypes: int64(3)

非常快

In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
1 loops, best of 3: 1.16 s per loop

要进行进一步的操作,通常应在此处停止并使用它(因为它以易于处理的很好的分组格式).

For further manipulation you usually should stop here and use this (as its in a nice grouped format that's easy to deal with).

如果您想将其翻译成宽格式

If you want to translate this to a wide format

In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))

In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
dfg.head()
groups.info()
1 loops, best of 3: 14.5 s per loop
In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]

In [41]: dfg.head()
Out[41]: 
   f1_1  f2_1  f3_1  f1_2  f2_2  f3_2  f1_3  f2_3  f3_3  f1_4  f2_4  f3_4
A                                                                        
0     0     0     0     1     1     1     2     2     2     7     7     7
1     0     0     0     1     1     1     2     2     2     9     9     9
2     0     0     0     1     1     1     2     2     2     8     8     8
3     0     0     0     1     1     1     2     2     2     8     8     8
4     0     0     0     1     1     1     2     2     2     9     9     9

[5 rows x 12 columns]

In [42]: dfg.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70000 entries, 0 to 69999
Data columns (total 12 columns):
f1_1    70000 non-null int64
f2_1    70000 non-null int64
f3_1    70000 non-null int64
f1_2    70000 non-null int64
f2_2    70000 non-null int64
f3_2    70000 non-null int64
f1_3    70000 non-null int64
f2_3    70000 non-null int64
f3_3    70000 non-null int64
f1_4    70000 non-null int64
f2_4    70000 non-null int64
f3_4    70000 non-null int64
dtypes: int64(12)

这篇关于在Python Pandas Dataframe中动态添加列的数据处理的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆