Generate new columns as a combination of other columns

Question

I have a DataFrame in which several columns hold the components of an identifier and another column holds the value associated with that identifier. I want to create n columns, one for each distinct identifier.

            foo    Type      ID     Index    Value  
25090        x       A        0       0     23272000
25090        x       A        0       0     23272000
25091        x       A        1       0     22896000
25092        x       B        0       1     20048000
25093        y       A        0       0     19760000
25092        y       B        0       1     20823342

I want n columns, one per Type_ID_Index grouping (which I can get with groupby), each holding the corresponding value. The values should be associated with foo.

           foo          A_0_0     A_1_0     B_0_1      
25090        x         23272000  22896000  20048000
25090        x         23272000  22896000  20048000
25091        x         23272000  22896000  20048000
25092        x         23272000  22896000  20048000
25093        y         19760000  21568000  20823342
25092        y         19760000  21568000  20823342

How can I do this?

Recommended answer

Start with the example data.

In [3]: df
Out[3]: 
      foo  bar Type  ID  Index     Value

25090   x    9    A   0      0  23272000
25090   x    5    A   0      0  23272000
25091   x    3    A   1      0  22896000
25092   x    3    B   0      1  20048000
25093   y    6    A   0      0  19760000
25092   y    4    B   0      1  20823342

Concatenate each row's identifier components by applying join row-wise.

In [4]: identifier = df[['Type', 'ID', 'Index']].apply(
             lambda x: '_'.join(map(str, x)), axis=1)
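The row-wise apply above calls a Python function once per row. As a minimal sketch, assuming the same example data as In [3], the identifier can also be built column-wise by casting each component to str and concatenating. The name identifier_vec is only illustrative and is not part of the original answer.

```python
import pandas as pd

# Example data copied from the answer (In [3]).
df = pd.DataFrame(
    {
        "foo": ["x", "x", "x", "x", "y", "y"],
        "bar": [9, 5, 3, 3, 6, 4],
        "Type": ["A", "A", "A", "B", "A", "B"],
        "ID": [0, 0, 1, 0, 0, 0],
        "Index": [0, 0, 0, 1, 0, 1],
        "Value": [23272000, 23272000, 22896000, 20048000, 19760000, 20823342],
    },
    index=[25090, 25090, 25091, 25092, 25093, 25092],
)

# Row-wise version, as in the answer.
identifier = df[["Type", "ID", "Index"]].apply(
    lambda row: "_".join(map(str, row)), axis=1
)

# Column-wise alternative (illustrative): cast each part to str and add.
identifier_vec = (
    df["Type"].astype(str)
    + "_"
    + df["ID"].astype(str)
    + "_"
    + df["Index"].astype(str)
)

print(identifier_vec.tolist())
```

Both versions should give the same strings. The column-wise form mainly matters on large frames, where per-row Python calls become slow.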

Make a Series from your Value column, and index it by foo and the identifier.

In [5]: v = df['Value']

In [6]: v.index = pd.MultiIndex.from_arrays([df['foo'], identifier])

In [7]: v
Out[7]: 
foo       
x    A_0_0    23272000
     A_0_0    23272000
     A_1_0    22896000
     B_0_1    20048000
y    A_0_0    19760000
     B_0_1    20823342
Name: Value, dtype: int64

Unstack it, and join it to the original DataFrame on 'foo'.

In [8]: df[['foo', 'bar']].join(v.drop_duplicates().unstack(), on='foo')
Out[8]: 
      foo  bar     A_0_0     A_1_0     B_0_1

25090   x    9  23272000  22896000  20048000
25090   x    5  23272000  22896000  20048000
25091   x    3  23272000  22896000  20048000
25092   x    3  23272000  22896000  20048000
25093   y    6  19760000       NaN  20823342
25092   y    4  19760000       NaN  20823342
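For readers who want to run the steps outside an interactive session, here is a self-contained sketch of the same pipeline. Two details are my own assumptions rather than part of the answer: the level name "key" passed to MultiIndex.from_arrays, and the .copy() on the Value column, which keeps the re-indexing from touching df.

```python
import pandas as pd

# Example data copied from the answer (In [3]).
df = pd.DataFrame(
    {
        "foo": ["x", "x", "x", "x", "y", "y"],
        "bar": [9, 5, 3, 3, 6, 4],
        "Type": ["A", "A", "A", "B", "A", "B"],
        "ID": [0, 0, 1, 0, 0, 0],
        "Index": [0, 0, 0, 1, 0, 1],
        "Value": [23272000, 23272000, 22896000, 20048000, 19760000, 20823342],
    },
    index=[25090, 25090, 25091, 25092, 25093, 25092],
)

# Step 1: build the Type_ID_Index key for every row.
key = df[["Type", "ID", "Index"]].apply(
    lambda row: "_".join(map(str, row)), axis=1
)

# Step 2: index a copy of the values by (foo, key).
v = df["Value"].copy()
v.index = pd.MultiIndex.from_arrays([df["foo"], key], names=["foo", "key"])

# Step 3: drop repeated values, then pivot the key level into columns.
wide = v.drop_duplicates().unstack()

# Step 4: attach the wide table to every original row via its foo value.
result = df[["foo", "bar"]].join(wide, on="foo")
print(result)
```

The join keeps the original row labels, so repeated labels such as 25090 and 25092 pass through unchanged.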

Notice that I dropped the duplicates in v before unstacking it. This is essential: unstack cannot reshape a Series whose index contains duplicate entries. If the same identifier has different values anywhere in your dataset, those rows survive drop_duplicates and you will run into trouble.
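One related point worth checking: Series.drop_duplicates compares values, not index labels, so two different identifiers that happen to share a value would lose one entry. The sketch below uses small hypothetical data (the frame long_df and its numbers are invented for illustration). It shows one way to detect conflicting values up front and to deduplicate on the key columns instead.

```python
import pandas as pd

# Hypothetical long-format data: "A_0_0" repeats for foo="x" (harmless),
# while "A_1_0" and "B_0_1" are different keys that share the value 20.
long_df = pd.DataFrame(
    {
        "foo": ["x", "x", "x", "x"],
        "key": ["A_0_0", "A_0_0", "A_1_0", "B_0_1"],
        "Value": [10, 10, 20, 20],
    }
)

# Flag (foo, key) pairs that carry more than one distinct value;
# any such pair would make unstack fail on a duplicate index.
n_values = long_df.groupby(["foo", "key"])["Value"].nunique()
conflicts = n_values[n_values > 1]

# Value-based deduplication drops "B_0_1" because its value equals "A_1_0".
by_value = long_df.set_index(["foo", "key"])["Value"].drop_duplicates().unstack()

# Key-based deduplication keeps one row per (foo, key) pair instead.
deduped = long_df.drop_duplicates(subset=["foo", "key"])
by_key = deduped.set_index(["foo", "key"])["Value"].unstack()

print(conflicts.empty)
print(list(by_value.columns))
print(list(by_key.columns))
```

If conflicts is not empty, the data needs an explicit rule (for example first, last, or mean) before it can be reshaped. Which rule is appropriate depends on the dataset.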

Minor points: Your example output has a row (25094) that is missing from your example input. Also, the NaNs in my output make sense: no value is specified for A_1_0 when foo='y'.
