Create sub-columns in a dataframe using another dataframe


Problem description

I am new to Python and pandas. I have the following dataframe:

did           features   offset   word   JAPE_feature  manual_feature 
0             200         0        aa      200          200 
0             200         11       bf      200          200
0             200         12       vf      100          100
0             100         13       rw      2200         2200
0             100         14       asd     2600         100 
0             2200        16       dsdd    2200         2200
0             2600        18       wd      2200         2600 
0             2600        20       wsw     2600         2600 
0             4600        21        sd     4600         4600

Now, I have an array that holds all the feature values that can appear for that id:

feat = [100,200,2200,2600,156,162,4600,100]

I am trying to create a dataframe that looks like this:

id                    Features 
           100   200   2200   2600  156   162    4600  100
0           0     1      0     0     0     0      0     0
1           0     1      0     0     0     0      0     0
2           0     1      0     0     0     0      0     0
3           0     1      0     0     0     0      0     0
4           1     0      0     0     0     0      0     0
5           1     0      0     0     0     0      0     0
7           0     0      1     0     0     0      0     0
8           0     0      0     1     0     0      0     0
9           0     0      0     1     0     0      0     0
10          0     0      0     0     0     0      1     0

So, when comparing, the result would be:

feature_manual
     1 
     1  
     0 
     0
     1
     1
     1
     1
     1

Here I am comparing the features and manual_feature columns: if the values are the same the result is 1, otherwise 0. For row 0 both columns hold 200, so the result is 1.

So, this is the expected output. I am trying to write the value 1 for that feature in the new CSV and 0 for the others.

This is done row by row.

So, if we look at the first row, the feature is 200, so there is a 1 under 200 and 0 under the others.

Can anyone help me?

What I have tried:

mux = pd.MultiIndex.from_product([['features'], feat])
df = pd.DataFrame(data, columns=mux)

This creates the sub-columns but removes all the other values. Can anyone help me?

Recommended answer

Use get_dummies. If you need a MultiIndex, pass mux to reindex, and also convert the id column to the index:

feat = [100,200,2200,2600,156,162,4600,100]
mux = pd.MultiIndex.from_product([['features'],feat])

df = pd.get_dummies(df.set_index('id')['features']).reindex(mux, axis=1, fill_value=0)
print (df)
   features                                   
       100  200  2200 2600 156  162  4600 100 
id                                            
0         0    0    0    0    0    0    0    0
1         0    0    0    0    0    0    0    0
2         0    0    0    0    0    0    0    0
4         0    0    0    0    0    0    0    0
5         0    0    0    0    0    0    0    0
7         0    0    0    0    0    0    0    0
8         0    0    0    0    0    0    0    0
9         0    0    0    0    0    0    0    0
10        0    0    0    0    0    0    0    0
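
The printed result above is all zeros, which is consistent with how reindex behaves when flat column labels are matched against MultiIndex tuples: no label matches, so every cell takes fill_value. A minimal sketch of one way to keep the 1s, assuming the repeated 100 in feat can be dropped and that only the features column is needed (the name unique_feat is introduced here for illustration):

```python
import pandas as pd

# Sample data copied from the question (only the column used here).
df = pd.DataFrame({
    "features": [200, 200, 200, 100, 100, 2200, 2600, 2600, 4600],
})
feat = [100, 200, 2200, 2600, 156, 162, 4600, 100]

# Assumption: the second 100 is a repeat, so drop it while keeping order.
unique_feat = list(dict.fromkeys(feat))

# One-hot encode, then reindex against the flat labels so existing
# values line up; features that never occur get a column of zeros.
dummies = pd.get_dummies(df["features"], dtype=int).reindex(
    columns=unique_feat, fill_value=0
)

# Only now put the 'features' level on top of the matched columns.
dummies.columns = pd.MultiIndex.from_product([["features"], unique_feat])
print(dummies)
```

Reindexing first and attaching the top level afterwards avoids comparing scalar labels with tuples.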

cols = ['features', 'JAPE_feature', 'manual_feature']

df = pd.get_dummies(df, columns=cols)
df.columns = df.columns.str.rsplit('_', n=1, expand=True)
print (df)
  did offset  word features                    JAPE_feature                \
  NaN    NaN   NaN      100 200 2200 2600 4600          100 200 2200 2600   
0   0      0    aa        0   1    0    0    0            0   1    0    0   
1   0     11    bf        0   1    0    0    0            0   1    0    0   
2   0     12    vf        0   1    0    0    0            1   0    0    0   
3   0     13    rw        1   0    0    0    0            0   0    1    0   
4   0     14   asd        1   0    0    0    0            0   0    0    1   
5   0     16  dsdd        0   0    1    0    0            0   0    1    0   
6   0     18    wd        0   0    0    1    0            0   0    1    0   
7   0     20   wsw        0   0    0    1    0            0   0    0    1   
8   0     21    sd        0   0    0    0    1            0   0    0    0   

       manual_feature                     
  4600            100 200 2200 2600 4600  
0    0              0   1    0    0    0  
1    0              0   1    0    0    0  
2    0              1   0    0    0    0  
3    0              0   0    1    0    0  
4    0              1   0    0    0    0  
5    0              0   0    1    0    0  
6    0              0   0    0    1    0  
7    0              0   0    0    1    0  
8    1              0   0    0    0    1  

If you want to avoid missing values in the column MultiIndex for the columns that have no second level, move those columns into the index first:

cols = ['features', 'JAPE_feature', 'manual_feature']
df = df.set_index(df.columns.difference(cols).tolist())

df = pd.get_dummies(df, columns=cols)
df.columns = df.columns.str.rsplit('_', n=1, expand=True)
print (df)
                features                    JAPE_feature                     \
                     100 200 2200 2600 4600          100 200 2200 2600 4600   
did offset word                                                               
0   0      aa          0   1    0    0    0            0   1    0    0    0   
    11     bf          0   1    0    0    0            0   1    0    0    0   
    12     vf          0   1    0    0    0            1   0    0    0    0   
    13     rw          1   0    0    0    0            0   0    1    0    0   
    14     asd         1   0    0    0    0            0   0    0    1    0   
    16     dsdd        0   0    1    0    0            0   0    1    0    0   
    18     wd          0   0    0    1    0            0   0    1    0    0   
    20     wsw         0   0    0    1    0            0   0    0    1    0   
    21     sd          0   0    0    0    1            0   0    0    0    1   

                manual_feature                     
                           100 200 2200 2600 4600  
did offset word                                    
0   0      aa                0   1    0    0    0  
    11     bf                0   1    0    0    0  
    12     vf                1   0    0    0    0  
    13     rw                0   0    1    0    0  
    14     asd               1   0    0    0    0  
    16     dsdd              0   0    1    0    0  
    18     wd                0   0    0    1    0  
    20     wsw               0   0    0    1    0  
    21     sd                0   0    0    0    1 
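
For readers who want to run this step end to end, here is a self-contained sketch built from the question's sample data. The variable name out and the dtype=int argument are additions made here (newer pandas versions return boolean dummies by default); note that the second column level ends up holding strings such as '200', not integers.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "did": [0] * 9,
    "features": [200, 200, 200, 100, 100, 2200, 2600, 2600, 4600],
    "offset": [0, 11, 12, 13, 14, 16, 18, 20, 21],
    "word": ["aa", "bf", "vf", "rw", "asd", "dsdd", "wd", "wsw", "sd"],
    "JAPE_feature": [200, 200, 100, 2200, 2600, 2200, 2200, 2600, 4600],
    "manual_feature": [200, 200, 100, 2200, 100, 2200, 2600, 2600, 4600],
})

cols = ["features", "JAPE_feature", "manual_feature"]

# Move every non-encoded column into the index so that the column
# axis only contains names of the form '<column>_<value>'.
out = df.set_index(df.columns.difference(cols).tolist())
out = pd.get_dummies(out, columns=cols, dtype=int)

# Split on the last underscore only, so 'JAPE_feature_200' becomes
# ('JAPE_feature', '200') rather than three pieces.
out.columns = out.columns.str.rsplit("_", n=1, expand=True)
print(out)
```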

If you want to compare some columns from a list against the manual_feature column, use DataFrame.eq and convert the result to integers:

cols = ['JAPE_feature', 'features']
df1 = df[cols].eq(df['manual_feature'], axis=0).astype(int)
print (df1)
   JAPE_feature  features
0             1         1
1             1         1
2             1         0
3             1         0
4             0         1
5             1         1
6             0         1
7             1         1
8             1         1 
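
The question asks for a single feature_manual column rather than a two-column frame. A minimal sketch of that variant, assuming only the features and manual_feature columns matter, uses Series.eq in the same way:

```python
import pandas as pd

# Sample data copied from the question (only the two compared columns).
df = pd.DataFrame({
    "features": [200, 200, 200, 100, 100, 2200, 2600, 2600, 4600],
    "manual_feature": [200, 200, 100, 2200, 100, 2200, 2600, 2600, 4600],
})

# Row-wise equality gives booleans; astype(int) turns them into 1/0.
df["feature_manual"] = df["features"].eq(df["manual_feature"]).astype(int)
print(df["feature_manual"].tolist())
# [1, 1, 0, 0, 1, 1, 1, 1, 1]
```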
