One-hot encoding multi-level column data


Problem description

I have the following data frame, which contains records with features about different subjects:

ID   Feature
-------------------------
1    A
1    B
2    A
1    A
3    B
3    B
1    C
2    C
3    D

I'd like to get another (aggregated?) data frame where each row represents a specific subject and the columns form an exhaustive list of all one-hot encoded features:

ID   FEATURE_A FEATURE_B FEATURE_C FEATURE_D
--------------------------------------------
1    1         1         1         0
2    1         0         1         0
3    0         1         0         1

How could this be implemented in Python (Pandas)?

Bonus: how could a version be implemented where the feature columns contain occurrence counts, not just binary flags?

Recommended answer

Use join with get_dummies, then groupby and aggregate with max:

# dtype=int keeps 0/1 columns; newer pandas versions return booleans by default
df1 = df[['ID']].join(pd.get_dummies(df['Feature'], dtype=int).add_prefix('FEATURE_')).groupby('ID').max()
print(df1)
    FEATURE_A  FEATURE_B  FEATURE_C  FEATURE_D
ID                                            
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1

Details:

print(pd.get_dummies(df['Feature'], dtype=int))
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  1  0  0  0
3  1  0  0  0
4  0  1  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  1  0
8  0  0  0  1
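
For the bonus question, the indicator table shown in the details suggests one possible route: summing the indicator columns per ID, rather than taking the maximum, yields occurrence counts. The following is a minimal sketch under that assumption, using the sample data from the question; the variable names counts and counts_ct are illustrative, and pd.crosstab is included only as an alternative way to build the same table.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID':      [1, 1, 2, 1, 3, 3, 1, 2, 3],
    'Feature': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
})

# One indicator column per feature value, one row per original record.
dummies = pd.get_dummies(df['Feature'], dtype=int).add_prefix('FEATURE_')

# Summing the indicators per ID counts occurrences instead of flagging presence.
counts = df[['ID']].join(dummies).groupby('ID').sum()

# pd.crosstab builds the same count table directly from the two columns.
counts_ct = pd.crosstab(df['ID'], df['Feature']).add_prefix('FEATURE_')

print(counts)
```

With this data, ID 1 gets a count of 2 for FEATURE_A and ID 3 gets a count of 2 for FEATURE_B, because those pairs appear twice in the input.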

Another solution with MultiLabelBinarizer and the DataFrame constructor:

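import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each string is iterated character by character, so this relies on single-character labels
df1 = pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_],
                   index=df.ID).groupby(level=0).max()  # .max(level=0) was removed in pandas 2.0
print(df1)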
    FEATURE_A  FEATURE_B  FEATURE_C  FEATURE_D
ID                                            
1           1          1          1          0
2           1          0          1          0
3           0          1          0          1
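
One caveat with this approach: MultiLabelBinarizer expects an iterable of label collections, so a plain string is split into its characters. That is harmless for the single-letter features here, but multi-character names would be broken apart. The sketch below wraps each label in a one-element list to avoid that; the df_long frame and its colour names are hypothetical and not part of the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical data with multi-character feature names (not from the question).
df_long = pd.DataFrame({
    'ID':      [1, 1, 2, 3],
    'Feature': ['red', 'blue', 'red', 'green'],
})

mlb = MultiLabelBinarizer()
# Wrap each label in a list so it is treated as one label, not as characters.
encoded = mlb.fit_transform([[f] for f in df_long['Feature']])

df_flags = pd.DataFrame(encoded,
                        columns=['FEATURE_' + c for c in mlb.classes_],
                        index=df_long['ID']).groupby(level=0).max()
print(df_flags)
```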

Timings:

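import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# 100,000 rows, 15 distinct features, up to 10,000 distinct IDs
np.random.seed(123)
N = 100000
L = list('abcdefghijklmno'.upper())
df = pd.DataFrame({'Feature': np.random.choice(L, N),
                   'ID': np.random.randint(10000, size=N)})

def jez(df):
    mlb = MultiLabelBinarizer()
    # groupby(level=0).max() is equivalent to the older .max(level=0)
    return pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_],
                   index=df.ID).groupby(level=0).max()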


#jez1
In [464]: %timeit (df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max())
10 loops, best of 3: 39.3 ms per loop

In [465]: %timeit (jez(df))
10 loops, best of 3: 138 ms per loop

#Scott Boston1
In [466]: %timeit (df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0))
1 loop, best of 3: 1.03 s per loop

#wen1
In [467]: %timeit (pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE '))
1 loop, best of 3: 383 ms per loop

#wen2
In [468]: %timeit (pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0))
10 loops, best of 3: 47 ms per loop

Caveat

These timings do not account for the ratio of distinct Feature values to distinct IDs, which strongly affects the timings of some of these solutions.
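
To see how that ratio matters on a given machine, the benchmark can be repeated with different numbers of IDs. The following is a rough sketch, not a rigorous benchmark: the helpers make_frame, via_get_dummies and via_crosstab are hypothetical names introduced here, the row counts are arbitrary, and only two of the approaches above are compared.

```python
import timeit

import numpy as np
import pandas as pd

def make_frame(n_rows, n_ids, n_features, seed=123):
    """Build random (ID, Feature) records with a chosen number of IDs and features."""
    rng = np.random.default_rng(seed)
    labels = [chr(ord('A') + i) for i in range(n_features)]
    return pd.DataFrame({'Feature': rng.choice(labels, n_rows),
                         'ID': rng.integers(n_ids, size=n_rows)})

def via_get_dummies(df):
    # Indicator columns per record, then the per-ID maximum.
    return (df[['ID']]
            .join(pd.get_dummies(df['Feature'], dtype=int).add_prefix('FEATURE_'))
            .groupby('ID').max())

def via_crosstab(df):
    # Count table per ID, then reduced to 0/1 flags.
    return (pd.crosstab(df['ID'], df['Feature'])
            .gt(0).astype(int).add_prefix('FEATURE_'))

# Same number of rows and features, few IDs versus many IDs.
for n_ids in (10, 1000):
    frame = make_frame(n_rows=20000, n_ids=n_ids, n_features=15)
    for func in (via_get_dummies, via_crosstab):
        seconds = min(timeit.repeat(lambda: func(frame), number=1, repeat=3))
        print(f'{func.__name__:>16}  n_ids={n_ids:<5}  {seconds * 1000:.1f} ms')
```

Absolute numbers will differ by machine and pandas version, so only the relative change between the two n_ids settings is of interest.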
