One-hot encoding multi-level column data
Question
I have the following data frame where there are records with features about different subjects:
ID Feature
-------------------------
1 A
1 B
2 A
1 A
3 B
3 B
1 C
2 C
3 D
I'd like to get another (aggregated?) data frame where each row represents a specific subject, and the columns form an exhaustive list of all one-hot encoded features:
ID FEATURE_A FEATURE_B FEATURE_C FEATURE_D
--------------------------------------------
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
How could this be implemented in Python (pandas)?
Bonus: how could one implement a version where the feature columns contain occurrence counts, not just binary flags?
Recommended answer
Use join with get_dummies, then groupby and aggregate with max:
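import pandas as pd
# One-hot encode Feature, attach the IDs, then take the max per ID (1 if the feature ever occurred).
# The result goes into df1 so the original df stays available; dtype=int keeps 0/1 instead of booleans.
df1 = df[['ID']].join(pd.get_dummies(df['Feature'], dtype=int).add_prefix('FEATURE_')).groupby('ID').max()
print(df1)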
FEATURE_A FEATURE_B FEATURE_C FEATURE_D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
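To try the one-liner above end to end, the sample frame from the question can be built by hand. This is only a minimal sketch: the names `sample`, `dummies` and `flags` are mine, and `dtype=int` is added on the assumption of a recent pandas version, where get_dummies would otherwise return booleans.

```python
import pandas as pd

# Sample data copied from the question
sample = pd.DataFrame({
    'ID':      [1, 1, 2, 1, 3, 3, 1, 2, 3],
    'Feature': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
})

# One indicator column per feature value, row-aligned with the original index
dummies = pd.get_dummies(sample['Feature'], dtype=int).add_prefix('FEATURE_')

# Attach the IDs, then keep the maximum per ID: 1 if the feature ever occurred
flags = sample[['ID']].join(dummies).groupby('ID').max()
print(flags)
```

The join works because get_dummies keeps the row index of the input column, so each indicator row lines up with its ID before grouping.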
Details:
print (pd.get_dummies(df['Feature']))
A B C D
0 1 0 0 0
1 0 1 0 0
2 1 0 0 0
3 1 0 0 0
4 0 1 0 0
5 0 1 0 0
6 0 0 1 0
7 0 0 1 0
8 0 0 0 1
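For the bonus part of the question, the same pipeline should work with sum in place of max, since summing 0/1 indicator columns per ID counts occurrences. A sketch under that assumption, with pd.crosstab as a cross-check (the names `sample`, `counts` and `counts_xtab` are illustrative, not from the original answer):

```python
import pandas as pd

# Sample data copied from the question
sample = pd.DataFrame({
    'ID':      [1, 1, 2, 1, 3, 3, 1, 2, 3],
    'Feature': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
})

dummies = pd.get_dummies(sample['Feature'], dtype=int).add_prefix('FEATURE_')

# Summing the indicators per ID gives the number of occurrences of each feature
counts = sample[['ID']].join(dummies).groupby('ID').sum()

# pd.crosstab tabulates the same ID-by-Feature frequencies directly
counts_xtab = pd.crosstab(sample['ID'], sample['Feature']).add_prefix('FEATURE_')

print(counts)
```

Both frames should agree; comparing `counts_xtab` against zero, as one of the timed variants does with `.gt(0).astype(int)`, turns the counts back into binary flags.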
Another solution with MultiLabelBinarizer and the DataFrame constructor:
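import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each string is iterated character by character, which works here
# because every feature is a single letter
df1 = pd.DataFrame(mlb.fit_transform(df['Feature']),
                   columns=['FEATURE_' + x for x in mlb.classes_],
                   index=df.ID).groupby(level=0).max()  # .max(level=0) on older pandas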
print (df1)
FEATURE_A FEATURE_B FEATURE_C FEATURE_D
ID
1 1 1 1 0
2 1 0 1 0
3 0 1 0 1
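One detail worth keeping in mind: MultiLabelBinarizer expects each sample to be an iterable of labels, so passing a column of plain strings works here only because every feature is a single character. If features had longer names, wrapping each value in a list is one way around that. A sketch with made-up data (the frame `df2` and the feature names `red`, `blue`, `green` are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical frame with multi-character feature names
df2 = pd.DataFrame({
    'ID':      [1, 1, 2, 3],
    'Feature': ['red', 'blue', 'red', 'green'],
})

mlb = MultiLabelBinarizer()
# Wrap each string in a list so 'red' is one label, not the characters r, e, d
encoded = mlb.fit_transform(df2['Feature'].map(lambda f: [f]))

wide = pd.DataFrame(encoded,
                    columns=['FEATURE_' + c for c in mlb.classes_],
                    index=df2['ID']).groupby(level=0).max()
print(wide)
```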
Timings:
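import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# 100,000 rows, 15 distinct features, up to 10,000 distinct IDs
np.random.seed(123)
N = 100000
L = list('abcdefghijklmno'.upper())
df = pd.DataFrame({'Feature': np.random.choice(L, N),
                   'ID': np.random.randint(10000, size=N)})

def jez(df):
    mlb = MultiLabelBinarizer()
    # Originally timed as .max(level=0); that form was removed in pandas 2.0
    return pd.DataFrame(mlb.fit_transform(df['Feature']),
                        columns=['FEATURE_' + x for x in mlb.classes_],
                        index=df.ID).groupby(level=0).max()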
#jez1
In [464]: %timeit (df[['ID']].join(pd.get_dummies(df['Feature']).add_prefix('FEATURE_')).groupby('ID').max())
10 loops, best of 3: 39.3 ms per loop
In [465]: %timeit (jez(df))
10 loops, best of 3: 138 ms per loop
#Scott Boston1
In [466]: %timeit (df.set_index('ID')['Feature'].str.get_dummies().add_prefix('FEATURE_').max(level=0))
1 loop, best of 3: 1.03 s per loop
#wen1
In [467]: %timeit (pd.crosstab(df.ID,df.Feature).gt(0).astype(int).add_prefix('FEATURE '))
1 loop, best of 3: 383 ms per loop
#wen2
In [468]: %timeit (pd.get_dummies(df.drop_duplicates().set_index('ID')).sum(level=0))
10 loops, best of 3: 47 ms per loop
Caveat
These results do not show how performance varies with the proportion of distinct Feature values to distinct ID values, which will affect timings a lot for some of these solutions.
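To see how that proportion affects a given approach, the benchmark can be rerun with different sizes. The sketch below uses the standard timeit module rather than IPython's %timeit. The helper names `make_frame`, `dummies_max` and `crosstab_flags`, and the sizes chosen, are my own assumptions. No timings are shown because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

def make_frame(n_rows, n_ids, n_features, seed=123):
    """Random ID/Feature frame with controllable cardinalities."""
    rng = np.random.RandomState(seed)
    labels = ['F%d' % i for i in range(n_features)]
    return pd.DataFrame({'Feature': rng.choice(labels, n_rows),
                         'ID': rng.randint(n_ids, size=n_rows)})

def dummies_max(df):
    """join + get_dummies + groupby max, as in the recommended answer."""
    d = pd.get_dummies(df['Feature'], dtype=int).add_prefix('FEATURE_')
    return df[['ID']].join(d).groupby('ID').max()

def crosstab_flags(df):
    """crosstab counts turned into 0/1 flags."""
    return (pd.crosstab(df['ID'], df['Feature'])
              .gt(0).astype(int).add_prefix('FEATURE_'))

# Vary the number of distinct IDs and features while keeping the row count fixed
for n_ids, n_features in [(100, 5), (10000, 15), (100, 200)]:
    frame = make_frame(20000, n_ids, n_features)
    for fn in (dummies_max, crosstab_flags):
        secs = min(timeit.repeat(lambda: fn(frame), number=1, repeat=3))
        print(n_ids, n_features, fn.__name__, round(secs, 4))
```

Taking the minimum over a few repeats mirrors the "best of 3" figure that %timeit reported in the session above.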