How to one-hot encode variable-length features?
Question
Given a list of variable-length features:
features = [
['f1', 'f2', 'f3'],
['f2', 'f4', 'f5', 'f6'],
['f1', 'f2']
]
where each sample has a variable number of features, the feature dtype is str, and the features are already one-hot.
In order to use the feature selection utilities of sklearn, I have to convert features to a 2D array that looks like:
    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0
How can I achieve this with sklearn or numpy?
Recommended answer
You can use MultiLabelBinarizer from scikit-learn, which is designed for exactly this kind of conversion.
Sample code:
features = [
['f1', 'f2', 'f3'],
['f2', 'f4', 'f5', 'f6'],
['f1', 'f2']
]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)
Output:
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])
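The output array carries no column labels, so it can help to check which feature each column stands for. A minimal sketch, assuming the same features list as above: the fitted binarizer exposes the column order through its classes_ attribute, and inverse_transform maps rows back to feature names. The variable names columns and recovered are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

# classes_ holds the label for each column, in sorted order
columns = list(mlb.classes_)

# inverse_transform turns each row back into a tuple of feature names
recovered = mlb.inverse_transform(new_features)
```

Here columns lines up with the header row of the 2D array in the question, and recovered should contain the original feature sets as tuples.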
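The resulting array can then be passed to sklearn's feature_selection utilities. Placing MultiLabelBinarizer directly inside a Pipeline may require a small wrapper, because it is designed for label targets and its fit and transform methods take a single argument.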
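As a rough illustration of the feature selection step, the binarized array can be handed to a selector. The sketch below uses VarianceThreshold, which is an assumption chosen for illustration (the original answer does not name a selector) because it needs no target labels. With its default threshold it drops columns that have the same value in every sample.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# Step 1: binarize the variable-length feature lists
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(features)

# Step 2: drop columns with zero variance (default threshold=0.0)
selector = VarianceThreshold()
X_selected = selector.fit_transform(X)

# Names of the columns that survived the selection
kept = list(mlb.classes_[selector.get_support()])
```

In this toy data f2 appears in every sample, so it is the only column removed. Supervised selectors such as SelectKBest would additionally need a target vector, which the question does not provide.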