Multi-Feature One-Hot-Encoder with a varying number of feature instances

Question

Let's assume we have data instances like this:

[
    [15, 20, ("banana","apple","cucumber"), ...],
    [91, 12, ("orange","banana"), ...],
    ...
]

I am wondering how I can encode the third element of these datapoints. For multiple feature values we could use sklearn's OneHotEncoder, but as far as I could find out, it cannot handle inputs of different lengths.

This is what I tried:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = [[15, 20, ("banana","apple","cucumber")], [91, 12, ("orange","banana")]]

ct = ColumnTransformer(
    [
        ("genre_encoder", OneHotEncoder(), [2])
    ],
    remainder='passthrough'
)
print(ct.fit_transform(X))

This only outputs

[[1.0 0.0 15 20]
 [0.0 1.0 91 12]]

as expected, because each whole tuple is treated as one of the possible values this feature can take.

We can't embed our features directly (like [15, 12, "banana", "apple", "cucumber"]), because

  1. we don't know how many instances of this feature we will have (two? three?)
  2. each position would be interpreted as its own feature, so if banana appeared in the first nominal slot of one datapoint and in the second nominal slot of another, the two occurrences would not count toward the same "pool of values" a feature can take

Example:

from sklearn.preprocessing import OneHotEncoder

X = [["banana","apple","cucumber"], ["orange","banana", "cucumber"]]
enc = OneHotEncoder()
print(enc.fit_transform(X).toarray())

[[1. 0. 1. 0. 1.]
 [0. 1. 0. 1. 1.]]

This representation contains 5 slots instead of 4, because the first slot is interpreted as using banana or orange, the second one as apple or banana and the last one only has the option cucumber.

(This would also not solve the problem of having different amounts of feature values per datapoint. And replacing empty ones with None does not solve the problem either, because then None faces this positional ambiguity.)

Any idea how to encode those "Multi-Multi" features, which can take multiple values and consist of a varying number of elements? Thank you in advance!

Recommended answer

I solved it for now by transforming it into a CountVectorizer problem, thanks to David Maspis's answer on the datascience stackexchange.
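The answer does not show the code, so the following is only a minimal sketch of how the CountVectorizer idea might look, not necessarily what the linked answer does. It assumes the multi-valued column has been pulled out into its own list (the name `fruits` is made up for illustration) and passes a callable `analyzer` so that each tuple is used directly as the list of tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: only the multi-valued column, one tuple per datapoint,
# with a varying number of entries per tuple.
fruits = [("banana", "apple", "cucumber"), ("orange", "banana")]

# A callable analyzer receives each raw "document" (here: a tuple) and
# returns its tokens, so no string joining or tokenizing is needed.
# binary=True records presence (0/1) instead of occurrence counts.
vectorizer = CountVectorizer(analyzer=lambda values: values, binary=True)
encoded = vectorizer.fit_transform(fruits).toarray()

# Column order follows the learned vocabulary indices.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocabulary)
print(encoded)
```

All values share one vocabulary regardless of their position in the tuple, so this yields one column per distinct value (4 here) and handles tuples of different lengths. The numeric columns would still have to be combined with this matrix separately.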
