pyspark OneHotEncoded vectors appear to be missing categories?


Problem description

Seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder) where it seems like the onehot vectors are missing some categories (or are maybe formatted oddly when displayed?).

After having now answered this question (or rather, having provided an answer), it appears that the details below are not totally relevant to understanding the problem.

Given a dataset of the form:

1. Wife's age                     (numerical)
2. Wife's education               (categorical)      1=low, 2, 3, 4=high
3. Husband's education            (categorical)      1=low, 2, 3, 4=high
4. Number of children ever born   (numerical)
5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
6. Wife's now working?            (binary)           0=Yes, 1=No
7. Husband's occupation           (categorical)      1, 2, 3, 4
8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
9. Media exposure                 (binary)           0=Good, 1=Not good
10. Contraceptive method used     (class attribute)  1=No-use, 2=Long-term, 3=Short-term  

The actual data looks like:

wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1

Source: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice

After doing some other preprocessing on the data, then trying to encode the categorical and binary (just for the sake of practice) features to 1hot vectors via...

from pyspark.ml.feature import OneHotEncoder

# encode each categorical/binary column into a one-hot vector column
for inds in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index',
             'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)

which produces a row looking like

Row(
    ...., 
    numeric_features=DenseVector([24.0, 3.0]), numeric_features_normalized=DenseVector([-1.0378, -0.1108]), 
    wife_edu_1hot=SparseVector(4, {2: 1.0}), 
    husband_edu_1hot=SparseVector(4, {3: 1.0}), 
    husband_occupation_1hot=SparseVector(4, {2: 1.0}), 
    SoL_index_1hot=SparseVector(4, {3: 1.0}), 
    wife_religion_1hot=SparseVector(1, {0: 1.0}),
    wife_working_1hot=SparseVector(1, {0: 1.0}),
    media_exposure_1hot=SparseVector(1, {0: 1.0}),
    contraceptive_1hot=SparseVector(2, {0: 1.0})
)
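For anyone who wants to poke at this, a self-contained toy version of the loop above might look like the following (a minimal sketch: the two-row DataFrame is made up from the sample data above, and it assumes the Spark 2.x API, where OneHotEncoder is a plain transformer rather than the estimator it later became in Spark 3):

from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()

# made-up miniature of the dataset: two sample rows, three columns
dataset = spark.createDataFrame(
    [(24.0, 2.0, 1.0), (45.0, 1.0, 1.0)],
    ['wife_age', 'wife_edu', 'wife_religion'])

for inds in ['wife_edu', 'wife_religion']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)

dataset.show(truncate=False)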

My understanding of the sparse vector format is that SparseVector(S, {i1: v1, i2: v2, ..., in: vn}) implies a vector of length S where all values are 0 except for the indices i1, ..., in, which have the corresponding values v1, ..., vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
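For example, that format can be spelled out directly in pyspark (a minimal illustration):

from pyspark.ml.linalg import SparseVector

v = SparseVector(4, {2: 1.0})
print(v.toArray())  # [0. 0. 1. 0.] -- a length-4 vector with a single 1.0 at index 2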

Based on this, it seems like the SparseVector in this case actually denotes the highest index in the vector (not the size). Furthermore, combining all the features (via pyspark's VectorAssembler) and checking the array version of the resulting dataset.head(n=1) vector shows

input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})

which indicates a vector looking like

indices:  0        1       2  3  4...           9        12             17 18 19 20 21
        [-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

I would think that it should be impossible to have a sequence of >= 3 consecutive 1s (as can be seen near the tail of the vector above), as this would indicate that one of the onehot vectors (e.g., the middle 1) is only of size 1, which would not make sense for any of the data features.

I'm very new to machine learning stuff, so I may be confused about some basic concepts here, but does anyone know what could be going on?

Answer

Found this in the pyspark docs (https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder):

...with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
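This also explains the size-1 vectors for the binary columns in the question: with two categories and the last one dropped, only one slot remains. A small sketch of the difference (column name reused from the question; Spark 2.x API assumed):

from pyspark.ml.feature import OneHotEncoder

# dropLast=True (the default): 2 categories -> size-1 vectors
#   0.0 -> SparseVector(1, {0: 1.0})   i.e. [1.0]
#   1.0 -> SparseVector(1, {})         i.e. [0.0]  (the dropped last category)
enc_default = OneHotEncoder(inputCol='wife_religion', outputCol='wife_religion_1hot')

# dropLast=False: keep every category -> size-2 vectors
#   0.0 -> SparseVector(2, {0: 1.0})   i.e. [1.0, 0.0]
#   1.0 -> SparseVector(2, {1: 1.0})   i.e. [0.0, 1.0]
enc_full = OneHotEncoder(inputCol='wife_religion', outputCol='wife_religion_1hot',
                         dropLast=False)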

More discussion about why this kind of last-category-dropping would be done can be found here (http://www.algosome.com/articles/dummy-variable-trap-regression.html) and here (https://stats.stackexchange.com/q/290526/167299).

I am pretty new to machine learning of any kind, but it seems that basically (for regression models) dropping the last categorical value is done to avoid something called the dummy variable trap, where "the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others" (so basically you'd have a redundant feature, which I assume is not good for weighting an ML model).
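To make the redundancy concrete, here is a tiny NumPy sketch (the toy matrix is made up for illustration): with a full one-hot encoding, the category columns always sum to 1, so alongside an intercept column the design matrix becomes rank-deficient.

import numpy as np

# full one-hot encoding of a 3-category feature for four samples (made-up data)
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

# prepend an intercept column of ones, as in a regression design matrix
A = np.hstack([np.ones((4, 1)), X])

# the one-hot columns sum to the intercept column, so one column is redundant
print(np.linalg.matrix_rank(A))  # 3, not 4 -> perfect multicollinearity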

E.g., you don't need a 1hot encoding of [isBoy, isGirl, unspecified] when an encoding of [isBoy, isGirl] would communicate the same information about someone's gender: here [1,0]=isBoy, [0,1]=isGirl, and [0,0]=unspecified.

This link (http://www.algosome.com/articles/dummy-variable-trap-regression.html) provides a good example, with the conclusion being

The solution to the dummy variable trap is to drop one of the categorical variables (or alternatively, drop the intercept constant) - if there are m number of categories, use m-1 in the model, the value left out can be thought of as the reference value and the fit values of the remaining categories represent the change from this reference.

** Note: In looking for an answer to the original question, I found this similar SO post (Why does Spark's OneHotEncoder drop the last category by default?). Still, I think this current post warrants existing: the mentioned post is about why this behavior happens, while this post is about being confused as to what was going on in the first place, and the current question title does not surface the mentioned post when pasted into Google.
