pyspark OneHotEncoded vectors appear to be missing categories?


Question


Seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder) where it seems like the onehot vectors are missing some categories (or are maybe formatted oddly when displayed?).

(Note: having now answered this question, it appears that the details below are not totally relevant to understanding the problem.)

The dataset has the following format:

1. Wife's age                     (numerical)
2. Wife's education               (categorical)      1=low, 2, 3, 4=high
3. Husband's education            (categorical)      1=low, 2, 3, 4=high
4. Number of children ever born   (numerical)
5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
6. Wife's now working?            (binary)           0=Yes, 1=No
7. Husband's occupation           (categorical)      1, 2, 3, 4
8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
9. Media exposure                 (binary)           0=Good, 1=Not good
10. Contraceptive method used     (class attribute)  1=No-use, 2=Long-term, 3=Short-term  

The actual data looks something like:

wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1

Obtained from here: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.


After doing some other preprocessing on the data, then trying to encode the categorical and binary (just for the sake of practice) features to 1hot vectors via...

from pyspark.ml.feature import OneHotEncoder

for inds in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index',
             'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)

The resulting rows look like:

Row(
    ...., 
    numeric_features=DenseVector([24.0, 3.0]), numeric_features_normalized=DenseVector([-1.0378, -0.1108]), 
    wife_edu_1hot=SparseVector(4, {2: 1.0}), 
    husband_edu_1hot=SparseVector(4, {3: 1.0}), 
    husband_occupation_1hot=SparseVector(4, {2: 1.0}), 
    SoL_index_1hot=SparseVector(4, {3: 1.0}), 
    wife_religion_1hot=SparseVector(1, {0: 1.0}),
    wife_working_1hot=SparseVector(1, {0: 1.0}),
    media_exposure_1hot=SparseVector(1, {0: 1.0}),
    contraceptive_1hot=SparseVector(2, {0: 1.0})
)


My understanding of sparse vector format is that SparseVector(S, {i1: v1, i2: v2, ..., in: vn}) implies a vector of length S where all values are 0 except for indices i1,...,in, which have corresponding values v1,...,vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
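That interpretation can be sketched in plain Python (to_dense is a hypothetical helper written here for illustration, not pyspark's own toArray()):

```python
def to_dense(size, entries):
    """Expand a sparse {index: value} mapping into a dense list of length `size`."""
    vec = [0.0] * size
    for i, v in entries.items():
        vec[i] = v
    return vec

# wife_edu_1hot = SparseVector(4, {2: 1.0}) from the output above:
print(to_dense(4, {2: 1.0}))  # [0.0, 0.0, 1.0, 0.0]
```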


Based on this, it seems like the SparseVector in this case actually denotes the highest index in the vector (not the size). Furthermore, combining all the features (via pyspark's VectorAssembler) and checking the array version of the resulting dataset.head(n=1) vector shows

input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})

indicates a vector looking like

indices:  0        1       2  3  4 ...          9           12             17 18 19 20 21
        [-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
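Expanding the sparse entries by hand confirms the run of 1s near the tail (this is just a plain-Python sanity check of the indices shown above, not pyspark's own toArray()):

```python
# Entries taken from input_features = SparseVector(23, {...}) above
entries = {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0,
           17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0}
dense = [entries.get(i, 0.0) for i in range(23)]
print(dense[17:22])  # [1.0, 1.0, 1.0, 1.0, 1.0] - five consecutive 1s
```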


I would think that it should be impossible to have a sequence of >= 3 consecutive 1s (as can be seen near the tail of the vector above), as this would indicate that one of the onehot vectors (e.g. the middle 1) is only of size 1, which would not make sense for any of the data features.


Very new to machine learning stuff, so may be confused about some basic concepts here, but does anyone know what could be going on here?

Answer


Found this in the pyspark docs (https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder):


...with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
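The documented mapping can be sketched in plain Python (one_hot is a hypothetical helper written here for illustration, not part of the pyspark API):

```python
def one_hot(value, num_categories, drop_last=True):
    """Mimic OneHotEncoder's documented mapping: category index -> 0/1 vector."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(value) < size:  # with drop_last, the last category maps to all zeros
        vec[int(value)] = 1.0
    return vec

print(one_hot(2.0, 5))  # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))  # [0.0, 0.0, 0.0, 0.0]
```

This is also why the binary features above produce SparseVector(1, ...): two categories minus the dropped last one leaves a vector of size 1.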


More discussion about why this kind of last-category-dropping would be done can be found here (http://www.algosome.com/articles/dummy-variable-trap-regression.html) and here (https://stats.stackexchange.com/q/290526/167299).

I am pretty new to machine learning of any kind, but it seems that basically (for regression models) dropping the last categorical value is done to avoid something called the dummy variable trap, where "the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others" - so basically you'd have a redundant feature (which I assume is not good for weighting an ML model).

E.g. you don't need a 1hot encoding of [isBoy, isGirl, unspecified] when an encoding of [isBoy, isGirl] would communicate the same information about someone's gender: here [1,0]=isBoy, [0,1]=isGirl, and [0,0]=unspecified.


This link (http://www.algosome.com/articles/dummy-variable-trap-regression.html) provides a good example, with the conclusion being


The solution to the dummy variable trap is to drop one of the categorical variables (or alternatively, drop the intercept constant) - if there are m number of categories, use m-1 in the model, the value left out can be thought of as the reference value and the fit values of the remaining categories represent the change from this reference.

** Note: in looking for an answer to the original question, I found this similar SO post (Why does Spark's OneHotEncoder drop the last category by default?). Yet, I think this current post warrants existing, since the mentioned post is about why this behavior happens, while this post is about being confused as to what was going on in the first place - and because pasting this question's title into Google does not surface the mentioned post.

