pyspark OneHotEncoded vectors appear to be missing categories?


Problem description

Seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder) where it seems like the onehot vectors are missing some categories (or are maybe formatted oddly when displayed?).

After having now answered this question (or rather, having provided an answer), it appears that the details below are not totally relevant to understanding the problem.

Given a dataset of the form:

1. Wife's age                     (numerical)
2. Wife's education               (categorical)      1=low, 2, 3, 4=high
3. Husband's education            (categorical)      1=low, 2, 3, 4=high
4. Number of children ever born   (numerical)
5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
6. Wife's now working?            (binary)           0=Yes, 1=No
7. Husband's occupation           (categorical)      1, 2, 3, 4
8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
9. Media exposure                 (binary)           0=Good, 1=Not good
10. Contraceptive method used     (class attribute)  1=No-use, 2=Long-term, 3=Short-term  

The actual data looks like:

wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1

Source: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice

After doing some other preprocessing on the data, then trying to encode the categorical and binary (just for the sake of practice) features to 1hot vectors via...

from pyspark.ml.feature import OneHotEncoder

# encode each categorical/binary column into a one-hot vector column
for inds in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index',
             'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)

which produces a row looking like

Row(
    ...., 
    numeric_features=DenseVector([24.0, 3.0]), numeric_features_normalized=DenseVector([-1.0378, -0.1108]), 
    wife_edu_1hot=SparseVector(4, {2: 1.0}), 
    husband_edu_1hot=SparseVector(4, {3: 1.0}), 
    husband_occupation_1hot=SparseVector(4, {2: 1.0}), 
    SoL_index_1hot=SparseVector(4, {3: 1.0}), 
    wife_religion_1hot=SparseVector(1, {0: 1.0}),
    wife_working_1hot=SparseVector(1, {0: 1.0}),
    media_exposure_1hot=SparseVector(1, {0: 1.0}),
    contraceptive_1hot=SparseVector(2, {0: 1.0})
)
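For anyone who wants to poke at this, a self-contained toy version of the loop above might look like the following (a minimal sketch: the two-row DataFrame is made up from the sample data above, and it assumes the Spark 2.x API, where OneHotEncoder is a plain transformer rather than the estimator it later became in Spark 3):

from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()

# made-up miniature of the dataset: two sample rows, three columns
dataset = spark.createDataFrame(
    [(24.0, 2.0, 1.0), (45.0, 1.0, 1.0)],
    ['wife_age', 'wife_edu', 'wife_religion'])

for inds in ['wife_edu', 'wife_religion']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)

dataset.show(truncate=False)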

My understanding of the sparse vector format is that SparseVector(S, {i1: v1, i2: v2, ..., in: vn}) implies a vector of length S where all values are 0 except for the indices i1, ..., in, which have the corresponding values v1, ..., vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
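For example, that format can be spelled out directly in pyspark (a minimal illustration):

from pyspark.ml.linalg import SparseVector

v = SparseVector(4, {2: 1.0})
print(v.toArray())  # [0. 0. 1. 0.] -- a length-4 vector with a single 1.0 at index 2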

Based on this, it seems like the SparseVector in this case actually denotes the highest index in the vector (not the size). Furthermore, combining all the features (via pyspark's VectorAssembler) and checking the array version of the resulting dataset.head(n=1) vector shows

input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})

which indicates a vector looking like

indices:  0        1       2  3  4...           9        12             17 18 19 20 21
        [-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

I would think that it should be impossible to have a sequence of >= 3 consecutive 1s (as can be seen near the tail of the vector above), as this would indicate that one of the onehot vectors (e.g., the middle 1) is only of size 1, which would not make sense for any of the data features.

I'm very new to machine learning stuff, so I may be confused about some basic concepts here, but does anyone know what could be going on?

Answer

Found this in the pyspark docs (https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder):

...with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
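This also explains the size-1 vectors for the binary columns in the question: with two categories and the last one dropped, only one slot remains. A small sketch of the difference (column name reused from the question; Spark 2.x API assumed):

from pyspark.ml.feature import OneHotEncoder

# dropLast=True (the default): 2 categories -> size-1 vectors
#   0.0 -> SparseVector(1, {0: 1.0})   i.e. [1.0]
#   1.0 -> SparseVector(1, {})         i.e. [0.0]  (the dropped last category)
enc_default = OneHotEncoder(inputCol='wife_religion', outputCol='wife_religion_1hot')

# dropLast=False: keep every category -> size-2 vectors
#   0.0 -> SparseVector(2, {0: 1.0})   i.e. [1.0, 0.0]
#   1.0 -> SparseVector(2, {1: 1.0})   i.e. [0.0, 1.0]
enc_full = OneHotEncoder(inputCol='wife_religion', outputCol='wife_religion_1hot',
                         dropLast=False)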

More discussion about why this kind of last-category-dropping would be done can be found here (http://www.algosome.com/articles/dummy-variable-trap-regression.html) and here (https://stats.stackexchange.com/q/290526/167299).

I am pretty new to machine learning of any kind, but it seems that basically (for regression models) dropping the last categorical value is done to avoid something called the dummy variable trap, where "the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others" (so basically you'd have a redundant feature, which I assume is not good for weighting an ML model).
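To make the redundancy concrete, here is a tiny NumPy sketch (the toy matrix is made up for illustration): with a full one-hot encoding, the category columns always sum to 1, so alongside an intercept column the design matrix becomes rank-deficient.

import numpy as np

# full one-hot encoding of a 3-category feature for four samples (made-up data)
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

# prepend an intercept column of ones, as in a regression design matrix
A = np.hstack([np.ones((4, 1)), X])

# the one-hot columns sum to the intercept column, so one column is redundant
print(np.linalg.matrix_rank(A))  # 3, not 4 -> perfect multicollinearity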

E.g., you don't need a 1hot encoding of [isBoy, isGirl, unspecified] when an encoding of [isBoy, isGirl] would communicate the same information about someone's gender: here [1,0]=isBoy, [0,1]=isGirl, and [0,0]=unspecified.

This link (http://www.algosome.com/articles/dummy-variable-trap-regression.html) provides a good example, with the conclusion being

The solution to the dummy variable trap is to drop one of the categorical variables (or alternatively, drop the intercept constant) - if there are m number of categories, use m-1 in the model, the value left out can be thought of as the reference value and the fit values of the remaining categories represent the change from this reference.

** Note: In looking for an answer to the original question, I found this similar SO post (Why does Spark's OneHotEncoder drop the last category by default?). Still, I think this current post warrants existing: the mentioned post is about why this behavior happens, while this post is about being confused as to what was going on in the first place, and the current question title does not surface the mentioned post when pasted into Google.
