pyspark OneHotEncoded vectors appear to be missing categories?


Question


Seeing a weird problem when trying to generate one-hot encoded vectors for categorical features using pyspark's OneHotEncoder (https://spark.apache.org/docs/2.1.0/ml-features.html#onehotencoder) where it seems like the onehot vectors are missing some categories (or are maybe formatted oddly when displayed?).

(Note: having now answered this question, it appears that the details below are not totally relevant to understanding the problem.)

The dataset has the following format:

1. Wife's age                     (numerical)
2. Wife's education               (categorical)      1=low, 2, 3, 4=high
3. Husband's education            (categorical)      1=low, 2, 3, 4=high
4. Number of children ever born   (numerical)
5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
6. Wife's now working?            (binary)           0=Yes, 1=No
7. Husband's occupation           (categorical)      1, 2, 3, 4
8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
9. Media exposure                 (binary)           0=Good, 1=Not good
10. Contraceptive method used     (class attribute)  1=No-use, 2=Long-term, 3=Short-term  

The actual data looks something like:

wife_age,wife_edu,husband_edu,num_children,wife_religion,wife_working,husband_occupation,SoL_index,media_exposure,contraceptive
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1

Obtained from here: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice.


After doing some other preprocessing on the data, then trying to encode the categorical and binary (just for the sake of practice) features to 1hot vectors via...

from pyspark.ml.feature import OneHotEncoder

for inds in ['wife_edu', 'husband_edu', 'husband_occupation', 'SoL_index',
             'wife_religion', 'wife_working', 'media_exposure', 'contraceptive']:
    encoder = OneHotEncoder(inputCol=inds, outputCol='%s_1hot' % inds)
    dataset = encoder.transform(dataset)

The resulting rows look like:

Row(
    ...., 
    numeric_features=DenseVector([24.0, 3.0]), numeric_features_normalized=DenseVector([-1.0378, -0.1108]), 
    wife_edu_1hot=SparseVector(4, {2: 1.0}), 
    husband_edu_1hot=SparseVector(4, {3: 1.0}), 
    husband_occupation_1hot=SparseVector(4, {2: 1.0}), 
    SoL_index_1hot=SparseVector(4, {3: 1.0}), 
    wife_religion_1hot=SparseVector(1, {0: 1.0}),
    wife_working_1hot=SparseVector(1, {0: 1.0}),
    media_exposure_1hot=SparseVector(1, {0: 1.0}),
    contraceptive_1hot=SparseVector(2, {0: 1.0})
)


My understanding of sparse vector format is that SparseVector(S, {i1: v1, i2: v2, ..., in: vn}) implies a vector of length S where all values are 0 except for indices i1,...,in, which have corresponding values v1,...,vn (https://www.cs.umd.edu/Outreach/hsContest99/questions/node3.html).
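That interpretation can be sketched in plain Python (to_dense is a hypothetical helper written here for illustration, not pyspark's own toArray()):

```python
def to_dense(size, entries):
    """Expand a sparse {index: value} mapping into a dense list of length `size`."""
    vec = [0.0] * size
    for i, v in entries.items():
        vec[i] = v
    return vec

# wife_edu_1hot = SparseVector(4, {2: 1.0}) from the output above:
print(to_dense(4, {2: 1.0}))  # [0.0, 0.0, 1.0, 0.0]
```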


Based on this, it seems like the SparseVector in this case actually denotes the highest index in the vector (not the size). Furthermore, combining all the features (via pyspark's VectorAssembler) and checking the array version of the resulting dataset.head(n=1) vector shows

input_features=SparseVector(23, {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0, 17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0})

indicates a vector looking like

indices:  0        1       2  3  4 ...          9           12             17 18 19 20 21
        [-1.0378, -0.1108, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
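Expanding the sparse entries by hand confirms the run of 1s near the tail (this is just a plain-Python sanity check of the indices shown above, not pyspark's own toArray()):

```python
# Entries taken from input_features = SparseVector(23, {...}) above
entries = {0: -1.0378, 1: -0.1108, 4: 1.0, 9: 1.0, 12: 1.0,
           17: 1.0, 18: 1.0, 19: 1.0, 20: 1.0, 21: 1.0}
dense = [entries.get(i, 0.0) for i in range(23)]
print(dense[17:22])  # [1.0, 1.0, 1.0, 1.0, 1.0] - five consecutive 1s
```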


I would think that it should be impossible to have a sequence of >= 3 consecutive 1s (as can be seen near the tail of the vector above), as this would indicate that one of the onehot vectors (e.g. the middle 1) is only of size 1, which would not make sense for any of the data features.


Very new to machine learning stuff, so may be confused about some basic concepts here, but does anyone know what could be going on here?

Answer


Found this in the pyspark docs (https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder):


...with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
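The documented mapping can be sketched in plain Python (one_hot is a hypothetical helper written here for illustration, not part of the pyspark API):

```python
def one_hot(value, num_categories, drop_last=True):
    """Mimic OneHotEncoder's documented mapping: category index -> 0/1 vector."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(value) < size:  # with drop_last, the last category maps to all zeros
        vec[int(value)] = 1.0
    return vec

print(one_hot(2.0, 5))  # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))  # [0.0, 0.0, 0.0, 0.0]
```

This is also why the binary features above produce SparseVector(1, ...): two categories minus the dropped last one leaves a vector of size 1.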


More discussion about why this kind of last-category-dropping would be done can be found here (http://www.algosome.com/articles/dummy-variable-trap-regression.html) and here (https://stats.stackexchange.com/q/290526/167299).

I am pretty new to machine learning of any kind, but it seems that basically (for regression models) dropping the last categorical value is done to avoid something called the dummy variable trap, where "the independent variables are multicollinear - a scenario in which two or more variables are highly correlated; in simple terms one variable can be predicted from the others" - so basically you'd have a redundant feature (which I assume is not good for weighting an ML model).

E.g. you don't need a 1hot encoding of [isBoy, isGirl, unspecified] when an encoding of [isBoy, isGirl] would communicate the same information about someone's gender: here [1,0]=isBoy, [0,1]=isGirl, and [0,0]=unspecified.


This link (http://www.algosome.com/articles/dummy-variable-trap-regression.html) provides a good example, with the conclusion being


The solution to the dummy variable trap is to drop one of the categorical variables (or alternatively, drop the intercept constant) - if there are m number of categories, use m-1 in the model, the value left out can be thought of as the reference value and the fit values of the remaining categories represent the change from this reference.

** Note: in looking for an answer to the original question, I found this similar SO post (Why does Spark's OneHotEncoder drop the last category by default?). Yet, I think this current post warrants existing, since the mentioned post is about why this behavior happens, while this post is about being confused as to what was going on in the first place - and because pasting this question's title into Google does not surface the mentioned post.

