PySpark: Output of OneHotEncoder looks odd


Problem description

The Spark documentation contains a PySpark example for its OneHotEncoder:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

I was expecting the column categoryVec to look like this:

[0.0, 0.0]
[1.0, 0.0]
[0.0, 1.0]
[0.0, 0.0]
[0.0, 0.0]
[0.0, 1.0]

But categoryVec actually looks like this:

(2, [0], [1.0])
(2, [], [])
(2, [1], [1.0])
(2, [0], [1.0])
(2, [0], [1.0])
(2, [1], [1.0])

What does this mean? How should I read this output, and what is the reasoning behind this somewhat odd format?

Answer

Nothing odd here. These are just SparseVectors where:

  • The first element is the size of the vector.
  • The first array [...] is the list of indices with non-zero values.
  • The second array is the list of values at those indices.

Values at indices that are not explicitly listed are 0.0.
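
If it helps, here is a minimal sketch, using the standard pyspark.ml.linalg.SparseVector constructor, that expands a couple of these entries into the dense form you were expecting:

from pyspark.ml.linalg import SparseVector

# (2, [1], [1.0]) means: a vector of size 2 with value 1.0 at index 1,
# and 0.0 everywhere else.
v = SparseVector(2, [1], [1.0])
print(v.toArray())                        # [0. 1.]

# (2, [], []) has no non-zero entries, i.e. it is the all-zero vector of size 2.
print(SparseVector(2, [], []).toArray())  # [0. 0.]

The sparse representation is used because one-hot encoded vectors contain at most a single non-zero value, so storing only the size, indices, and values is far more compact than a dense array.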
