PySpark: Output of OneHotEncoder looks odd


Problem description

The Spark documentation contains a PySpark example for its OneHotEncoder:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

I was expecting the column categoryVec to look like this:

[0.0, 0.0]
[1.0, 0.0]
[0.0, 1.0]
[0.0, 0.0]
[0.0, 0.0]
[0.0, 1.0]

But categoryVec actually looks like this:

(2, [0], [1.0])
(2, [], [])
(2, [1], [1.0])
(2, [0], [1.0])
(2, [0], [1.0])
(2, [1], [1.0])

What does this mean? How should I read this output, and what is the reasoning behind this somewhat odd format?

Answer

Nothing odd here. These are just SparseVectors where:

  • The first element is the size of the vector.
  • The first array [...] is the list of indices with non-zero values.
  • The second array is the list of values at those indices.

Values at indices that are not explicitly listed are 0.0.
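
If it helps, here is a minimal sketch, using the standard pyspark.ml.linalg.SparseVector constructor, that expands a couple of these entries into the dense form you were expecting:

from pyspark.ml.linalg import SparseVector

# (2, [1], [1.0]) means: a vector of size 2 with value 1.0 at index 1,
# and 0.0 everywhere else.
v = SparseVector(2, [1], [1.0])
print(v.toArray())                        # [0. 1.]

# (2, [], []) has no non-zero entries, i.e. it is the all-zero vector of size 2.
print(SparseVector(2, [], []).toArray())  # [0. 0.]

The sparse representation is used because one-hot encoded vectors contain at most a single non-zero value, so storing only the size, indices, and values is far more compact than a dense array.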
