PySpark: Output of OneHotEncoder looks odd
Question
The Spark documentation contains a PySpark example for its OneHotEncoder:
from pyspark.ml.feature import OneHotEncoder, StringIndexer
df = spark.createDataFrame([
(0, "a"),
(1, "b"),
(2, "c"),
(3, "a"),
(4, "a"),
(5, "c")
], ["id", "category"])
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()
I was expecting the column categoryVec to look like this:
[0.0, 0.0]
[1.0, 0.0]
[0.0, 1.0]
[0.0, 0.0]
[0.0, 0.0]
[0.0, 1.0]
But categoryVec actually looks like this:
(2, [0], [1.0])
(2, [], [])
(2, [1], [1.0])
(2, [0], [1.0])
(2, [0], [1.0])
(2, [1], [1.0])
What does this mean? How should I read this output, and what is the reasoning behind this somewhat odd format?
Answer
Nothing odd here. These are just SparseVectors, where:
- The first element is the size of the vector.
- The first array is a list of indices.
- The second array is a list of values.
Indices not explicitly listed are 0.0.
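As a sanity check, each sparse triple from the output can be expanded into the dense list it represents without needing Spark at all. This is a minimal sketch; the `sparse_to_dense` helper is a hypothetical illustration, not part of the PySpark API:

```python
# A sparse vector is stored as (size, indices, values);
# any position not listed in `indices` is implicitly 0.0.
def sparse_to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Decode the three distinct rows from the question's output.
# StringIndexer assigns indices by descending frequency (a=0, c=1, b=2),
# and OneHotEncoder's default dropLast=True drops the last index,
# which is why "b" encodes as an all-zero vector of size 2.
print(sparse_to_dense(2, [0], [1.0]))  # "a" -> [1.0, 0.0]
print(sparse_to_dense(2, [], []))      # "b" -> [0.0, 0.0]
print(sparse_to_dense(2, [1], [1.0]))  # "c" -> [0.0, 1.0]
```

The sparse layout pays off when the category column has many distinct values: a one-hot vector of size 10,000 still stores only one index and one value per row.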