Spark: Use same OneHotEncoder on multiple dataframes
Question
I have two DataFrames with the same columns and I want to convert a categorical column into a vector using One-Hot-Encoding. The problem is that, for example, 3 unique values may occur in the training set while the test set may have fewer than that.
Training Set: Test Set:
+------------+ +------------+
| Type | | Type |
+------------+ +------------+
| 0 | | 0 |
| 1 | | 1 |
| 1 | | 1 |
| 3 | | 1 |
+------------+ +------------+
In this case the OneHotEncoder creates vectors of different lengths on the training and test sets (since each element of the vector represents the presence of a unique value).
Is it possible to use the same OneHotEncoder on multiple DataFrames? There is no fit function, so I don't know how I could do that.
Thanks.
Answer
OneHotEncoder is not intended to be used alone. Instead, it should be part of a Pipeline, where it can leverage column metadata. Consider the following example:
training = sc.parallelize([(0., ), (1., ), (1., ), (3., )]).toDF(["type"])
testing = sc.parallelize([(0., ), (1., ), (1., ), (1., )]).toDF(["type"])
When you use the encoder directly, it has no knowledge of the context:
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder().setOutputCol("encoded").setDropLast(False)
encoder.setInputCol("type").transform(training).show()
## +----+-------------+
## |type| encoded|
## +----+-------------+
## | 0.0|(4,[0],[1.0])|
## | 1.0|(4,[1],[1.0])|
## | 1.0|(4,[1],[1.0])|
## | 3.0|(4,[3],[1.0])|
## +----+-------------+
encoder.setInputCol("type").transform(testing).show()
## +----+-------------+
## |type| encoded|
## +----+-------------+
## | 0.0|(2,[0],[1.0])|
## | 1.0|(2,[1],[1.0])|
## | 1.0|(2,[1],[1.0])|
## | 1.0|(2,[1],[1.0])|
## +----+-------------+
Now let's add the required metadata, for example by using a StringIndexer:
from pyspark.ml.feature import StringIndexer

indexer = (StringIndexer()
    .setInputCol("type")
    .setOutputCol("type_idx")
    .fit(training))
If you apply the encoder to the indexed column, you'll get a consistent encoding on both data sets:
(encoder.setInputCol("type_idx")
    .transform(indexer.transform(training))
    .show())
## +----+--------+-------------+
## |type|type_idx| encoded|
## +----+--------+-------------+
## | 0.0| 1.0|(3,[1],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## | 3.0| 2.0|(3,[2],[1.0])|
## +----+--------+-------------+
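To see what the encoder relies on here, you can inspect the column metadata that StringIndexer attaches to its output column (a quick sanity check, not part of the original answer; the exact dictionary contents may vary across Spark versions):
# Look at the ML attribute metadata on the indexed column;
# OneHotEncoder reads this to determine the number of categories.
indexed = indexer.transform(training)
type_idx_field = [f for f in indexed.schema.fields if f.name == "type_idx"][0]
print(type_idx_field.metadata)
## expect an {"ml_attr": ...} entry listing the nominal values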
(encoder.setInputCol("type_idx")
    .transform(indexer.transform(testing))
    .show())
## +----+--------+-------------+
## |type|type_idx| encoded|
## +----+--------+-------------+
## | 0.0| 1.0|(3,[1],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## +----+--------+-------------+
Note that the labels you get this way don't reflect the values in the input data. If a consistent encoding is a hard requirement, you should provide the schema manually:
from pyspark.sql.types import StructType, StructField, DoubleType
meta = {"ml_attr": {
    "name": "type",
    "type": "nominal",
    "vals": ["0.0", "1.0", "3.0"]
}}
schema = StructType([StructField("type", DoubleType(), False, meta)])
training = sc.parallelize([(0., ), (1., ), (1., ), (3., )]).toDF(schema)
testing = sc.parallelize([(0., ), (1., ), (1., ), (1., )]).toDF(schema)
assert (
    encoder.setInputCol("type").transform(training).first()[-1].size ==
    encoder.setInputCol("type").transform(testing).first()[-1].size
)
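As noted above, the idiomatic way to keep the indexing and encoding steps together is a Pipeline. Here is a minimal sketch under the same assumptions as the code above (the Transformer-style OneHotEncoder; in recent Spark versions the encoder is itself an Estimator with a fit method):
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="type", outputCol="type_idx"),
    OneHotEncoder(inputCol="type_idx", outputCol="encoded", dropLast=False)
])

# Fit once on the training data; the fitted model carries the
# label-to-index mapping, so both data sets get vectors of the same size.
model = pipeline.fit(training)
model.transform(training).show()
model.transform(testing).show()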