Spark: Use same OneHotEncoder on multiple dataframes
Question
I have two DataFrames with the same columns and I want to convert a categorical column into a vector using One-Hot-Encoding. The problem is that, for example, 3 unique values may occur in the training set while the test set may have fewer than that.
Training Set: Test Set:
+------------+ +------------+
| Type | | Type |
+------------+ +------------+
| 0 | | 0 |
| 1 | | 1 |
| 1 | | 1 |
| 3 | | 1 |
+------------+ +------------+
In this case the OneHotEncoder creates vectors of different lengths on the training and test sets (since each element of the vector represents the presence of a unique value).
Is it possible to use the same OneHotEncoder on multiple DataFrames? There is no fit function, so I don't know how I could do that.
Thanks.
Answer
OneHotEncoder is not intended to be used alone. Instead, it should be part of a Pipeline, where it can leverage column metadata. Consider the following example:
training = sc.parallelize([(0., ), (1., ), (1., ), (3., )]).toDF(["type"])
testing = sc.parallelize([(0., ), (1., ), (1., ), (1., )]).toDF(["type"])
When you use the encoder directly, it has no knowledge of the context:
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder().setOutputCol("encoded").setDropLast(False)
encoder.setInputCol("type").transform(training).show()
## +----+-------------+
## |type| encoded|
## +----+-------------+
## | 0.0|(4,[0],[1.0])|
## | 1.0|(4,[1],[1.0])|
## | 1.0|(4,[1],[1.0])|
## | 3.0|(4,[3],[1.0])|
## +----+-------------+
encoder.setInputCol("type").transform(testing).show()
## +----+-------------+
## |type| encoded|
## +----+-------------+
## | 0.0|(2,[0],[1.0])|
## | 1.0|(2,[1],[1.0])|
## | 1.0|(2,[1],[1.0])|
## | 1.0|(2,[1],[1.0])|
## +----+-------------+
Now let's add the required metadata, for example by using a StringIndexer:
from pyspark.ml.feature import StringIndexer

indexer = (StringIndexer()
    .setInputCol("type")
    .setOutputCol("type_idx")
    .fit(training))
If you apply the encoder to the indexed column, you'll get a consistent encoding on both data sets:
(encoder.setInputCol("type_idx")
    .transform(indexer.transform(training))
    .show())
## +----+--------+-------------+
## |type|type_idx| encoded|
## +----+--------+-------------+
## | 0.0| 1.0|(3,[1],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## | 3.0| 2.0|(3,[2],[1.0])|
## +----+--------+-------------+
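To see what the encoder relies on here, you can inspect the column metadata that StringIndexer attaches to its output column (a quick sanity check, not part of the original answer; the exact dictionary contents may vary across Spark versions):
# Look at the ML attribute metadata on the indexed column;
# OneHotEncoder reads this to determine the number of categories.
indexed = indexer.transform(training)
type_idx_field = [f for f in indexed.schema.fields if f.name == "type_idx"][0]
print(type_idx_field.metadata)
## expect an {"ml_attr": ...} entry listing the nominal values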
(encoder.setInputCol("type_idx")
    .transform(indexer.transform(testing))
    .show())
## +----+--------+-------------+
## |type|type_idx| encoded|
## +----+--------+-------------+
## | 0.0| 1.0|(3,[1],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## | 1.0| 0.0|(3,[0],[1.0])|
## +----+--------+-------------+
Note that the labels you get this way don't reflect the values in the input data. If a consistent encoding is a hard requirement, you should provide the schema manually:
from pyspark.sql.types import StructType, StructField, DoubleType
meta = {"ml_attr": {
    "name": "type",
    "type": "nominal",
    "vals": ["0.0", "1.0", "3.0"]
}}
schema = StructType([StructField("type", DoubleType(), False, meta)])
training = sc.parallelize([(0., ), (1., ), (1., ), (3., )]).toDF(schema)
testing = sc.parallelize([(0., ), (1., ), (1., ), (1., )]).toDF(schema)
assert (
    encoder.setInputCol("type").transform(training).first()[-1].size ==
    encoder.setInputCol("type").transform(testing).first()[-1].size
)
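As noted above, the idiomatic way to keep the indexing and encoding steps together is a Pipeline. Here is a minimal sketch under the same assumptions as the code above (the Transformer-style OneHotEncoder; in recent Spark versions the encoder is itself an Estimator with a fit method):
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="type", outputCol="type_idx"),
    OneHotEncoder(inputCol="type_idx", outputCol="encoded", dropLast=False)
])

# Fit once on the training data; the fitted model carries the
# label-to-index mapping, so both data sets get vectors of the same size.
model = pipeline.fit(training)
model.transform(training).show()
model.transform(testing).show()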