Why does Spark's OneHotEncoder drop the last category by default?


Question

I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default.

For example:

>>> from pyspark.ml.feature import StringIndexer, OneHotEncoder
>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer(inputCol="c", outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+

By default, the OneHotEncoder will drop the last category:

>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(2,[0],[1.0])|
| 1.5|  a|  0.0|(2,[0],[1.0])|
|10.0|  b|  1.0|(2,[1],[1.0])|
| 3.2|  c|  2.0|    (2,[],[])|
+----+---+-----+-------------+

Of course, this behavior can be changed:

>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(3,[0],[1.0])|
| 1.5|  a|  0.0|(3,[0],[1.0])|
|10.0|  b|  1.0|(3,[1],[1.0])|
| 3.2|  c|  2.0|(3,[2],[1.0])|
+----+---+-----+-------------+

Questions:

  • In what case is the default behavior desirable?
  • What issues might be overlooked by blindly calling setDropLast(False)?
  • What do the authors mean by the following statement in the documentation?

The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent.
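The linear dependence the documentation alludes to is the classic "dummy variable trap". A minimal NumPy sketch (my own illustration, not from the original post) shows it for the three categories a, b, c above: with all categories kept, the one-hot columns sum to one in every row, so adding an intercept column makes the design matrix rank-deficient.

```python
import numpy as np

# Full one-hot encoding (dropLast=False) of the rows a, a, b, c.
full = np.array([
    [1.0, 0.0, 0.0],  # a
    [1.0, 0.0, 0.0],  # a
    [0.0, 1.0, 0.0],  # b
    [0.0, 0.0, 1.0],  # c
])

# Prepend an intercept column of ones, as a linear model would.
with_intercept = np.hstack([np.ones((4, 1)), full])

# The intercept equals the sum of the three category columns, so the
# matrix has 4 columns but only rank 3.
print(np.linalg.matrix_rank(with_intercept))  # 3

# Dropping the last category removes the dependence: 3 columns, rank 3.
dropped = np.hstack([np.ones((4, 1)), full[:, :2]])
print(np.linalg.matrix_rank(dropped))  # 3
```

This is why the default matters for models with an intercept (e.g. linear or logistic regression): a rank-deficient design matrix has no unique least-squares solution.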

Answer

According to the documentation, it is to keep the columns linearly independent:

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]. Note that this is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.

https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/ml/feature/OneHotEncoder.html
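The mapping the docs describe can be sketched in plain Python (a hypothetical illustration of the behavior, not Spark's actual implementation): with drop_last=True the output vector has one fewer slot, and the last category index becomes the all-zeros vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Map a category index to a dense one-hot list.

    With drop_last=True, the vector has num_categories - 1 entries and
    the last category is encoded implicitly as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# With 5 categories, index 2 maps to [0, 0, 1, 0] ...
print(one_hot(2, 5))                   # [0.0, 0.0, 1.0, 0.0]
# ... and index 4, the dropped category, maps to all zeros.
print(one_hot(4, 5))                   # [0.0, 0.0, 0.0, 0.0]
# Keeping all categories restores the explicit fifth slot.
print(one_hot(4, 5, drop_last=False))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

Note that Spark actually emits these as SparseVector values, e.g. `(2,[],[])` in the output above, which is the sparse form of the all-zeros vector.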

