Dealing with dynamic columns with VectorAssembler

Question

With Spark's VectorAssembler, the columns to be assembled need to be defined up front.

However, if the VectorAssembler is used in a pipeline whose earlier stages modify the columns of the data frame, how can I specify the input columns without hard-coding all the values manually?

df.columns does not yet contain the right values when the VectorAssembler constructor is called, so currently I see no way to handle this other than splitting the pipeline, which is bad as well because CrossValidator will no longer work properly.

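import org.apache.spark.ml.feature.VectorAssembler

// df.columns is evaluated here, when the assembler is built, not when the pipeline runs
val vectorAssembler = new VectorAssembler()
    .setInputCols(df.columns
      .filter(!_.contains("target"))
      .filter(!_.contains("idNumber")))
    .setOutputCol("features")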

Edit

The initial df

+---+---+----+
|foo| id| baz|
+---+---+----+
|  0|  1|   A|
|  1|  2|   A|
|  0|  3|null|
|  1|  4|   C|
+---+---+----+

will be transformed as follows. Null values in the original columns are imputed with the most frequent value, and some features are derived, e.g. isA, which is 1 if baz is A, 0 otherwise, and N if baz was originally null.

+---+---+---+---+
|foo| id|baz|isA|
+---+---+---+---+
|  0|  1|  A|  1|
|  1|  2|  A|  1|
|  0|  3|  A|  N|
|  1|  4|  C|  0|
+---+---+---+---+

Later in the pipeline, a StringIndexer is used to make the data fit for ML / VectorAssembler.

isA is not present in the original df, and it is not the only derived output column: all columns in this frame except foo and an id column should be transformed by the vector assembler.

I hope it is clear now.
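The requirement described above amounts to a small selection rule, sketched here in plain Scala without Spark. The names ColumnSelection and featureColumns are hypothetical, and the exclusion set simply mirrors the foo and id columns of the example frame.

```scala
// Hypothetical helper, not part of Spark: selects feature columns by exclusion.
object ColumnSelection {
  def featureColumns(allColumns: Seq[String], excluded: Set[String]): Seq[String] =
    allColumns.filterNot(excluded.contains)
}

object ColumnSelectionDemo extends App {
  val excluded = Set("foo", "id")

  // Columns known when the pipeline is defined.
  val initialColumns = Seq("foo", "id", "baz")
  // Columns after the earlier stages have added derived features.
  val transformedColumns = Seq("foo", "id", "baz", "isA")

  println(ColumnSelection.featureColumns(initialColumns, excluded))
  println(ColumnSelection.featureColumns(transformedColumns, excluded))
}
```

The same rule yields a different result depending on when it is evaluated, which is why the column list has to be resolved when the stage runs rather than when it is constructed.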

Recommended answer

I created a custom vector assembler (a 1:1 copy of the original) and then changed it to include all columns except those passed in to be excluded.

To make it clearer:

def setInputColsExcept(value: Array[String]): this.type = set(inputCols, value)

specifies the columns that should be excluded. Then

val remainingColumns = dataset.columns.filter(!$(inputCols).contains(_))

in the transform method filters for the desired columns.
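As an alternative to copying VectorAssembler's source, the same idea can be sketched as a thin wrapper that resolves the column list at transform time and delegates to the stock VectorAssembler. This is not the answer author's implementation: the class name ExcludingVectorAssembler and its excludedCols parameter are assumptions made for illustration, and the sketch assumes Spark MLlib (2.x or later) on the classpath.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.param.{Param, ParamMap, StringArrayParam}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical wrapper: assembles every column except the excluded ones.
class ExcludingVectorAssembler(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("excludingVectorAssembler"))

  // Columns to leave out of the feature vector (e.g. the target and id columns).
  val excludedCols = new StringArrayParam(
    this, "excludedCols", "columns to exclude from the feature vector")
  val outputCol = new Param[String](this, "outputCol", "output column name")

  setDefault(excludedCols, Array.empty[String])
  setDefault(outputCol, "features")

  def setExcludedCols(value: Array[String]): this.type = set(excludedCols, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Builds a stock assembler from the columns that exist at call time.
  private def assemblerFor(columns: Array[String]): VectorAssembler =
    new VectorAssembler()
      .setInputCols(columns.filterNot(c => $(excludedCols).contains(c)))
      .setOutputCol($(outputCol))

  override def transform(dataset: Dataset[_]): DataFrame =
    assemblerFor(dataset.columns).transform(dataset)

  override def transformSchema(schema: StructType): StructType =
    assemblerFor(schema.fieldNames).transformSchema(schema)

  override def copy(extra: ParamMap): ExcludingVectorAssembler = defaultCopy(extra)
}
```

Because the column list is computed inside transform, a stage like this can sit at the end of a Pipeline, and the whole pipeline can be handed to CrossValidator. The remaining columns must already be numeric, boolean, or vector typed, for example after the StringIndexer stage.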
