Spark pipeline VectorAssembler: should the other columns be dropped?

Question

A Spark VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) produces the following output:

id  | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]

As you can see, the last column contains all of the previous feature columns. Is it better or more performant to remove the other columns, keeping only the label/id and features? Or is that unnecessary overhead, and is it enough to feed the label/id and features into the estimator?

What happens when the VectorAssembler is used in a pipeline? Will only the final features column be used, or will collinearity (duplicate columns) be introduced if the original columns are not removed manually?

Recommended answer

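Please read the documentation carefully. Every classifier is parameterized by a features column (featuresCol). It does not consider any other columns, or the order of the columns. Leaving the original columns in the DataFrame therefore does not introduce collinearity: the estimator reads only the assembled vector named by featuresCol (plus the label column), never hour, mobile, or userFeatures on their own.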
