Spark pipeline VectorAssembler: drop other columns
Question
A Spark VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) produces the following output:
id | hour | mobile | userFeatures | clicked | features
----|------|--------|------------------|---------|-----------------------------
0 | 18 | 1.0 | [0.0, 10.0, 0.5] | 1.0 | [18.0, 1.0, 0.0, 10.0, 0.5]
As you can see, the last column contains all the previous features. Is it better or more performant to remove the other columns, e.g. to retain only the label/id and features? Or is that unnecessary overhead, and is it enough to just feed the label/id and features columns into the estimator?
What happens when the VectorAssembler is used in a pipeline? Will only the final features column be used, or will it introduce collinearity (duplicate columns) if the original columns are not removed manually?
Recommended answer
Please read the documentation carefully. Every classifier is parametrized by the features column (featuresCol). It does not read features from any other column, and the order of the columns does not matter.