Making the features of test data the same as train data after feature selection in Spark

Views: 18 Published: 2021/11/14 21:02:31 scala apache-spark feature-selection apache-spark-mllib apache-spark-ml


Question

I'm working in Scala. ChiSqSelector seems to reduce the dimensionality successfully, but I can't tell which features were dropped and which were kept. How can I find out which features were removed?

[WrappedArray(a, b, c),(5,[1,2,3],[1,1,1]),(2,[0],[1])]
[WrappedArray(b, d, e),(5,[0,2,4],[1,1,2]),(2,[1],[2])]
[WrappedArray(a, c, d),(5,[0,1,3],[1,1,1]),(2,[0],[1])]

PS: When I tried to give the test data the same features as the feature-selected training data, I found that I don't know how to do that in Scala.

Recommended answer

If you use the MLlib version of ChiSqSelector, you can read the model's selectedFeatures field:

val mllibModel: org.apache.spark.mllib.feature.ChiSqSelectorModel = ???
val features: Array[Int] = mllibModel.selectedFeatures
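Since selectedFeatures only returns indices into the original feature vector, answering "which features were kept and which were dropped" needs a lookup against the original feature names. The following is a minimal sketch in plain Scala, not part of the original answer. The vocabulary and the selected indices are hypothetical placeholders; in a real job the names would come from whatever produced the feature vector (for example the vocabulary of a CountVectorizerModel, if that is how the vectors were built).

```scala
object SelectedFeatureNames {
  // Names of the features the selector kept, in the order of the given indices.
  def keptNames(vocabulary: Array[String], selected: Array[Int]): Array[String] =
    selected.map(i => vocabulary(i))

  // Names of the features the selector dropped, in original column order.
  def droppedNames(vocabulary: Array[String], selected: Array[Int]): Array[String] = {
    val keep = selected.toSet
    vocabulary.indices.filterNot(keep.contains).map(i => vocabulary(i)).toArray
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical feature names, one per column of the original vector.
    val vocabulary = Array("a", "b", "c", "d", "e")
    // Hypothetical stand-in for model.selectedFeatures.
    val selected = Array(3, 4)

    println("kept:    " + keptNames(vocabulary, selected).mkString(", "))
    println("dropped: " + droppedNames(vocabulary, selected).mkString(", "))
  }
}
```

This lookup is only for inspection. It does not replace applying the fitted selector to new data.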

Nevertheless, when you work with test data, it is better to apply the selector that was fitted on the training dataset than to bother with manual selection:

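import org.apache.spark.rdd.RDD
val testData: RDD[org.apache.spark.mllib.linalg.Vector] = ???
// Projects every test vector onto the features chosen during training
val selectedTestData = mllibModel.transform(testData)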

The same rules apply to the ML version. You can use selectedFeatures to extract the array of indices:

val mlModel: org.apache.spark.ml.feature.ChiSqSelectorModel = ???
val features: Array[Int] = mlModel.selectedFeatures

but it is still better to keep the model and reuse it on new data:

val testData: org.apache.spark.sql.DataFrame = ???
mlModel.transform(testData)
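To make the "fit once on train, reuse on test" advice concrete, here is a hedged end-to-end sketch using the ML (DataFrame) API. It is not from the original answer: the toy rows, the column names "label", "features" and "selected", the local SparkSession, and the helper fitSelector are all assumptions made for illustration. It also assumes a Spark version in which org.apache.spark.ml.feature.ChiSqSelector is still available.

```scala
import org.apache.spark.ml.feature.{ChiSqSelector, ChiSqSelectorModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object SelectorReuse {
  // Fit the selector on the training data only (hypothetical helper).
  def fitSelector(train: DataFrame, numTop: Int): ChiSqSelectorModel =
    new ChiSqSelector()
      .setNumTopFeatures(numTop)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selected")
      .fit(train)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("selector-reuse")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data: (label, 5-dimensional sparse feature vector).
    val train = Seq(
      (1.0, Vectors.sparse(5, Array(1, 2, 3), Array(1.0, 1.0, 1.0))),
      (0.0, Vectors.sparse(5, Array(0, 2, 4), Array(1.0, 1.0, 2.0))),
      (1.0, Vectors.sparse(5, Array(0, 1, 3), Array(1.0, 1.0, 1.0)))
    ).toDF("label", "features")

    val test = Seq(
      (0.0, Vectors.sparse(5, Array(0, 4), Array(1.0, 1.0)))
    ).toDF("label", "features")

    val model = fitSelector(train, numTop = 2)

    // Indices (into the original 5-dimensional vector) that were kept.
    println(model.selectedFeatures.mkString(", "))

    // The same fitted model is applied to both datasets, so the
    // "selected" column has identical dimensions and column meaning.
    model.transform(train).show(false)
    model.transform(test).show(false)

    spark.stop()
  }
}
```

Because the test rows go through the model fitted on the training rows, no index bookkeeping is needed to keep the two feature spaces aligned.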

If you want a human-readable list of features, you can analyze the column metadata after the transformation, as shown in "Tagging columns as Categorical in Spark".
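As a rough illustration of reading that metadata, the sketch below uses AttributeGroup from org.apache.spark.ml.attribute to list the attribute names attached to a vector column. It is an assumption-laden example rather than the method from the linked question: the helper attributeNames and the placeholder text for unnamed attributes are hypothetical. Names are only present if an upstream stage (such as VectorAssembler) recorded them on the input column, otherwise the result is None.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame

object SelectedColumnMetadata {
  // Returns the attribute names stored in the ML metadata of a vector
  // column, or None when the column carries no per-attribute metadata.
  def attributeNames(df: DataFrame, column: String): Option[Array[String]] =
    AttributeGroup
      .fromStructField(df.schema(column))
      .attributes
      .map(attrs => attrs.map(a => a.name.getOrElse("<unnamed>")))
}
```

A possible use, with a hypothetical output column named "selected": `SelectedColumnMetadata.attributeNames(mlModel.transform(testData), "selected")`.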
