ChiSqSelector - Real Features - Spark
Problem description
I am building an NB model with Spark 1.6 and using ChiSqSelector to identify the top features. I have 7 features in total and am looking for the top 3. The process runs fine, but how do I identify which actual features were rated as the top ones? Since the data is categorized, I am not able to map the output back to the actual input columns.
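// Spark 1.6: the ml feature selectors operate on mllib.linalg vectors
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
val chidata = cat_recs.map(r => (r.getDouble(targetInd), Vectors.dense(featuresidx.map(r.getDouble(_)).toArray))).toDF("target","features")
val sel = new ChiSqSelector().setNumTopFeatures(3).setFeaturesCol("features").setLabelCol("target").setOutputCol("selectedFeatures")
val chiresult = sel.fit(chidata).transform(chidata)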
The output is:
scala> chiresult.foreach(println)
[1.0,[0.0,2.0,0.0,5.0,7.0,5.0,1.0],[0.0,5.0,7.0]]
[1.0,[4.0,3.0,0.0,5.0,7.0,5.0,1.0],[0.0,5.0,7.0]]
[0.0,[3.0,2.0,0.0,5.0,7.0,5.0,3.0],[0.0,5.0,7.0]]
[1.0,[1.0,2.0,0.0,1.0,7.0,5.0,2.0],[0.0,1.0,7.0]]
[1.0,[0.0,2.0,0.0,1.0,7.0,5.0,3.0],[0.0,1.0,7.0]]
Structure: target: double, features: vector, selectedFeatures: vector. From the above, let's take the first row as an example:
[1.0,[0.0,2.0,0.0,5.0,7.0,5.0,1.0],[0.0,5.0,7.0]]
How can I identify which 0.0 the selectedFeatures vector refers to? The same question applies to the 5th row.
Please help.
Thanks,
Bala
Recommended answer
In your example:
[1.0,[0.0,2.0,0.0,5.0,7.0,5.0,1.0],[0.0,5.0,7.0]]
the last column, [0.0,5.0,7.0], holds the values of the selected features, in this case features 2, 3 and 4 (counting from 0). To extract the feature indices, just use:
val model = sel.fit(chidata)
val importantFeatures = model.selectedFeatures
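Once you have the indices, mapping them back to column names is a plain lookup, as long as you know the order in which the columns were packed into the feature vector. Here is a minimal sketch in plain Scala (no Spark needed to run it). The column names are hypothetical placeholders, selectedFeatureNames is an illustrative helper rather than a Spark API, and the index array stands in for what model.selectedFeatures returns in the example above.

```scala
// Illustrative helper (not part of Spark): map positions in the feature
// vector back to the names of the columns that were packed into it.
// `selected` stands in for model.selectedFeatures, an Array[Int] of positions.
// `featureColumns` must list the columns in the same order used for Vectors.dense.
def selectedFeatureNames(selected: Array[Int], featureColumns: Seq[String]): Seq[String] =
  selected.toSeq.map { i =>
    require(i >= 0 && i < featureColumns.length, s"index $i is outside the feature vector")
    featureColumns(i)
  }

// Hypothetical names for the 7 input features, in vector order.
val featureColumns = Seq("age", "region", "plan", "device", "channel", "tenure", "segment")

// Indices from the example: features 2, 3 and 4 (counting from 0).
val selected = Array(2, 3, 4)

val names = selectedFeatureNames(selected, featureColumns)
println(names.mkString(", ")) // prints: plan, device, channel

// Sanity check against the first row of the question's output:
// picking positions 2, 3 and 4 reproduces the selectedFeatures vector.
val firstRow = Array(0.0, 2.0, 0.0, 5.0, 7.0, 5.0, 1.0)
val picked = selected.map(firstRow(_))
println(picked.mkString("[", ",", "]")) // prints: [0.0,5.0,7.0]
```

In the question's setup, and assuming cat_recs is a DataFrame and featuresidx holds the integer positions of the feature columns in it, the name list could be built as featuresidx.map(cat_recs.columns(_)) and passed together with model.selectedFeatures to the helper.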