Spark 1.6 Pearson correlation


Problem Description

Spark 1.6

If I have a dataset and I want to identify the features in that dataset with the greatest predictive power by using Pearson correlation, which tools should I use?

The naive approach I used... was:

// All candidate feature columns (everything except id, maxcykle and rul)
val columns = x.columns.toList.filterNot(List("id", "maxcykle", "rul") contains)
// One stat.corr() call -- and therefore one Spark job -- per column
val corrVithRul = columns.map(c => (c, x.stat.corr("rul", c, "pearson")))

Output:

    columns: List[String] = List(cykle, setting1, setting2, setting3, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, s15, s16, s17, s18, s19, s20, s21, label1, label2, a1, sd1, a2, sd2, a3, sd3, a4, sd4, a5, sd5, a6, sd6, a7, sd7, a8, sd8, a9, sd9, a10, sd10, a11, sd11, a12, sd12, a13, sd13, a14, sd14, a15, sd15, a16, sd16, a17, sd17, a18, sd18, a19, sd19, a20, sd20, a21, sd21)
    corrVithRul: List[(String, Double)] = List((cykle,-0.7362405993070199), (setting1,-0.0031984575547410617), (setting2,-0.001947628351500473), (setting3,NaN), (s1,-0.011460304217886725), (s2,-0.6064839743782909), (s3,-0.5845203909175897), (s4,-0.6789482333860454), (s5,-0.011121400898477964), (s6,-0.1283484484732187), (s7,0.6572226620548292), (s8,-0.5639684065744165), (s9,-0.3901015749180319), (s10,-0.04924720421765515), (s11,-0.6962281014554186), (s12,0.6719831036132922), (s13,-0.5625688251505582), (s14,-0.30676887025759053), (s15,-0.6426670441973734), (s16,-0.09716223410021836), (s17,-0.6061535537829589), (s18,NaN), (s19,NaN), (s20,0.6294284994377392), (s21,0.6356620421802835), (label1,-0.5665958821050425), (label2,-0.548191636440298), (a1,0.040592887198906136), (sd1,NaN), (a2,-0.7364292...

This, of course, submits one job per map iteration. Is Statistics.corr what I am looking for?

Recommended Answer

Statistics.corr looks like the correct choice here. Other options you may consider are RowMatrix.columnSimilarities (cosine similarities between columns, optionally using an optimized version with threshold-based sampling) and RowMatrix.computeCovariance. One way or another, you'll have to assemble your data into Vectors first. Assuming the columns are already of DoubleType, you can use VectorAssembler:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???

// Assemble every column except id, maxcykle and rul into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(df.columns.diff(Seq("id", "maxcykle", "rul")))
  .setOutputCol("features")

// The $"..." column syntax requires `import sqlContext.implicits._`
val rows = assembler.transform(df)
  .select($"features")
  .rdd
  .map(_.getAs[Vector]("features"))

Next, you can use Statistics.corr:

import org.apache.spark.mllib.stat.Statistics

// Pearson (the default method) correlation matrix over all assembled columns
Statistics.corr(rows)
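
If what you ultimately want is the correlation of every feature with rul computed in a single pass, rather than one job per column, one option is to keep rul in the assembled vector and read its column back out of the matrix returned by Statistics.corr. This is only a sketch under that assumption; names like inputCols, rowsWithRul, corrMatrix and rulIdx are illustrative:

// Keep "rul" in the assembled vector this time
val inputCols = df.columns.diff(Seq("id", "maxcykle"))

val rowsWithRul = new VectorAssembler()
  .setInputCols(inputCols)
  .setOutputCol("featuresWithRul")
  .transform(df)
  .select($"featuresWithRul")
  .rdd
  .map(_.getAs[Vector]("featuresWithRul"))

// Full Pearson correlation matrix computed in one pass
val corrMatrix = Statistics.corr(rowsWithRul, "pearson")

// Pull out the correlation of each feature with "rul"
val rulIdx = inputCols.indexOf("rul")
val corrWithRul = inputCols.zipWithIndex.collect {
  case (name, i) if name != "rul" => (name, corrMatrix(i, rulIdx))
}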

or convert it to a RowMatrix:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)

// Approximate cosine similarities between columns; 0.75 is the DIMSUM sampling threshold
mat.columnSimilarities(0.75)
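
RowMatrix.computeCovariance, the other option mentioned above, works the same way; just note that it returns a local numCols x numCols matrix on the driver, so it is only practical for a moderate number of columns:

// Local (numCols x numCols) covariance matrix
val cov = mat.computeCovariance()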
