How to get correlation matrix values in PySpark

Question

I have a correlation matrix computed as follows on PySpark 2.2:

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

# `spark` is the active SparkSession (predefined in the pyspark shell)
datos = spark.sql("""select * from proceso_riesgos.jdgc_bd_train_mn_ingresos""")

Variables_corr = ['ingreso_final_mix', 'ingreso_final_promedio',
                  'ingreso_final_mediana', 'ingreso_final_trimedia', 'ingresos_serv_q1',
                  'ingresos_serv_q2', 'ingresos_serv_q3', 'prom_ingresos_serv', 'y_correc']

assembler = VectorAssembler(
    inputCols=Variables_corr,
    outputCol="features")

# Keep only the correlation columns and drop rows where y_correc is null
datos1 = datos.select(Variables_corr).filter("y_correc is not null")
# Assemble the filtered frame (the original passed the unfiltered `datos`)
output = assembler.transform(datos1)
r1 = Correlation.corr(output, "features")

The result is a DataFrame with a single column, listed in the schema as "pearson(features): matrix":

[Row(pearson(features)=DenseMatrix(20, 20, [1.0, 0.9428, 0.8908, 0.913,
0.567, 0.5832, 0.6148, 0.6488, ..., -0.589, -0.6145, -0.5906, -0.5534,
-0.5346, -0.0797, -0.617, 1.0], False))]

I need to extract those values and export them to Excel, or otherwise be able to manipulate the result. A list would be desirable.

Thanks for the help!

Recommended answer

You are almost there! There is no need to use the old RDD-based MLlib API.

This is my method for generating a pandas DataFrame, which you can then export to Excel, CSV, or other formats.

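import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

def correlation_matrix(df, corr_columns, method='pearson'):
    # Pack the selected columns into a single vector column
    vector_col = "corr_features"
    assembler = VectorAssembler(inputCols=corr_columns, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    matrix = Correlation.corr(df_vector, vector_col, method)

    # The only cell of the result is a DenseMatrix; read it by position so that
    # method='spearman' also works, then convert it to a 2-D NumPy array
    result = matrix.collect()[0][0].toArray()
    return pd.DataFrame(result, columns=corr_columns, index=corr_columns)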
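Since the question also asks for a plain list, here is a minimal sketch of how the flat values inside a DenseMatrix map back to rows, without pandas. It assumes the DenseMatrix layout (column-major unless the trailing flag, False in the output above, is True). The helper names `matrix_to_rows` and `rows_to_csv`, the column names, and the 3x3 numbers are all made up for illustration; they are not part of PySpark.

```python
import csv
import io

def matrix_to_rows(values, n_rows, n_cols, is_transposed=False):
    """Turn a flat value list into a list of rows.

    Assumes the DenseMatrix layout: column-major by default,
    row-major when is_transposed is True.
    """
    if len(values) != n_rows * n_cols:
        raise ValueError("expected %d values, got %d" % (n_rows * n_cols, len(values)))
    if is_transposed:
        return [list(values[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]
    # Column-major: element (i, j) sits at index j * n_rows + i
    return [[values[j * n_rows + i] for j in range(n_cols)] for i in range(n_rows)]

def rows_to_csv(rows, names):
    """Render the rows as CSV text with the names as header and row labels."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([""] + list(names))
    for name, row in zip(names, rows):
        writer.writerow([name] + list(row))
    return buf.getvalue()

# Hypothetical 3x3 correlation matrix, flattened the way DenseMatrix stores it
names = ["a", "b", "c"]
flat = [1.0, 0.5, -0.2, 0.5, 1.0, 0.3, -0.2, 0.3, 1.0]
rows = matrix_to_rows(flat, 3, 3)
csv_text = rows_to_csv(rows, names)
print(csv_text)
```

A correlation matrix is symmetric, so the storage order does not change the numbers here, but it matters for any non-symmetric matrix. The CSV text can be written to a file and opened in Excel. If you already have the pandas DataFrame from the function above, its `to_csv` and `to_excel` methods cover the same export (`to_excel` needs an Excel writer package such as openpyxl installed), and `.values.tolist()` gives the nested list.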
