find mean and corr of 10,000 columns in pyspark Dataframe


Problem description

I have a DF with 10K columns and 70 million rows. I want to calculate the mean and corr of the 10K columns. I wrote the code below, but it won't work due to the 64KB code-size issue (https://issues.apache.org/jira/browse/SPARK-16845).

Data:

region dept week sal  val1 val2 val3 ... val10000
US     CS   1    1    2    1    1    ... 2
US     CS   2    1.5  2    3    1    ... 2
US     CS   3    1    2    2    2.1  ... 2
US     ELE  1    1.1  2    2    2.1  ... 2
US     ELE  2    2.1  2    2    2.1  ... 2
US     ELE  3    1    2    1    2    ... 2
UE     CS   1    2    2    1    2    ... 2

Code 1:

import pyspark.sql.functions as func

aggList = [func.mean(c) for c in df.columns if c not in ('region', 'dept', 'week')]  # exclude key columns
df2 = df.groupBy('region', 'dept').agg(*aggList)

Code 2:

aggList = [func.corr('sal', c).alias(c) for c in df.columns if c not in ('region', 'dept', 'week')]  # exclude key columns
df2 = df.groupBy('region', 'dept', 'week').agg(*aggList)

This fails. Is there any alternative way to overcome this bug? Has anyone tried a DF with 10K columns? Are there any suggestions for improving performance?

Recommended answer

We also ran into the 64KB issue, but in a where clause, which is filed under another bug report. The workaround we used was simply to do the operations/transformations in several steps.

In your case, this would mean that you don't do all the aggregations in one step. Instead, loop over the relevant columns in an outer operation (a code sketch follows the list):

  • Use select to create a temporary dataframe which contains just the columns you need for the operation.
  • Use groupBy and agg like you did, except not for a list of aggregations, but just for one (or two; you could combine the mean and corr).
  • After you have received references to all the temporary dataframes, use withColumn to append the aggregated columns from the temporary dataframes to a result df.
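
A minimal sketch of this loop-based workaround, under my own assumptions: the key columns are region, dept and week, both aggregations are grouped by all three keys, and the per-column aggregates are joined back on the keys (rather than appended with withColumn) because they live in separate dataframes.

import pyspark.sql.functions as func

keys = ['region', 'dept', 'week']
valueCols = [c for c in df.columns if c not in keys + ['sal']]

# start with one row per group; aggregated columns are joined on step by step
result = df.groupBy(keys).agg(func.mean('sal').alias('sal_mean'))

for c in valueCols:
    # temporary dataframe holding only the columns needed for this one aggregation
    tmp = df.select(keys + ['sal', c])
    agg = tmp.groupBy(keys).agg(
        func.mean(c).alias(c + '_mean'),
        func.corr('sal', c).alias(c + '_corr'))
    # append this column's aggregates to the result via a join on the keys
    result = result.join(agg, on=keys, how='left')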

Due to the lazy evaluation of the Spark DAG, this is of course slower than doing it in one operation. But it should evaluate the whole analysis in one run.
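
Since the per-column loop touches the 70-million-row input once per value column, caching the input up front may help; this is an extra suggestion on my part, not something from the answer, and the output path below is purely illustrative.

df = df.cache()  # keep the input around across the per-column aggregations
# ... run the per-column loop from the sketch above ...
result.write.parquet('/tmp/aggregates')  # hypothetical path; a single action evaluates the whole DAG in one run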
