Calculate the mode of a PySpark DataFrame column?

Question

Ultimately what I want is the mode of a column, for all the columns in the DataFrame. For other summary statistics, I see a couple of options: use DataFrame aggregation, or map the columns of the DataFrame to an RDD of vectors (something I'm also having trouble doing) and use colStats from MLlib. But I don't see mode as an option there.
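
For reference, this is roughly what I mean by the colStats route (a minimal sketch, assuming a purely numeric DataFrame df and the standard pyspark.mllib imports; it gives per-column means, variances, etc., but indeed no mode):

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors

# Turn each Row into a dense vector of doubles (assumes every column is numeric)
vectors = df.rdd.map(lambda row: Vectors.dense([float(v) for v in row]))

summary = Statistics.colStats(vectors)
summary.mean(), summary.variance()  # per-column summary statistics, but no mode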

Answer

The problem with mode is pretty much the same as with the median: while it is easy to compute, the computation is rather expensive. It can be done either with a sort followed by local and global aggregations, or with a just-another-wordcount plus a filter:

import numpy as np
from pyspark.sql.functions import col, max as max_

np.random.seed(1)

# Toy DataFrame with a single integer column "x"
# (assumes an active SparkContext `sc` and SQLContext/SparkSession, as in the shell)
df = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

# Count occurrences of each value, then keep one value whose count equals the maximum
cnts = df.groupBy("x").count()
mode = cnts.join(
    cnts.agg(max_("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]
## 0
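
For comparison, a sort-based take on the same counts (a sketch, not part of the original answer; with ties it simply returns one of the most frequent values):

from pyspark.sql.functions import desc

# Order the value counts descending and take the value from the top row
df.groupBy("x").count().orderBy(desc("count")).first()[0]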

Either way it may require a full shuffle for each column.
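
Since the question asks for the mode of every column, a hedged sketch (not from the original answer) that simply repeats the wordcount-and-filter pattern once per column, one aggregation each:

from pyspark.sql.functions import col, max as max_

def column_mode(df, c):
    # Wordcount-and-filter, as above, for an arbitrary column c
    cnts = df.groupBy(c).count()
    return (cnts
            .join(cnts.agg(max_("count").alias("max_")), col("count") == col("max_"))
            .limit(1)
            .select(c)
            .first()[0])

# One aggregation (and potentially one full shuffle) per column
modes = {c: column_mode(df, c) for c in df.columns}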
