Combining multiple groupBy functions into 1


Problem description


Using this code to find the mode:

import numpy as np
np.random.seed(1)

df2 = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

cnts = df2.groupBy("x").count()
mode = cnts.join(
    cnts.agg(max("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]

taken from Calculate the mode of a PySpark DataFrame column?

returns this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-53-2a9274e248ac> in <module>()
      8 cnts = df.groupBy("x").count()
      9 mode = cnts.join(
---> 10     cnts.agg(max("count").alias("max_")), col("count") == col("max_")
     11 ).limit(1).select("x")
     12 mode.first()[0]

AttributeError: 'str' object has no attribute 'alias'

Instead of that solution, I'm attempting this custom one:

df.show()

# Mode of c1: count each value, sort by count descending, take the top pair
cnts = df.groupBy("c1").count()
print cnts.rdd.map(tuple).sortBy(lambda a: a[1], ascending=False).first()

# Same again for c2
cnts = df.groupBy("c2").count()
print cnts.rdd.map(tuple).sortBy(lambda a: a[1], ascending=False).first()

which returns the most frequent (value, count) pair for each column.

So the modes of c1 & c2 are 2.0 and 3.0 respectively.

Can this be applied to all columns c1, c2, c3, c4, c5 in the dataframe instead of explicitly selecting each column as I have done?

Solution

It looks like you're using built-in max, not a SQL function.
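That is exactly what the traceback shows: the built-in max iterates over the string "count" and returns its largest character, a plain str, which has no .alias method. A quick REPL illustration:

>>> max("count")   # built-in max scans the characters of the string
'u'
>>> max("count").alias("max_")
AttributeError: 'str' object has no attribute 'alias'

The fix is to reference the SQL function explicitly: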

import pyspark.sql.functions as F

cnts.agg(F.max("count").alias("max_"))
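With that change, the single-column snippet from the question runs as intended (the same code as above, only with the SQL functions qualified; it assumes the same SparkContext sc as in the question):

import numpy as np
import pyspark.sql.functions as F

np.random.seed(1)

df2 = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

cnts = df2.groupBy("x").count()
mode = cnts.join(
    cnts.agg(F.max("count").alias("max_")),
    F.col("count") == F.col("max_")
).limit(1).select("x")
mode.first()[0]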

To find the mode over multiple columns of the same type, you can reshape the DataFrame to long format (melt, as defined in Pandas Melt function in Apache Spark).
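For reference, that melt helper can be sketched along these lines (a rough adaptation of the linked answer; the exact signature there may differ):

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One (variable, value) struct per melted column; all melted
    # columns must share a common type for array() to accept them
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # Explode the array so each (column, value) pair becomes its own row
    tmp = df.withColumn("vars_and_vals", explode(vars_and_vals))
    cols = id_vars + [
        col("vars_and_vals")[name].alias(name)
        for name in (var_name, value_name)
    ]
    return tmp.select(*cols)

With melt in place, count per (column, value) pair and take the struct-wise max per column: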

(melt(df, [], df.columns)
    # Count by column and value
    .groupBy("variable", "value")
    .count()
    # Find mode per column
    .groupBy("variable")
    .agg(F.max(F.struct("count", "value")).alias("mode"))
    .select("variable", "mode.value"))

+--------+-----+
|variable|value|
+--------+-----+
|      c5|  6.0|
|      c1|  2.0|
|      c4|  5.0|
|      c3|  4.0|
|      c2|  3.0|
+--------+-----+
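The F.max(F.struct("count", "value")) step works because Spark compares structs field by field, so the maximum struct belongs to the row with the highest count (ties broken by the larger value). If you assign the expression above to a variable (say result, a hypothetical name), the small per-column result can be pulled back to the driver as a plain dict:

modes = {row["variable"]: row["value"] for row in result.collect()}
# e.g. {'c1': 2.0, 'c2': 3.0, 'c3': 4.0, 'c4': 5.0, 'c5': 6.0}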
