Count including null in PySpark Dataframe Aggregation

Question

I am trying to get some counts on a DataFrame using agg and count.

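from pyspark.sql import Row, functions as F
from pyspark.sql.functions import col  # used by the aggregations below

row = Row("Cat", "Date")
df = sc.parallelize([
    row("A", "2017-03-03"),
    row("A", None),
    row("B", "2017-03-04"),
    row("B", "Garbage"),
    row("A", "2016-03-04"),
]).toDF()
# Strings that cannot be parsed as a date (e.g. 'Garbage') become null in Casted
df = df.withColumn("Casted", df["Date"].cast("date"))
df.show()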

(
    df.groupby(df["Cat"])
    .agg(
        # F.count(col('Date').isNull() | col('Date').isNotNull()).alias('Date_Count'),
        F.count("Date").alias("Date_Count"),
        F.count("Casted").alias("Valid_Date_Count"),
    )
    .show()
)

The function F.count() gives me only the non-null count. Is there a way to get the count including nulls, other than using an 'OR' condition?

The invalid count doesn't seem to work either: the & condition does not behave as expected.

(
    df.groupby(df["Cat"])
    .agg(
        F.count("*").alias("count"),
        F.count("Date").alias("Date_Count"),
        F.count("Casted").alias("Valid_Date_Count"),
        F.count(col("Date").isNotNull() & col("Casted").isNull()).alias("invalid"),
    )
    .show()
)

Recommended answer

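Cast the boolean expression to an int and sum it. F.count() counts every non-null value, and the boolean is non-null whether it is true or false, so counting it returns the number of rows rather than the number of matches.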
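A rough plain-Python model of the difference between counting and summing the flag may help. Here None stands in for SQL NULL, and `sql_count` and the hard-coded rows for category B are illustrative names of my own, not part of the PySpark API:

```python
# Plain-Python model of the aggregate semantics; None plays the role of SQL NULL.
# Rows are (Cat, Date, Casted) for category "B" from the question.
rows_b = [
    ("B", "2017-03-04", "2017-03-04"),
    ("B", "Garbage", None),  # the cast to date failed, so Casted is null
]


def sql_count(values):
    """Hypothetical helper modelling COUNT(expr): counts non-null values."""
    return sum(1 for v in values if v is not None)


# The condition is True or False for every row, never None.
invalid_flags = [date is not None and casted is None for _, date, casted in rows_b]

count_of_flags = sql_count(invalid_flags)               # False is counted too
sum_of_flags = sum(int(flag) for flag in invalid_flags)  # only True adds 1

print(count_of_flags, sum_of_flags)  # prints "2 1"
```

Counting the flag gives the group size (2), while summing the flag cast to an int gives the number of rows that satisfy the condition (1).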

df\
    .groupby(df['Cat'])\
    .agg(
        F.count('Date').alias('Date_Count'),
        F.count('Casted').alias('Valid_Date_Count'),
        # 1 where Date is present but the cast to date failed, else 0
        F.sum((~F.isnull('Date') & F.isnull('Casted')).cast('int')).alias('Invalid_Date_Count')
    ).show()

    +---+----------+----------------+------------------+
    |Cat|Date_Count|Valid_Date_Count|Invalid_Date_Count|
    +---+----------+----------------+------------------+
    |  B|         2|               1|                 1|
    |  A|         2|               2|                 0|
    +---+----------+----------------+------------------+