关于窗口函数中F.count(F.col("some column").isNotNull())的用法 [英] regarding the usage of F.count(F.col("some column").isNotNull()) in window function

查看:153
本文介绍了关于窗口函数中F.count(F.col("some column").isNotNull())的用法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在窗口函数中测试F.count(F.col().isNotNull())的用法.请参阅以下代码脚本

I am trying to test the usage of F.count(F.col().isNotNull()) in window function. Please see the following code script

from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
list=([1,5,4],
    [1,5,None],
    [1,5,1],
    [1,5,4],
    [2,5,1],
    [2,5,2],
    [2,5,None],
    [2,5,None],
     [2,5,4])
df=spark.createDataFrame(list,['I_id','p_id','xyz'])
w= Window().partitionBy("I_id","p_id").orderBy(F.col("xyz").asc_nulls_first())
df.withColumn("xyz1",F.count(F.col("xyz").isNotNull()).over(w)).show()

结果如下所示.在前两行中,我的理解是 F.count(F.col("xyz")应该将非零项目从 xyz = -infinity 算到 xyz = null ,下面的 isNotNull()是如何处理的:为什么在 xyz1 列.

The result is shown as follows. In the first two rows, my understanding is that F.count(F.col("xyz") should count the non-zero items from xyz = -infinity to xyz = null, how does the following isNotNull() process this. Why it gets 2 for the first two rows in xyz1 column.

推荐答案

如果您计算布尔值,因为布尔值是True或False,所以您将计算指定窗口中的所有行,而不管 xyz 是否为空.

If you count the Booleans, since they are either True or False, you will count all the rows in the specified window, regardless of whether xyz is null or not.

您可以做的是对 isNotNull 布尔值进行 sum 求和,而不是对它们进行计数.

What you could do is to sum the isNotNull Boolean rather than counting them.

df.withColumn("xyz1",F.sum(F.col("xyz").isNotNull().cast('int')).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
|   2|   5|null|   0|
|   2|   5|null|   0|
|   2|   5|   1|   1|
|   2|   5|   2|   2|
|   2|   5|   4|   3|
|   1|   5|null|   0|
|   1|   5|   1|   1|
|   1|   5|   4|   3|
|   1|   5|   4|   3|
+----+----+----+----+

另一种方法是使用 when 时进行条件计数:

Another way is to do a conditional count using when:

df.withColumn("xyz1",F.count(F.when(F.col("xyz").isNotNull(), 1)).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
|   2|   5|null|   0|
|   2|   5|null|   0|
|   2|   5|   1|   1|
|   2|   5|   2|   2|
|   2|   5|   4|   3|
|   1|   5|null|   0|
|   1|   5|   1|   1|
|   1|   5|   4|   3|
|   1|   5|   4|   3|
+----+----+----+----+

这篇关于关于窗口函数中F.count(F.col("some column").isNotNull())的用法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆