关于窗口函数中F.count(F.col("some column").isNotNull())的用法 [英] regarding the usage of F.count(F.col("some column").isNotNull()) in window function
问题描述
我正在尝试在窗口函数中测试F.count(F.col().isNotNull())的用法.请参阅以下代码脚本
I am trying to test the usage of F.count(F.col().isNotNull()) in window function. Please see the following code script
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
list=([1,5,4],
[1,5,None],
[1,5,1],
[1,5,4],
[2,5,1],
[2,5,2],
[2,5,None],
[2,5,None],
[2,5,4])
df=spark.createDataFrame(list,['I_id','p_id','xyz'])
w= Window().partitionBy("I_id","p_id").orderBy(F.col("xyz").asc_nulls_first())
df.withColumn("xyz1",F.count(F.col("xyz").isNotNull()).over(w)).show()
结果如下所示.在前两行中,我的理解是 F.count(F.col("xyz")
应该将非零项目从 xyz = -infinity
算到 xyz = null
,下面的 isNotNull()
是如何处理的:为什么在 xyz1
列.
The result is shown as follows. In the first two rows, my understanding is that F.count(F.col("xyz")
should count the non-zero items from xyz = -infinity
to xyz = null
, how does the following isNotNull()
process this. Why it gets 2
for the first two rows in xyz1
column.
推荐答案
如果您计算布尔值,因为布尔值是True或False,所以您将计算指定窗口中的所有行,而不管 xyz
是否为空.
If you count the Booleans, since they are either True or False, you will count all the rows in the specified window, regardless of whether xyz
is null or not.
您可以做的是对 isNotNull
布尔值进行 sum
求和,而不是对它们进行计数.
What you could do is to sum
the isNotNull
Boolean rather than counting them.
df.withColumn("xyz1",F.sum(F.col("xyz").isNotNull().cast('int')).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
另一种方法是使用 when
时进行条件计数:
Another way is to do a conditional count using when
:
df.withColumn("xyz1",F.count(F.when(F.col("xyz").isNotNull(), 1)).over(w)).show()
+----+----+----+----+
|I_id|p_id| xyz|xyz1|
+----+----+----+----+
| 2| 5|null| 0|
| 2| 5|null| 0|
| 2| 5| 1| 1|
| 2| 5| 2| 2|
| 2| 5| 4| 3|
| 1| 5|null| 0|
| 1| 5| 1| 1|
| 1| 5| 4| 3|
| 1| 5| 4| 3|
+----+----+----+----+
这篇关于关于窗口函数中F.count(F.col("some column").isNotNull())的用法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!