Unable to filter DataFrame using Window function in Spark
Question
I am trying to use a logical expression based on a window function to detect duplicate records:
df
.where(count("*").over(Window.partitionBy($"col1",$"col2"))>lit(1))
.show
In Spark 2.1.1 this gives:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate
On the other hand, it works if I assign the result of the window function to a new column and then filter on that column:
df
.withColumn("count", count("*").over(Window.partitionBy($"col1",$"col2")))
.where($"count">lit(1)).drop($"count")
.show
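Putting the workaround together, here is a minimal self-contained sketch (assuming a SparkSession `spark` is in scope; the toy DataFrame and its columns `col1`/`col2` are made up for illustration):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}
import spark.implicits._

// Hypothetical sample data: ("a", 1) appears twice, ("b", 2) once.
val df = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("col1", "col2")

val w = Window.partitionBy($"col1", $"col2")

// Materialize the window count as a column first, then filter on it.
df.withColumn("count", count("*").over(w))
  .where($"count" > lit(1))
  .drop("count")
  .show()
// Only the duplicated ("a", 1) rows survive the filter.
```

Because the count is an ordinary column by the time `.where` runs, Spark no longer has to evaluate a window function inside the filter predicate.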
I wonder how I can write this without using a temporary column?
Answer
Window functions cannot be used within a filter: the WHERE clause is evaluated before window functions are computed, so Spark rejects the expression. You have to compute the window function as an additional column and filter on that.
What you can do is pull the window function into the select.
df.select(col("1"), col("2"), lag(col("2"), 1).over(window).alias("2_lag")).filter(col("2_lag") === col("2"))
Then you have it in one statement.
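Applied to the original count-based duplicate check, the same pattern might look like this (a sketch only; `df` and the columns `col1`/`col2` are taken from the question, and the alias `cnt` is made up):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

val w = Window.partitionBy(col("col1"), col("col2"))

// Compute the window count inside select, then filter on the
// aliased column, all in one statement with no withColumn step.
df.select(col("col1"), col("col2"),
    count("*").over(w).alias("cnt"))
  .filter(col("cnt") > 1)
  .show()
```

Note this still materializes the count as a (selected) column; the difference from the question's workaround is only that it avoids a separate `withColumn`/`drop` pair.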