dataframe: how to groupBy/count then filter on count in Scala
Question
Spark 1.4.1
I encountered a situation where grouping a DataFrame, then counting and filtering on the 'count' column, raises the exception below:
import sqlContext.implicits._
import org.apache.spark.sql._
case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
throws the exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found count >= 2
Workaround:
Renaming the column makes the problem vanish (I suspect because there is then no conflict with the interpolated 'count' function):
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
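A variant of the same idea is to name the aggregate column up front instead of renaming it afterwards. This is a sketch assuming the `count` function from `org.apache.spark.sql.functions` is available (it is in Spark 1.4+):

```scala
import org.apache.spark.sql.functions.count

// Computing the aggregate explicitly with agg(...) lets us choose the
// output column name with .as(...), so no rename step is needed and
// the filter string never mentions the reserved word "count".
df.groupBy("x")
  .agg(count("x").as("n"))
  .filter("n >= 2")
  .show()
```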
So, is this expected behavior, a bug, or is there a canonical way to work around it?
Thanks, Alex
Answer
When you pass a string to the filter function, the string is interpreted as SQL. count is a SQL keyword, and using count as an identifier confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a string:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()
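If you prefer to keep the string form, backtick-quoting the column name should also work, since Spark's SQL parser treats a backtick-quoted name as an identifier rather than a keyword. This is an untested sketch against the Spark 1.4 parser:

```scala
// Backticks tell the SQL expression parser that `count` is a column
// name, not the COUNT keyword, so the string filter parses cleanly.
df.groupBy("x").count()
  .filter("`count` >= 2")
  .show()
```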