dataframe: how to groupBy/count then filter on count in Scala
Question
Spark 1.4.1
I've encountered a situation where grouping a DataFrame, then counting and filtering on the 'count' column, raises the exception below:
import sqlContext.implicits._
import org.apache.spark.sql._
case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
This raises the exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found count >= 2
Workaround:
Renaming the column makes the problem vanish (I suspect because there is then no conflict with the built-in 'count' function):
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
So, is this expected behavior, a bug, or is there a canonical way to work around it?
Thanks, Alex
Recommended answer
When you pass a string to the filter function, the string is interpreted as SQL. count is a SQL keyword, and using count as a column name confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a string:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()
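For intuition, the groupBy/count/filter pipeline above computes the same result as this plain-Scala sketch over the raw sequence from the question (no Spark session needed; the object name is just for illustration):

```scala
// Plain-Scala equivalent of df.groupBy("x").count().filter($"count" >= 2),
// applied to the raw values from the example data Seq(Paf(2), Paf(1), Paf(2)).
object CountFilterSketch {
  def main(args: Array[String]): Unit = {
    val myData = Seq(2, 1, 2)

    // groupBy("x").count(): count occurrences of each distinct value
    val counts: Map[Int, Int] =
      myData.groupBy(identity).map { case (k, vs) => (k, vs.size) }

    // filter($"count" >= 2): keep only values seen at least twice
    val frequent = counts.filter { case (_, n) => n >= 2 }

    println(frequent) // Map(2 -> 2)
  }
}
```

Here only x = 2 appears twice, so only that row survives the filter, matching what the DataFrame version returns once the column-expression form is used.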