Top N items from a Spark DataFrame/RDD

Problem Description

My requirement is to get the top N items from a dataframe.

I have this DataFrame:

import spark.implicits._  // needed for toDF (already in scope in spark-shell)

val df = List(
      ("MA", "USA"),
      ("MA", "USA"),
      ("OH", "USA"),
      ("OH", "USA"),
      ("OH", "USA"),
      ("OH", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("CT", "USA"),
      ("CT", "USA"),
      ("CT", "USA"),
      ("CT", "USA"),
      ("CT", "USA")).toDF("value", "country")

I was able to map it to an RDD[((Int, String), Long)] colValCount, read as ((colIdx, value), count):

((0,CT),5)
((0,MA),2)
((0,OH),4)
((0,NY),6)
((1,USA),17)
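
For context, a mapping like this could be produced along the following lines (a minimal sketch; the actual code that built colValCount isn't shown in the question):

import org.apache.spark.rdd.RDD

// pair every cell with its column index, then count occurrences of each (colIdx, value)
val colValCount: RDD[((Int, String), Long)] = df.rdd
  .flatMap(row => row.toSeq.zipWithIndex.map { case (v, idx) => ((idx, v.toString), 1L) })
  .reduceByKey(_ + _)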

Now I need to get the top 2 items for each column index. So my expected output is this:

RDD[((Int, String), Long)]

((0,CT),5)
((0,NY),6)
((1,USA),17)

I've tried using the freqItems API on the DataFrame, but it's slow.

Any suggestion is welcome.

Recommended Answer

The easiest way to do this - a natural window function - is by writing SQL. Spark comes with SQL syntax, and SQL is a great and expressive tool for this problem.

Register your dataframe as a temp table, and then group and window on it.

spark.sql("""SELECT idx, value, ROW_NUMBER() OVER (PARTITION BY idx ORDER BY c DESC) as r 
             FROM (
               SELECT idx, value, COUNT(*) as c 
               FROM (SELECT 0 as idx, value FROM df UNION ALL SELECT 1, country FROM df) 
               GROUP BY idx, value) 
             HAVING r <= 2""").show()

I'd like to see if any of the procedural / scala approaches will let you perform the window function without an iteration or loop. I'm not aware of anything in the Spark API that would support it.
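
For reference, the same grouping and window logic can also be expressed through the DataFrame API rather than raw SQL; a minimal sketch, assuming Spark 2.x:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, row_number}

// stack (idx, value) pairs for both columns, count each pair, then rank within each idx
val counts = df.select(lit(0).as("idx"), col("value"))
  .union(df.select(lit(1).as("idx"), col("country")))
  .groupBy("idx", "value")
  .agg(count(lit(1)).as("c"))

val w = Window.partitionBy("idx").orderBy(col("c").desc)
counts.withColumn("r", row_number().over(w)).where(col("r") <= 2).show()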

Incidentally, if you have an arbitrary number of columns you want to include then you can quite easily generate the inner section (SELECT 0 as idx, value ... UNION ALL SELECT 1, country, etc) dynamically using the list of columns.
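
For instance, a sketch of generating that inner section from df.columns and plugging it into the corrected query above (the column names here are just the two from this example):

// build "SELECT 0 AS idx, value FROM df UNION ALL SELECT 1, country FROM df ..." for every column
val innerSql = df.columns.zipWithIndex
  .map { case (name, idx) => s"SELECT $idx AS idx, `$name` AS value FROM df" }
  .mkString(" UNION ALL ")

spark.sql(s"""SELECT idx, value, c FROM (
                SELECT idx, value, c,
                       ROW_NUMBER() OVER (PARTITION BY idx ORDER BY c DESC) AS r
                FROM (SELECT idx, value, COUNT(*) AS c FROM ($innerSql) GROUP BY idx, value))
              WHERE r <= 2""").show()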
