Top N items from a Spark DataFrame/RDD

Problem Description

My requirement is to get the top N items from a dataframe.

I have this DataFrame:

import spark.implicits._  // needed for toDF (already in scope in spark-shell)

val df = List(
      ("MA", "USA"),
      ("MA", "USA"),
      ("OH", "USA"),
      ("OH", "USA"),
      ("OH", "USA"),
      ("OH", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("NY", "USA"),
      ("CT", "USA"),
      ("CT", "USA"),
      ("CT", "USA"),
      ("CT", "USA"),
      ("CT", "USA")).toDF("value", "country")

I was able to map it to an RDD[((Int, String), Long)] colValCount, read as ((colIdx, value), count):

((0,CT),5)
((0,MA),2)
((0,OH),4)
((0,NY),6)
((1,USA),17)
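
For context, a mapping like this could be produced along the following lines (a minimal sketch; the actual code that built colValCount isn't shown in the question):

import org.apache.spark.rdd.RDD

// pair every cell with its column index, then count occurrences of each (colIdx, value)
val colValCount: RDD[((Int, String), Long)] = df.rdd
  .flatMap(row => row.toSeq.zipWithIndex.map { case (v, idx) => ((idx, v.toString), 1L) })
  .reduceByKey(_ + _)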

Now I need to get the top 2 items for each column index. So my expected output is this:

RDD[((Int, String), Long)]

((0,CT),5)
((0,NY),6)
((1,USA),17)

I've tried using the freqItems API on the DataFrame, but it's slow.

Any suggestion is welcome.

Recommended Answer

The easiest way to do this - a natural window function - is by writing SQL. Spark comes with SQL syntax, and SQL is a great and expressive tool for this problem.

Register your dataframe as a temp table, and then group and window on it.

spark.sql("""SELECT idx, value, ROW_NUMBER() OVER (PARTITION BY idx ORDER BY c DESC) as r 
             FROM (
               SELECT idx, value, COUNT(*) as c 
               FROM (SELECT 0 as idx, value FROM df UNION ALL SELECT 1, country FROM df) 
               GROUP BY idx, value) 
             HAVING r <= 2""").show()

I'd like to see if any of the procedural / scala approaches will let you perform the window function without an iteration or loop. I'm not aware of anything in the Spark API that would support it.
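
For reference, the same grouping and window logic can also be expressed through the DataFrame API rather than raw SQL; a minimal sketch, assuming Spark 2.x:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, row_number}

// stack (idx, value) pairs for both columns, count each pair, then rank within each idx
val counts = df.select(lit(0).as("idx"), col("value"))
  .union(df.select(lit(1).as("idx"), col("country")))
  .groupBy("idx", "value")
  .agg(count(lit(1)).as("c"))

val w = Window.partitionBy("idx").orderBy(col("c").desc)
counts.withColumn("r", row_number().over(w)).where(col("r") <= 2).show()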

Incidentally, if you have an arbitrary number of columns you want to include then you can quite easily generate the inner section (SELECT 0 as idx, value ... UNION ALL SELECT 1, country, etc) dynamically using the list of columns.
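
For instance, a sketch of generating that inner section from df.columns and plugging it into the corrected query above (the column names here are just the two from this example):

// build "SELECT 0 AS idx, value FROM df UNION ALL SELECT 1, country FROM df ..." for every column
val innerSql = df.columns.zipWithIndex
  .map { case (name, idx) => s"SELECT $idx AS idx, `$name` AS value FROM df" }
  .mkString(" UNION ALL ")

spark.sql(s"""SELECT idx, value, c FROM (
                SELECT idx, value, c,
                       ROW_NUMBER() OVER (PARTITION BY idx ORDER BY c DESC) AS r
                FROM (SELECT idx, value, COUNT(*) AS c FROM ($innerSql) GROUP BY idx, value))
              WHERE r <= 2""").show()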
