Top N items from a Spark DataFrame/RDD
Question
My requirement is to get the top N items from a dataframe.
I have this DataFrame:
val df = List(
("MA", "USA"),
("MA", "USA"),
("OH", "USA"),
("OH", "USA"),
("OH", "USA"),
("OH", "USA"),
("NY", "USA"),
("NY", "USA"),
("NY", "USA"),
("NY", "USA"),
("NY", "USA"),
("NY", "USA"),
("CT", "USA"),
("CT", "USA"),
("CT", "USA"),
("CT", "USA"),
("CT", "USA")).toDF("value", "country")
I was able to map it to an RDD[((Int, String), Long)]
colValCount:
Read: ((colIdx, value), count)
((0,CT),5)
((0,MA),2)
((0,OH),4)
((0,NY),6)
((1,USA),17)
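For reference, a minimal sketch of how such a count could be produced (the question doesn't show the mapping code, so this is an assumption). Plain Scala collections stand in for the RDD here; on a real RDD the same shape would be `df.rdd.flatMap(...).reduceByKey(_ + _)`:

```scala
// Hypothetical sketch: build ((colIdx, value), count) pairs from the rows.
// On an RDD this would be flatMap followed by reduceByKey(_ + _);
// plain collections are used here so the logic is easy to follow.
object ColValCount {
  def colValCounts(rows: Seq[(String, String)]): Map[(Int, String), Long] =
    rows
      .flatMap { case (value, country) => Seq((0, value), (1, country)) } // one pair per column
      .groupBy(identity)                                                  // group equal (colIdx, value) keys
      .map { case (key, occurrences) => (key, occurrences.size.toLong) }  // count occurrences

  def main(args: Array[String]): Unit = {
    val rows = Seq(("MA", "USA"), ("MA", "USA"), ("OH", "USA"))
    println(colValCounts(rows))
  }
}
```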
Now I need to get the top 2 items for each column index. So my expected output is this:
RDD[((Int, String), Long)]
((0,CT),5)
((0,NY),6)
((1,USA),17)
I've tried using the freqItems API on the DataFrame, but it's slow.
Any suggestions are welcome.
Answer
The easiest way to do this - a natural window function - is by writing SQL. Spark comes with SQL syntax, and SQL is a great and expressive tool for this problem.
Register your dataframe as a temp table, and then group and window on it:
df.createOrReplaceTempView("df")

spark.sql("""
  SELECT idx, value, c
  FROM (
    SELECT idx, value, c,
           ROW_NUMBER() OVER (PARTITION BY idx ORDER BY c DESC) AS r
    FROM (
      SELECT idx, value, COUNT(*) AS c
      FROM (SELECT 0 AS idx, value FROM df
            UNION ALL
            SELECT 1, country FROM df)
      GROUP BY idx, value))
  WHERE r <= 2""").show()

(Note that the rank filter has to live in an outer query: Spark SQL won't let you filter on a window-function alias in the same SELECT, and HAVING requires a GROUP BY at that level.)
I'd like to see if any of the procedural / scala approaches will let you perform the window function without an iteration or loop. I'm not aware of anything in the Spark API that would support it.
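One procedural route (an assumption on my part, not something the answer demonstrates) is to group the counts by column index and keep the N largest per group; on an RDD the same shape could be expressed with `aggregateByKey` or MLlib's `topByKey`. The sketch below shows the core logic on plain Scala collections:

```scala
// Hedged sketch: top-N per column index without SQL or window functions.
// `counts` has the same shape as the RDD[((Int, String), Long)] in the question.
object TopNPerKey {
  def topN(counts: Seq[((Int, String), Long)], n: Int): Seq[((Int, String), Long)] =
    counts
      .groupBy { case ((colIdx, _), _) => colIdx }             // one group per column index
      .values
      .flatMap(_.sortBy { case (_, count) => -count }.take(n)) // keep the n largest counts
      .toSeq

  def main(args: Array[String]): Unit = {
    val counts = Seq(
      ((0, "CT"), 5L), ((0, "MA"), 2L), ((0, "OH"), 4L),
      ((0, "NY"), 6L), ((1, "USA"), 17L))
    topN(counts, 2).foreach(println)
  }
}
```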
Incidentally, if you have an arbitrary number of columns you want to include, then you can quite easily generate the inner section (SELECT 0 AS idx, value ... UNION ALL SELECT 1, country, etc.) dynamically using the list of columns.
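For instance, the inner UNION ALL section could be generated from the column names like this (a sketch; the helper name is made up for illustration):

```scala
// Hypothetical helper: build the inner UNION ALL section of the query
// from a list of column names and a table name.
object InnerQuery {
  def unionAllSelect(columns: Seq[String], table: String): String =
    columns.zipWithIndex
      .map { case (col, idx) => s"SELECT $idx AS idx, `$col` AS value FROM $table" }
      .mkString(" UNION ALL ")

  def main(args: Array[String]): Unit =
    println(unionAllSelect(Seq("value", "country"), "df"))
}
```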