Spark request max count
Question
I'm a beginner with Spark, and I'm trying to write a query that retrieves the most visited web page.
My query is as follows:
mostPopularWebPageDF = (logDF
    .groupBy("webPage")
    .agg(functions.count("webPage").alias("cntWebPage"))
    .agg(functions.max("cntWebPage"))
    .show())
With this query I only get a dataframe containing the max count, but I want a dataframe with both that count and the web page that holds it. Something like:
webPage max(cntWebPage)
google.com 2
How can I fix my problem?
Thanks in advance.
Answer
In pyspark + SQL:
logDF.registerTempTable("logDF")

mostPopularWebPageDF = sqlContext.sql("""
    select webPage, cntWebPage from (
        select webPage, count(*) as cntWebPage,
               max(count(*)) over () as maxcnt
        from logDF
        group by webPage) as tmp
    where tmp.cntWebPage = tmp.maxcnt""")
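As an illustration only (not Spark code), the same SQL can be sanity-checked against an in-memory SQLite database, assuming SQLite 3.25+ for window-function support; the table and column names mirror the example data below:

```python
import sqlite3

# In-memory table standing in for logDF (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("create table logDF (webPage text, usersid integer)")
conn.executemany(
    "insert into logDF values (?, ?)",
    [("google.com", 1), ("google.com", 3), ("bing.com", 10)],
)

# Same shape as the Spark SQL query: count per page, compute the
# overall max of those counts with a window function, keep matches.
rows = conn.execute("""
    select webPage, cntWebPage from (
        select webPage, count(*) as cntWebPage,
               max(count(*)) over () as maxcnt
        from logDF
        group by webPage) as tmp
    where tmp.cntWebPage = tmp.maxcnt""").fetchall()

print(rows)  # [('google.com', 2)]
```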
Maybe it can be made cleaner, but it works; I will try to optimize it.
My result:
webPage cntWebPage
google.com 2
For the dataset:
webPage usersid
google.com 1
google.com 3
bing.com 10
Explanation: the per-page count is computed by grouping with count(*). The maximum over all those counts is computed with a window function, so for the dataset above, the intermediate DataFrame (before dropping the maxCount column) is:
webPage count maxCount
google.com 2 2
bing.com 1 2
Then we select the rows whose count equals maxCount.
I have deleted the DSL version: it does not support window over (), and the ordering was changing the result. Sorry for that bug; the SQL version is correct.
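For intuition, the selection step the SQL performs can be sketched in plain Python (a toy illustration, not Spark code): count per page, take the maximum count, and keep every page that reaches it, which is why ties are handled correctly:

```python
from collections import Counter

# Toy stand-in for the grouped data (illustration only, not Spark).
visits = ["google.com", "google.com", "bing.com"]
counts = Counter(visits)          # per-page count(*)
max_count = max(counts.values())  # max(count(*)) over ()

# Keep every page whose count equals the max; ties are preserved,
# unlike an orderBy + limit(1) approach, which would drop them.
most_popular = [(page, n) for page, n in counts.items() if n == max_count]
print(most_popular)  # [('google.com', 2)]
```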