Spark request max count
Question
I'm a beginner with Spark, and I'm trying to write a query that retrieves the most visited web page.
My query is as follows:
mostPopularWebPageDF = (logDF
    .groupBy("webPage")
    .agg(functions.count("webPage").alias("cntWebPage"))
    .agg(functions.max("cntWebPage"))
    .show())
With this query I only get a dataframe containing the max count, but I want a dataframe with both that count and the web page that holds it. Something like:
webPage max(cntWebPage)
google.com 2
How can I fix my problem?
Thanks in advance.
Answer
In pyspark + SQL:
logDF.registerTempTable("logDF")

mostPopularWebPageDF = sqlContext.sql("""
    select webPage, cntWebPage from (
        select webPage, count(*) as cntWebPage,
               max(count(*)) over () as maxcnt
        from logDF
        group by webPage) as tmp
    where tmp.cntWebPage = tmp.maxcnt""")
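As an illustration only (not Spark code), the same SQL can be sanity-checked against an in-memory SQLite database, assuming SQLite 3.25+ for window-function support; the table and column names mirror the example data below:

```python
import sqlite3

# In-memory table standing in for logDF (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("create table logDF (webPage text, usersid integer)")
conn.executemany(
    "insert into logDF values (?, ?)",
    [("google.com", 1), ("google.com", 3), ("bing.com", 10)],
)

# Same shape as the Spark SQL query: count per page, compute the
# overall max of those counts with a window function, keep matches.
rows = conn.execute("""
    select webPage, cntWebPage from (
        select webPage, count(*) as cntWebPage,
               max(count(*)) over () as maxcnt
        from logDF
        group by webPage) as tmp
    where tmp.cntWebPage = tmp.maxcnt""").fetchall()

print(rows)  # [('google.com', 2)]
```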
Maybe it can be made cleaner, but it works; I will try to optimize it.
My result:
webPage cntWebPage
google.com 2
For the dataset:
webPage usersid
google.com 1
google.com 3
bing.com 10
Explanation: the per-page count is computed by grouping with count(*). The maximum over all those counts is computed with a window function, so for the dataset above, the intermediate DataFrame (before dropping the maxCount column) is:
webPage count maxCount
google.com 2 2
bing.com 1 2
Then we select the rows whose count equals maxCount.
I have deleted the DSL version: it does not support window over (), and the ordering was changing the result. Sorry for that bug; the SQL version is correct.
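For intuition, the selection step the SQL performs can be sketched in plain Python (a toy illustration, not Spark code): count per page, take the maximum count, and keep every page that reaches it, which is why ties are handled correctly:

```python
from collections import Counter

# Toy stand-in for the grouped data (illustration only, not Spark).
visits = ["google.com", "google.com", "bing.com"]
counts = Counter(visits)          # per-page count(*)
max_count = max(counts.values())  # max(count(*)) over ()

# Keep every page whose count equals the max; ties are preserved,
# unlike an orderBy + limit(1) approach, which would drop them.
most_popular = [(page, n) for page, n in counts.items() if n == max_count]
print(most_popular)  # [('google.com', 2)]
```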