A more efficient way of getting the nlargest values of a Pyspark Dataframe

Views: 209 Published: 2021/4/8 19:41:48 Tags: apache-spark, pyspark

Question

I am trying to get the top 5 values of a column of my dataframe.

A sample of the dataframe is given below. The original dataframe has thousands of rows.

Row(item_id=u'2712821', similarity=5.0)
Row(item_id=u'1728166', similarity=6.0)
Row(item_id=u'1054467', similarity=9.0)
Row(item_id=u'2788825', similarity=5.0)
Row(item_id=u'1128169', similarity=1.0)
Row(item_id=u'1053461', similarity=3.0)

The solution I came up with is to sort the whole dataframe and then take the first 5 rows. The code below does that:

items_of_common_users.sort(items_of_common_users.similarity.desc()).take(5)

I am wondering if there is a faster way of achieving this. Thanks.

Recommended answer

You can use the RDD.top method with a key:

from operator import attrgetter

df.rdd.top(5, attrgetter("similarity"))

Converting the DataFrame to an RDD adds significant overhead, but it should be worth it.
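
To illustrate why a top-n can be cheaper than a full sort, here is a rough pure-Python sketch of the idea: keep only the n largest rows from each partition, then merge those small lists. As far as I know, PySpark's RDD.top works along similar lines, but this is not Spark's actual code. The `Row` namedtuple, the two-way `partitions` split, and the `top_n` helper are hypothetical names made up for this example; the data is the sample from the question.

```python
import heapq
from collections import namedtuple
from functools import reduce
from operator import attrgetter

# Stand-in for pyspark.sql.Row (assumption: only attribute access is needed)
Row = namedtuple("Row", ["item_id", "similarity"])

rows = [
    Row("2712821", 5.0),
    Row("1728166", 6.0),
    Row("1054467", 9.0),
    Row("2788825", 5.0),
    Row("1128169", 1.0),
    Row("1053461", 3.0),
]

# Hypothetical split of the data into two "partitions"
partitions = [rows[:3], rows[3:]]


def top_n(partitions, n, key):
    """Return the n largest items across all partitions without a full sort."""
    # Step 1: each partition keeps only its own n largest items
    per_partition = [heapq.nlargest(n, part, key=key) for part in partitions]
    # Step 2: merge the small per-partition lists pairwise, keeping n each time
    return reduce(lambda a, b: heapq.nlargest(n, a + b, key=key), per_partition)


top3 = top_n(partitions, 3, attrgetter("similarity"))
top_similarities = [r.similarity for r in top3]
print(top_similarities)
```

Each partition only ever holds n candidates, so the work per partition is closer to one pass over the data than to a full sort, and very little data has to be merged at the end.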
