How to get distinct rows in a DataFrame using PySpark?


Question

I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and would appreciate your help. Thank you in advance.

I have a temporary DataFrame:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |

What I need is to remove all the redundant items in the host column. In other words, I need to get the final distinct result, like:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |


Recommended answer

If df is the name of your DataFrame, there are two ways to get unique rows:

df2 = df.distinct()

df2 = df.drop_duplicates()
