How to get distinct rows in a DataFrame using PySpark?
Question
I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and would appreciate your help. Thank you in advance:
I have a temporary DataFrame:
+----------------------------+---+
|host |day|
+----------------------------+---+
|in24.inetnebr.com |1 |
|uplherc.upl.com |1 |
|uplherc.upl.com |1 |
|uplherc.upl.com |1 |
|uplherc.upl.com |1 |
|ix-esc-ca2-07.ix.netcom.com |1 |
|uplherc.upl.com |1 |
What I need is to remove all the redundant items in the host column. In other words, I need to get the final distinct result, like:
+----------------------------+---+
|host |day|
+----------------------------+---+
|in24.inetnebr.com |1 |
|uplherc.upl.com |1 |
|ix-esc-ca2-07.ix.netcom.com |1 |
|uplherc.upl.com |1 |
Recommended answer
If df is the name of your DataFrame, there are two ways to get unique rows:
df2 = df.distinct()
or
df2 = df.drop_duplicates()