Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame
Question
Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo', 41, 'US', 3),
                       ('Foo', 39, 'UK', 1),
                       ('Bar', 57, 'CA', 2),
                       ('Bar', 72, 'CA', 2),
                       ('Baz', 22, 'US', 6),
                       ('Baz', 36, 'US', 6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed.
But how do I remove duplicate rows based only on columns 1, 3 and 4? I.e. remove either one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In Python, this could be done by specifying the columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?
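For reference, the behaviour being asked for can be sketched in plain Python without Spark. This toy drop_duplicates (the name mirrors the pandas method, but the function itself is purely illustrative) keeps the first row seen for each combination of the chosen column indices:

```python
def drop_duplicates(rows, subset):
    """Keep the first row seen for each combination of the columns in `subset`."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[i] for i in subset)  # values of the chosen columns
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

data = [('Foo', 41, 'US', 3),
        ('Foo', 39, 'UK', 1),
        ('Bar', 57, 'CA', 2),
        ('Bar', 72, 'CA', 2),
        ('Baz', 22, 'US', 6),
        ('Baz', 36, 'US', 6)]

print(drop_duplicates(data, subset=(0, 2, 3)))  # 4 of the 6 rows survive
```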
Answer
From your question, it is unclear which columns you want to use to determine duplicates. The general idea behind the solution is to build a key from the values of the columns that identify duplicates. Then you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
    # Build a composite key from columns 1, 3 and 4 (indices 0, 2 and 3).
    # A tuple avoids the collisions that plain string concatenation can cause
    # (e.g. ('ab', 'c') and ('a', 'bc') would concatenate to the same string).
    return (x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now you have a key-value RDD that is keyed by columns 1, 3 and 4. The next step would be either a reduceByKey, or a groupByKey and a filter. This would eliminate duplicates.
r = m.reduceByKey(lambda x, y: x)  # keep one row per key
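To check the logic without a Spark cluster, the same map-then-reduce-by-key pipeline can be simulated in plain Python with a dict (a sketch of the idea, not Spark API):

```python
def get_key(x):
    # Composite key from columns 1, 3 and 4 (indices 0, 2 and 3).
    return (x[0], x[2], x[3])

data = [('Foo', 41, 'US', 3), ('Foo', 39, 'UK', 1),
        ('Bar', 57, 'CA', 2), ('Bar', 72, 'CA', 2),
        ('Baz', 22, 'US', 6), ('Baz', 36, 'US', 6)]

# map step: pair each row with its key
pairs = [(get_key(x), x) for x in data]

# reduceByKey(lambda x, y: x): keep the first row encountered for each key
deduped = {}
for key, row in pairs:
    deduped.setdefault(key, row)

result = list(deduped.values())
print(result)  # 4 rows; the duplicate 'Bar' and 'Baz' rows are gone
```

Note that with `lambda x, y: x` the row Spark keeps is simply whichever it reduces first, which is not guaranteed to be deterministic across partitions. On a Spark DataFrame the same result is available in one call via dropDuplicates with a subset of column names, e.g. `df.dropDuplicates(['_1', '_3', '_4'])` (the `_1`-style names assume a tuple RDD converted with the default `toDF()`).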