基于在RDD /星火数据框中特定列从行中删除重复项 [英] Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

查看:649
本文介绍了基于在RDD /星火数据框中特定列从行中删除重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

让我们说我有以下形式一个相当大的数据集:

Let's say I have a rather large dataset in the following form:

data = sc.parallelize([('Foo',41,'US',3),
                       ('Foo',39,'UK',1),
                       ('Bar',57,'CA',2),
                       ('Bar',72,'CA',2),
                       ('Baz',22,'US',6),
                       ('Baz',36,'US',6)])

我想这样做的是去除的基础上只有第一,第三和第四列的值重复的行。

What I would like to do is remove duplicate rows based on the values of the first,third and fourth columns only.

完全去除重复行很简单:

Removing entirely duplicate rows is straightforward:

data = data.distinct()

和任何一列5或行6都将被删除。

and either row 5 or row 6 will be removed

但我怎么只能去除基于3列1,只有4重复行?即删除其中任一一种:

But how do I only remove duplicate rows based on columns 1, 3 and 4 only? i.e. remove either one one of these:

('Baz',22,'US',6)
('Baz',36,'US',6)

在Python中,这可以通过与 .drop_duplicates指定列完成()。我怎样才能达到同样的星火/ Pys​​park?

In Python, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/Pyspark?

推荐答案

从你的问题,它是对要使用,以确定重复的列尚不清楚。该解决方案背后的总体思路是建立基于识别重复列的值的键。然后,您可以使用reduceByKey或减少操作,以消除重复。

From your question, it is unclear as-to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then, you can use the reduceByKey or reduce operations to eliminate duplicates.

下面是一些code,让你开始:

Here is some code to get you started:

def get_key(x):
    return "{0}{1}{2}".format(x[0],x[2],x[3])

m = data.map(lambda x: (get_key(x),x))

现在,你有一个key-value RDD 由列1,3和4键入。
下一步将应该是 reduceByKey groupByKey 过滤器
这将消除重复。

Now, you have a key-value RDD that is keyed by columns 1,3 and 4. The next step would be either a reduceByKey or groupByKey and filter. This would eliminate duplicates.

r = m.reduceByKey(lambda x,y: (x))

这篇关于基于在RDD /星火数据框中特定列从行中删除重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆