Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame
Question
Let's say I have a rather large dataset in the following form:
data = sc.parallelize([('Foo', 41, 'US', 3),
                       ('Foo', 39, 'UK', 1),
                       ('Bar', 57, 'CA', 2),
                       ('Bar', 72, 'CA', 2),
                       ('Baz', 22, 'US', 6),
                       ('Baz', 36, 'US', 6)])
What I would like to do is remove duplicate rows based on the values of the first, third and fourth columns only.
Removing entirely duplicate rows is straightforward:
data = data.distinct()
and either row 5 or row 6 will be removed.
But how do I remove duplicate rows based only on columns 1, 3 and 4? I.e. remove either one of these:
('Baz',22,'US',6)
('Baz',36,'US',6)
In Python, this could be done by specifying the columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?
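For reference, the behaviour being asked for can be sketched in plain Python without Spark. This toy drop_duplicates (the name mirrors the pandas method, but the function itself is purely illustrative) keeps the first row seen for each combination of the chosen column indices:

```python
def drop_duplicates(rows, subset):
    """Keep the first row seen for each combination of the columns in `subset`."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[i] for i in subset)  # values of the chosen columns
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

data = [('Foo', 41, 'US', 3),
        ('Foo', 39, 'UK', 1),
        ('Bar', 57, 'CA', 2),
        ('Bar', 72, 'CA', 2),
        ('Baz', 22, 'US', 6),
        ('Baz', 36, 'US', 6)]

print(drop_duplicates(data, subset=(0, 2, 3)))  # 4 of the 6 rows survive
```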
Answer
From your question, it is unclear which columns you want to use to determine duplicates. The general idea behind the solution is to build a key from the values of the columns that identify duplicates. Then you can use the reduceByKey or reduce operations to eliminate duplicates.
Here is some code to get you started:
def get_key(x):
    # Build a composite key from columns 1, 3 and 4 (indices 0, 2 and 3).
    # A tuple avoids the collisions that plain string concatenation can cause
    # (e.g. ('ab', 'c') and ('a', 'bc') would concatenate to the same string).
    return (x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))
Now you have a key-value RDD that is keyed by columns 1, 3 and 4. The next step would be either a reduceByKey, or a groupByKey and a filter. This would eliminate duplicates.
r = m.reduceByKey(lambda x, y: x)  # keep one row per key
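To check the logic without a Spark cluster, the same map-then-reduce-by-key pipeline can be simulated in plain Python with a dict (a sketch of the idea, not Spark API):

```python
def get_key(x):
    # Composite key from columns 1, 3 and 4 (indices 0, 2 and 3).
    return (x[0], x[2], x[3])

data = [('Foo', 41, 'US', 3), ('Foo', 39, 'UK', 1),
        ('Bar', 57, 'CA', 2), ('Bar', 72, 'CA', 2),
        ('Baz', 22, 'US', 6), ('Baz', 36, 'US', 6)]

# map step: pair each row with its key
pairs = [(get_key(x), x) for x in data]

# reduceByKey(lambda x, y: x): keep the first row encountered for each key
deduped = {}
for key, row in pairs:
    deduped.setdefault(key, row)

result = list(deduped.values())
print(result)  # 4 rows; the duplicate 'Bar' and 'Baz' rows are gone
```

Note that with `lambda x, y: x` the row Spark keeps is simply whichever it reduces first, which is not guaranteed to be deterministic across partitions. On a Spark DataFrame the same result is available in one call via dropDuplicates with a subset of column names, e.g. `df.dropDuplicates(['_1', '_3', '_4'])` (the `_1`-style names assume a tuple RDD converted with the default `toDF()`).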