Finding duplicates in a dataset in Scala

Question

I have a Dataset of String that contains the following data:

12348,5,233,234559,4
12348,5,233,234559,4
12349,6,233,234560,5
12350,7,233,234561,6

I want to find the duplicate rows in the dataset and remove them. In the example, the duplicated row is 12348,5,233,234559,4, and I want to output just a single instance of it.

How do I go about this?

Recommended answer

Dimas answer should work. Here is another solution.

I think (though I'm not positive) that groupBy would hold all of the data in memory, so perhaps this approach would work better for you. It keeps only one count per distinct row.

import scala.io.Source
val source = Source.fromFile("data.txt")  // Assuming the data is in a file
val duplicateCounts: Map[String, Int] =
  try source.getLines()                   // Iterator over the lines of the file
    .foldLeft(Map.empty[String, Int]) {   // Start the fold from an empty Map of row -> count
      (acc, row) => acc + (row -> (acc.getOrElse(row, 0) + 1))  // Increment the count for this row
    }
    .filter { case (_, count) => count > 1 }  // Keep only the rows seen more than once
  finally source.close()                  // Release the file handle once the fold has finished

I'm new to Scala myself, and I actually spent a while on this answer as practice, haha. Folding is confusing at first, but it makes sense!

Think of a Map as a dictionary that stores key/value pairs. In Scala, you add or update a pair by adding it to the Map: Map("b" -> 4) + ("c" -> 2) returns Map(b -> 4, c -> 2). Expanding on that, Map("b" -> 4, "c" -> 2) + ("b" -> 1) returns Map(b -> 1, c -> 2), because the value for the existing key b is overwritten. The default Map is immutable, so each + returns a new Map rather than modifying the old one.

acc (renamed from count for clarity) is the accumulator: a Map that grows as the iterator is folded. For each row, the fold checks whether that row is already a key in the Map (again, think dictionary). getOrElse(row, 0) returns the previous count if the row is there, or 0 if this is the first time the row has been seen. The fold then adds 1 to that number and updates acc with the new pair.
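To make the Map behaviour and the fold concrete, here is a minimal sketch that can be pasted into the Scala REPL or run as a script. The names m1, m2, sampleRows, counts and duplicated are made up for illustration, and the rows are the sample data from the question, inlined in a List instead of being read from a file.

```scala
// Adding a pair to an immutable Map returns a new Map; the original is unchanged.
val m1 = Map("b" -> 4) + ("c" -> 2)  // Map(b -> 4, c -> 2)
val m2 = m1 + ("b" -> 1)             // "b" is overwritten: Map(b -> 1, c -> 2)

// The sample rows from the question (hypothetical in-memory stand-in for the file).
val sampleRows = List(
  "12348,5,233,234559,4",
  "12348,5,233,234559,4",
  "12349,6,233,234560,5",
  "12350,7,233,234561,6"
)

// Same fold as in the answer: acc maps each row to the number of times it has been seen.
val counts = sampleRows.foldLeft(Map.empty[String, Int]) { (acc, row) =>
  acc + (row -> (acc.getOrElse(row, 0) + 1))
}

// Keep only the rows that were seen more than once.
val duplicated = counts.filter { case (_, n) => n > 1 }
// duplicated == Map("12348,5,233,234559,4" -> 2)
```

After the first row, acc holds that row with a count of 1. The second, identical row bumps the count to 2, and the remaining two rows are each added with a count of 1, so only the first row survives the filter.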

Here is the best blog I found for learning folding. The author describes it succinctly and accurately: https://coderwall.com/p/4l73-a/scala-fold-foldleft-and-foldright
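The fold in this answer reports which rows are duplicated. The question also asks to output a single instance of each row, so here is a rough sketch of that part, assuming the rows fit in memory as a plain Scala List (inputRows and uniqueRows are made-up names). If the data is actually a Spark Dataset, which the question does not say, its distinct or dropDuplicates methods would be the analogous calls.

```scala
// Hypothetical in-memory input: the sample rows from the question.
val inputRows = List(
  "12348,5,233,234559,4",
  "12348,5,233,234559,4",
  "12349,6,233,234560,5",
  "12350,7,233,234561,6"
)

// distinct keeps the first occurrence of each row and preserves the original order.
val uniqueRows = inputRows.distinct

// Prints the three remaining rows, one per line.
uniqueRows.foreach(println)
```

distinct is the shortest option when only the deduplicated rows are needed. The fold is the better fit when you also want to know how many times each row occurred.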
