Removing duplicates

Views: 92 Published: 2020/6/12 19:38:53 Tags: scala csv duplicates


Question

I would like to remove duplicates from the data in my CSV file. The first column is the year and the second is the sentence. I want to remove every duplicate sentence, regardless of its year.

Is there a command I can insert in val text = { } to remove these duplicates?

My script is:

val source = CSVFile("science.csv");

val text = {
source ~>                              
Column(2) ~>                           
TokenizeWith(tokenizer) ~>             
TermCounter() ~>                       
TermMinimumDocumentCountFilter(30) ~>  
TermDynamicStopListFilter(10) ~>      
DocumentMinimumLengthFilter(5)         
}

Thanks!

Recommended answer

Essentially, you want a version of distinct that lets you specify what makes an object (a row) unique: here, the second column.

Given the following code (a modified SeqLike.distinct):

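import scala.collection.mutable

type Row = (Int, String)

// Keeps the first row seen for each key; f extracts the key that defines uniqueness.
def distinct(rows: Seq[Row], f: Row => AnyRef): Seq[Row] = {
  val b = Seq.newBuilder[Row]
  val seen = mutable.HashSet[AnyRef]()
  for (x <- rows) {
    val key = f(x)  // compute the key per row, inside the loop
    if (!seen(key)) {
      b += x
      seen += key
    }
  }
  b.result()
}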

If you have a list of rows (where each row is a tuple), you can get the unique ones, keyed on the second column, with:

distinct(rows, (_._2))
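
As a usage sketch, the example below applies the same idea to a few made-up rows. The sample data and the helper name distinctByKey are illustrative assumptions, not part of the original answer; the helper is written generically so that the block stands on its own. It does not touch the val text = { } pipeline from the question, and only shows how the (year, sentence) rows could be deduplicated beforehand.

```scala
import scala.collection.mutable

object DedupExample {
  type Row = (Int, String)

  // Hypothetical helper: keeps the first element seen for each key, preserving order.
  def distinctByKey[A, K](items: Seq[A])(key: A => K): Seq[A] = {
    val seen = mutable.HashSet.empty[K]
    // HashSet.add returns true only when the key was not already present.
    items.filter(item => seen.add(key(item)))
  }

  // Made-up sample data: (year, sentence).
  val rows: Seq[Row] = Seq(
    (2001, "Cells divide."),
    (2003, "Stars collapse."),
    (2005, "Cells divide."),
    (2007, "Stars collapse."),
    (2009, "Ice melts.")
  )

  // Deduplicate on the sentence (second column), ignoring the year.
  val unique: Seq[Row] = distinctByKey(rows)(_._2)

  def main(args: Array[String]): Unit =
    unique.foreach(println)
}
```

With this data, the rows for 2001, 2003 and 2009 are kept, since the first occurrence of each sentence wins. On Scala 2.13 or later, rows.distinctBy(_._2) from the standard library should give the same result without a custom helper.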
