Removing duplicates

Question

I would like to remove duplicates from the data in my CSV file. The first column is the year and the second is the sentence. I want to remove every duplicate of a sentence, regardless of the year.

Is there a command that I can insert in val text = { } to remove these duplicates?

My script is:

val source = CSVFile("science.csv");

val text = {
source ~>                              
Column(2) ~>                           
TokenizeWith(tokenizer) ~>             
TermCounter() ~>                       
TermMinimumDocumentCountFilter(30) ~>  
TermDynamicStopListFilter(10) ~>      
DocumentMinimumLengthFilter(5)         
} 

Thanks!

Recommended answer

Essentially, you want a version of distinct that lets you specify what makes an object (a row) unique: here, the second column.

Given the code (a modified version of SeqLike.distinct):

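import scala.collection.mutable

type Row = (Int, String)

// Keep the first row seen for each key, preserving input order.
def distinct(rows: Seq[Row], f: Row => AnyRef): Seq[Row] = {
  val b = Seq.newBuilder[Row]
  val seen = mutable.HashSet[AnyRef]()
  for (x <- rows) {
    val key = f(x) // the key must be computed per row, inside the loop
    if (!seen(key)) {
      b += x
      seen += key
    }
  }
  b.result()
}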

If you have a list of rows (where each row is a tuple), you can get the unique ones, keyed on the second column, with:

distinct(rows, (_._2))
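
As a rough illustration of the same idea, the sketch below deduplicates a few made-up (year, sentence) rows by sentence. The sample data and the names DedupExample and distinctByKey are hypothetical and not part of the original script. It also shows distinctBy, which the standard library provides on collections from Scala 2.13 onward; on older versions, a small helper like distinctByKey is one way to get the same result.

```scala
import scala.collection.mutable

object DedupExample {
  type Row = (Int, String)

  // Hypothetical helper: keep the first element seen for each key.
  // HashSet.add returns true only when the key was not already present.
  def distinctByKey[A, K](xs: Seq[A])(key: A => K): Seq[A] = {
    val seen = mutable.HashSet.empty[K]
    xs.filter(x => seen.add(key(x)))
  }

  // Made-up sample data: (year, sentence).
  val rows: Seq[Row] = Seq(
    (2001, "cells divide"),
    (2003, "cells divide"),
    (2002, "stars collapse"),
    (2005, "cells divide")
  )

  // Works on older Scala versions as well.
  val uniqueHelper: Seq[Row] = distinctByKey(rows)(_._2)

  // Scala 2.13+ only: built-in equivalent.
  val uniqueStd: Seq[Row] = rows.distinctBy(_._2)

  def main(args: Array[String]): Unit = {
    uniqueHelper.foreach(println)
    // (2001,cells divide)
    // (2002,stars collapse)
  }
}
```

Both variants keep the first occurrence of each sentence in input order, so the year that survives is the one on the earliest matching row.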
