Removing duplicates
Question
I would like to remove duplicates from the data in my CSV file. The first column is the year, and the second is the sentence. I want to remove every duplicate of a sentence, regardless of its year.
Is there a command that I can insert in val text = { } to remove these duplicates?
My script is:
val source = CSVFile("science.csv");
val text = {
source ~>
Column(2) ~>
TokenizeWith(tokenizer) ~>
TermCounter() ~>
TermMinimumDocumentCountFilter(30) ~>
TermDynamicStopListFilter(10) ~>
DocumentMinimumLengthFilter(5)
}
Thanks!
Recommended answer
Essentially, you want a version of distinct that lets you specify what makes an object (a row) unique: here, the second column.
Given this code (a modified version of SeqLike.distinct):
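import scala.collection.mutable

type Row = (Int, String)

// Keep the first row seen for each key; f extracts the key that defines uniqueness.
def distinct(rows: Seq[Row], f: Row => AnyRef): Seq[Row] = {
  val b = Seq.newBuilder[Row]
  val seen = mutable.HashSet[AnyRef]()
  for (x <- rows) {
    val key = f(x) // computed per row, inside the loop
    if (!seen(key)) {
      b += x
      seen += key
    }
  }
  b.result()
}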
If you have a list of rows, where each row is a (year, sentence) tuple, you can keep only the rows that are unique by the second column with:
distinct(rows, (_._2))
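To make the behavior concrete, here is a minimal self-contained sketch. The sample rows are hypothetical stand-ins for the (year, sentence) pairs in the CSV, and the sketch uses the standard library's distinctBy, which assumes Scala 2.13 or later, instead of a hand-written helper. It does not show how to wire the result into the val text = { ... } pipeline.

```scala
object DedupExample {
  type Row = (Int, String)

  // Hypothetical sample data standing in for the (year, sentence) rows of the CSV.
  val rows: Seq[Row] = Seq(
    (1999, "the cell divides"),
    (2001, "the cell divides"),
    (2001, "energy is conserved"),
    (2005, "energy is conserved"),
    (2010, "light bends")
  )

  // Keep the first row for each distinct sentence; the year is ignored.
  // distinctBy is available from Scala 2.13 onward.
  val unique: Seq[Row] = rows.distinctBy(_._2)

  def main(args: Array[String]): Unit =
    unique.foreach(println)
}
```

As with the hand-written version, the first occurrence of each sentence is the one that survives, so the year attached to a kept sentence is the earliest one in file order, and the remaining rows stay in their original order.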