Identifying near duplicate entries using synonyms in R

Views: 288 · Published: 2017/7/21 18:44:19 · Tags: r, duplicate-removal, synonym, duplicates


Problem description

I am trying to identify near duplicate entries of names in a database. I am new to databases, but I am familiar with R. I can get clusters of near duplicates using fuzzy matching and soundex in R. However, several names are synonyms of each other, and I would like to cluster the names based on this criterion along with the ones above.

I want to do what is suggested in "Techniques for finding near duplicate records", but with synonyms. I understand there is a database of synonyms for English words called WordNet, in which sets of synonyms are called synsets. However, the entries in the names field are in different formats and languages.

For example, if I know that "R version 3.0.3" and "Warm Puppy" are synonyms, I want to be able to use custom synsets such as syn1 <- c("R version 3.0.3", "Warm Puppy") when clustering near duplicates.
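To make the idea concrete, here is a minimal sketch of one way custom synsets could be combined with fuzzy matching, using only base R. It is an illustration under assumptions, not an established method: the `entries` vector, the `synsets` list, the lower-casing step, and the cut height of 2 are all made up for the example. The distance between two names is the edit distance from `adist()`, forced to 0 when both names belong to the same synset, and single-linkage clustering then chains typos and synonyms together.

```r
# Hypothetical custom synsets: each element lists names meaning the same thing.
synsets <- list(
  syn1 = c("R version 3.0.3", "Warm Puppy")
)

# Made-up example entries, including a typo and a near-identical version string.
entries <- c("R version 3.0.3", "Warm Puppy", "warm pupy",
             "R version 3.0.2", "Python 2.7")

# Simple normalisation (an assumption; real data may need more cleaning).
normalise <- function(x) tolower(trimws(x))
names_norm <- normalise(entries)

# Pairwise generalized Levenshtein distances between the normalised names.
d <- adist(names_norm)

# Synset index of each entry, NA when the entry is in no synset.
syn_id <- rep(NA_integer_, length(entries))
for (i in seq_along(synsets)) {
  syn_id[names_norm %in% normalise(synsets[[i]])] <- i
}

# Treat members of the same synset as distance 0 from each other.
same_syn <- outer(syn_id, syn_id, "==")
same_syn[is.na(same_syn)] <- FALSE
d[same_syn] <- 0

# Single linkage lets a typo of one synonym join the whole synset cluster.
hc <- hclust(as.dist(d), method = "single")
cluster <- cutree(hc, h = 2)  # cut height 2 is an arbitrary choice here

result <- data.frame(entry = entries, cluster = cluster)
print(result)
```

Single linkage is chosen here so that a misspelling such as "warm pupy" can reach "R version 3.0.3" through "Warm Puppy". The trade-off is that long chains of small differences can merge unrelated names, so the cut height would need tuning on real data.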

Down the road, I would also like to separate homonyms within clusters based on entries in other fields (columns) of a record.
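One possible way to approach the homonym case, sketched with base R only: cluster on the name first, then split each name cluster by the value of another column. The `records` data frame, its `category` column, and the cut height are hypothetical and exist only for illustration.

```r
# Made-up records: "Mercury" appears both as a planet and as an element.
records <- data.frame(
  name     = c("Mercury", "mercury", "Mercury", "Venus"),
  category = c("planet", "planet", "element", "planet"),
  stringsAsFactors = FALSE
)

# Step 1: cluster on the name alone (edit distance, single linkage).
d <- adist(tolower(records$name))
name_cluster <- cutree(hclust(as.dist(d), method = "single"), h = 1)

# Step 2: split each name cluster by the other field, so that records
# sharing a name but differing in category end up in different clusters.
key <- paste(name_cluster, records$category, sep = "|")
records$cluster <- match(key, unique(key))

print(records)
```

Exact matching on the second field is the simplest choice. If that field is itself noisy, it could be clustered in the same way before being used to split.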

Is there any method to implement this in R?
