Remove duplicates from a list based on semantic similarity/relatedness


Problem description

R + tm: How do I de-duplicate the items in a list based on semantic similarity? Given v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv"), my expected result is c("bank", "ford_suv", "toyota_suv", "nissan_suv"). That is, bank, banks and banking should be reduced to the single term "bank". SnowBall::stemming is not an option because I have to retain the flavor of the newspaper styles of various countries. Any help or direction would be useful.

Solution

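We could calculate the Levenshtein distance between words using adist and regroup them into clusters using hclust. Here v is the extended list of 12 terms taken from the comments:

v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")
d <- adist(v)
rownames(d) <- v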

This gives a matrix of distances between terms:

#              [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#bank              0    1    3    8    9    8    2   13    6     5     3     4
#banks             1    0    3    7    9    7    2   13    6     6     2     5
#banking           3    3    0    8   10    8    3   13    7     6     3     7
#ford_suv          8    7    8    0    5    6    8   12    7     7     8     4
#toyota_suv        9    9   10    5    0    6    9    7    4     9     9     9
#nissan_suv        8    7    8    6    6    0    8   13   10     4     8    10
#banker            2    2    3    8    9    8    0   12    6     6     1     6
#toyota_corolla   13   13   13   12    7   13   12    0    8    13    12    12
#toyota            6    6    7    7    4   10    6    8    0     6     7     5
#nissan            5    6    6    7    9    4    6   13    6     0     7     6
#bankers           3    2    3    8    9    8    1   12    7     7     0     6
#ford              4    5    7    4    9   10    6   12    5     6     6     0
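
To make the matrix easier to read, a few of its entries can be checked directly. This is only a small sketch; the pairs below are picked for illustration and the variable name m is an assumption, not part of the original answer.

```r
# adist() lives in the utils package, which is attached by default.
# Pairwise edit distances for a few of the terms above.
adist("bank", "banks")      # one insertion ("s")
adist("bank", "banking")    # three insertions ("i", "n", "g")
adist("ford_suv", "ford")   # four deletions ("_suv")

# With two vectors, rows correspond to x and columns to y.
m <- adist(c("bank", "ford"), c("banks", "ford_suv"))
dimnames(m) <- list(c("bank", "ford"), c("banks", "ford_suv"))
m
```

Small distances mean the strings are spelled alike, which is what the clustering step relies on.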

Then we can pass it to hclust using method = "ward.D":

cl <- hclust(as.dist(d), method = "ward.D")
plot(cl)

Which gives a dendrogram of the 12 terms (image not reproduced here).

We notice 4 distinct clusters, which we can highlight using rect.hclust(cl, 4).
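
Without the plot, the cluster membership can also be inspected as text. The following is a minimal sketch that rebuilds the tree from the 12 terms and cuts it with cutree; the variable name groups is an assumption, and k = 4 is the value read off the dendrogram rather than something the code derives.

```r
# Rebuild the distance matrix and the tree so the snippet runs on its own.
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")
d <- adist(v)
rownames(d) <- v
cl <- hclust(as.dist(d), method = "ward.D")

# Cut the tree into k = 4 groups; the result is a named integer vector.
groups <- cutree(cl, k = 4)

# List the terms that fall into each group.
split(names(groups), groups)
```

cutree also accepts a height via its h argument instead of k, which may be handy when the number of clusters is not known in advance.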

Now we can turn this result into a data.frame and tag each cluster with its shortest term:

library(dplyr)
data.frame(group = cutree(cl, 4)) %>%
  tibble::rownames_to_column("term") %>%
  group_by(group) %>%
  # which.min() keeps a single tag even if two terms tie for shortest
  mutate(tag = term[which.min(nchar(term))])

Which gives:

#Source: local data frame [12 x 3]
#Groups: group [4]
#
#             term group      tag
#            <chr> <int>    <chr>
#1            bank     1     bank
#2           banks     1     bank
#3         banking     1     bank
#4        ford_suv     2     ford
#5      toyota_suv     3   toyota
#6      nissan_suv     4   nissan
#7          banker     1     bank
#8  toyota_corolla     3   toyota
#9          toyota     3   toyota
#10         nissan     4   nissan
#11        bankers     1     bank
#12           ford     2     ford

If we only want the unique tag for each cluster, we can append ... %>% distinct(tag) %>% .$tag to the pipe, which gives:

#[1] "bank"   "ford"   "toyota" "nissan"
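
For reuse, the whole pipeline can be wrapped in one function. The sketch below uses base R only; dedupe_terms is a hypothetical helper name that does not appear in the answer, and the number of clusters k still has to be chosen by the caller.

```r
# Hypothetical helper (not from the answer): cluster terms by edit distance
# and return the shortest term of each cluster.
dedupe_terms <- function(terms, k) {
  d <- adist(terms)
  rownames(d) <- terms
  cl <- hclust(as.dist(d), method = "ward.D")
  groups <- cutree(cl, k = k)
  # which.min() returns the first shortest term, so ties cannot yield two tags.
  tags <- tapply(terms, groups, function(x) x[which.min(nchar(x))])
  unname(as.character(tags))
}

v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")
res <- dedupe_terms(v, k = 4)
res
```

Because the grouping is driven by spelling rather than meaning, it is worth checking the clusters by eye before trusting the tags on new data.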


Reference

?adist

The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly weighted number of insertions, deletions and substitutions needed to transform s into t (so that the transformation exactly matches t).
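
The "possibly weighted" part of that definition refers to the costs argument of adist. As a small sketch, with cost values chosen arbitrarily for illustration, insertions can be made more expensive than deletions:

```r
# Arbitrary illustrative weights: an insertion costs twice as much as
# a deletion or a substitution.
costs <- c(insertions = 2, deletions = 1, substitutions = 1)

# "bank" -> "banks" needs one insertion; "banks" -> "bank" needs one deletion.
ins <- adist("bank", "banks", costs = costs)
del <- adist("banks", "bank", costs = costs)
ins
del
```

With unequal insertion and deletion costs the distance matrix is no longer symmetric, and as.dist only reads the lower triangle, so for clustering with hclust it is probably safest to keep those two costs equal.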

?hclust

This function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.


Note: I used data provided by @Abdou in the comments as it represents a more complete use case
