Select similar sentences

Problem description

If I have a set of sentences and want to extract the duplicates, I can do it as in the following example:

sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brought this upon you, my",
               "So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brought this upon you, my")

sentences[duplicated(sentences)]

This returns:

[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"        
[3] "I'm sorry I brought this upon you, my"

In my case, however, the sentences are only similar to each other (because of typos, for example), and I would like to select the ones that are most similar to each other. For example:

sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

In this example, I would like to select one sentence from each of the following pairs:

I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my

Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday

So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll

The levenshteinSim function in the RecordLinkage package could help me:

library(RecordLinkage)


levenshteinSim(sentences[1],sentences[2])
levenshteinSim(sentences[1],sentences[3])
levenshteinSim(sentences[1],sentences[4])
levenshteinSim(sentences[1],sentences[5])
levenshteinSim(sentences[1],sentences[6])

levenshteinSim(sentences[2],sentences[3])
levenshteinSim(sentences[2],sentences[4])
levenshteinSim(sentences[2],sentences[5])
levenshteinSim(sentences[2],sentences[6])

and so on. levenshteinSim returns values near 1 for the most similar sentences. I could write a double for loop and select, for example, the pairs of sentences whose Levenshtein similarity is greater than 0.7. But isn't there a simpler way of doing this?

Recommended answer

You could calculate an approximate string distance matrix using adist, which is based on a generalized Levenshtein distance, and then do hierarchical clustering on it using hclust.

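# Pairwise generalized Levenshtein distances, ignoring case
ld <- adist(tolower(sentences))
# Hierarchical clustering on the distance matrix
hc <- hclust(as.dist(ld))
# Cut the tree at height 10 to assign each sentence to a cluster
data.frame(x = sentences, cl = cutree(hc, h = 10))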
#                                                       x cl
# 1 So there I was at the mercy of three monstrous trolls  1
# 2         Today is my One Hundred and Eleventh birthday  2
# 3                   I'm sorry I brrrought this upon, my  3
# 4      So there I was at mercy of three monstrous troll  1
# 5                Today is One Hundred Eleventh birthday  2
# 6                 I'm sorry I brought this upon you, my  3
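
Since the question asks for one sentence from each pair, the cluster labels can be used to keep a single representative per cluster. The following is only a sketch: the names cl and representatives are introduced here for illustration, and keeping the first sentence seen in each cluster is an arbitrary choice, not something the answer prescribes.

```r
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Same clustering as above: edit distances, then hierarchical clustering
ld <- adist(tolower(sentences))
hc <- hclust(as.dist(ld))

# Cluster label for each sentence (cl is an illustrative name)
cl <- cutree(hc, h = 10)

# Keep the first sentence encountered in each cluster
representatives <- sentences[!duplicated(cl)]
representatives
```

If a different representative is preferred (the longest sentence in each cluster, say), the same cl vector can be passed to tapply or split to choose it per group.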

To find an appropriate value for h in cutree, we can plot the dendrogram and mark the chosen cut height (here h = 10) with a dashed line.

plot(hc)
abline(h=10, col=2, lty=2)
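
The 0.7 similarity threshold mentioned in the question can also be applied without a double for loop. The sketch below assumes that similarity is defined as 1 minus the edit distance divided by the length of the longer string, which is how levenshteinSim is understood to normalize; it uses only base R, and the names n, sim and pairs are illustrative.

```r
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Pairwise edit distances, ignoring case
ld <- adist(tolower(sentences))

# Length of the longer string for every pair of sentences
n <- nchar(sentences)
longest <- outer(n, n, pmax)

# Normalized similarity in [0, 1]; 1 means identical (assumed definition)
sim <- 1 - ld / longest

# Index pairs above the threshold, counting each pair only once
pairs <- which(sim > 0.7 & upper.tri(sim), arr.ind = TRUE)
pairs
```

For the example data this should pick out sentence pairs 1 and 4, 2 and 5, and 3 and 6. Unlike a fixed cut height on raw distances, a normalized threshold does not depend on how long the sentences are.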
