Select similar sentences
Question
If I have a set of sentences and want to extract the duplicates, I can proceed as in the following example:
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my",
"So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brought this upon you, my")
sentences[duplicated(sentences)]
This returns:
[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"
[3] "I'm sorry I brought this upon you, my"
In my case, however, the sentences are only similar to each other (because of typos, for example), and I would like to select the ones that are most similar to each other. For example:
sentences<-c("So there I was at the mercy of three monstrous trolls",
"Today is my One Hundred and Eleventh birthday",
"I'm sorry I brrrought this upon, my",
"So there I was at mercy of three monstrous troll",
"Today is One Hundred Eleventh birthday",
"I'm sorry I brought this upon you, my")
In this example, I would like to select one sentence from each of the following pairs:
I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my
Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday
So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll
The levenshteinSim function in the RecordLinkage package could help me:
library(RecordLinkage)
levenshteinSim(sentences[1],sentences[2])
levenshteinSim(sentences[1],sentences[3])
levenshteinSim(sentences[1],sentences[4])
levenshteinSim(sentences[1],sentences[5])
levenshteinSim(sentences[1],sentences[6])
levenshteinSim(sentences[2],sentences[3])
levenshteinSim(sentences[2],sentences[4])
levenshteinSim(sentences[2],sentences[5])
levenshteinSim(sentences[2],sentences[6])
And so on. These calls return values near 1 for the most similar sentences. I could write a double for loop and select, for example, the pairs of sentences whose Levenshtein similarity is greater than 0.7. But isn't there a simpler way of doing this?
Recommended answer
You could compute an approximate string distance matrix with adist, which is based on a generalized Levenshtein distance, and then run hierarchical clustering on it with hclust.
ld <- adist(tolower(sentences))
hc <- hclust(as.dist(ld))
data.frame(x=sentences, cl=cutree(hc, h=10))
# x cl
# 1 So there I was at the mercy of three monstrous trolls 1
# 2 Today is my One Hundred and Eleventh birthday 2
# 3 I'm sorry I brrrought this upon, my 3
# 4 So there I was at mercy of three monstrous troll 1
# 5 Today is One Hundred Eleventh birthday 2
# 6 I'm sorry I brought this upon you, my 3
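Since the question asks for one sentence from each pair, the cluster labels can be used to keep a single representative per cluster. The following is a minimal sketch, not part of the original answer: the name `representatives` is made up for illustration, and keeping the first sentence seen in each cluster is an arbitrary choice.

```r
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Same steps as above: edit distances, clustering, cut at height 10.
ld <- adist(tolower(sentences))
hc <- hclust(as.dist(ld))
cl <- cutree(hc, h = 10)

# Keep the first sentence of each cluster (arbitrary rule, assumed here).
representatives <- sentences[!duplicated(cl)]
representatives
```

Another rule, such as keeping the longest sentence in each cluster, could be swapped in if the first occurrence is not the one you want.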
To find an appropriate value for h in cutree, we can plot the dendrogram and mark the candidate cut height on it (here h=10, the value used above).
plot(hc)
abline(h=10, col=2, lty=2)
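If you would rather keep the 0.7 similarity threshold from the question than choose a cut height, the double for loop can be avoided by working on the whole distance matrix at once. This is a sketch using base R only. It assumes that dividing the edit distance by the length of the longer string is an acceptable similarity measure (which, as far as I know, is how levenshteinSim normalizes), and the names `n_chars`, `sim` and `pairs` are illustrative.

```r
sentences <- c("So there I was at the mercy of three monstrous trolls",
               "Today is my One Hundred and Eleventh birthday",
               "I'm sorry I brrrought this upon, my",
               "So there I was at mercy of three monstrous troll",
               "Today is One Hundred Eleventh birthday",
               "I'm sorry I brought this upon you, my")

# Pairwise similarity: 1 - edit distance / length of the longer sentence.
n_chars <- nchar(sentences)
sim <- 1 - adist(tolower(sentences)) / outer(n_chars, n_chars, pmax)

# Index pairs above the threshold; upper.tri() drops the diagonal
# and the mirrored lower triangle so each pair appears once.
pairs <- which(sim > 0.7 & upper.tri(sim), arr.ind = TRUE)

data.frame(a   = sentences[pairs[, "row"]],
           b   = sentences[pairs[, "col"]],
           sim = sim[pairs])
```

Unlike the clustering approach, this returns pairs rather than groups, so a sentence with several near-duplicates shows up in more than one row.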