Fuzzy matching two strings using R

Problem description

I have two vectors, each containing a series of strings. For example:

V1=c("pen", "document folder", "warn")
V2=c("pens", "copy folder", "warning")

I need to find which pairs of strings match best. I used Levenshtein distance directly, but it is not good enough. In my case, pen and pens should mean the same thing, document folder and copy folder are probably the same thing, and warn and warning are effectively the same. I have been trying packages such as tm, but I am not sure which functions are suitable for this. Can anyone point me in the right direction?

Recommended answer

In my experience, cosine distance works well for this kind of job:

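library(stringdist)  # provides stringdist()

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")
# One column per V1 string, one row per V2 string; q = 1 compares single characters
result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 1))
rownames(result) <- V2
result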
                  pen document folder      warn
copy folder 0.6797437       0.2132042 0.8613250
warning     0.6150998       0.7817821 0.1666667
pens        0.1339746       0.6726732 0.7500000
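
To make the numbers in this table less of a black box, here is a minimal base R sketch of a cosine distance over character q-grams. It is not the stringdist implementation, only my understanding of what the cosine method measures; the helper names `qgrams` and `cosine_qgram_dist` are made up for illustration. It does reproduce the pen/pens and warn/warning values shown above.

```r
# Illustrative helpers (hypothetical names, base R only, no packages needed).

# Split a string into its overlapping character q-grams.
qgrams <- function(s, q = 1) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# Cosine distance between the q-gram count vectors of two strings:
# 1 - (a . b) / (|a| * |b|). Returns NaN if either string has no q-grams.
cosine_qgram_dist <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  grams <- union(ga, gb)
  # Count each q-gram in both strings over the same set of levels
  va <- as.numeric(table(factor(ga, levels = grams)))
  vb <- as.numeric(table(factor(gb, levels = grams)))
  1 - sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

cosine_qgram_dist("pen", "pens", q = 1)      # about 0.134
cosine_qgram_dist("warn", "warning", q = 1)  # about 0.167
```

Because only q-gram counts are compared, extra characters such as a plural "s" or an "-ing" suffix shift the vectors slightly instead of adding a full edit each, which is why these pairs score as close.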

You need to define a cutoff below which a distance counts as a match; the lower the distance, the better the match. You can also experiment with the q parameter, which sets how many consecutive characters (q-grams) are compared with each other. For example:

result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 3))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 1.0000000       0.5377498 1.0000000
warning     1.0000000       1.0000000 0.3675445
pens        0.2928932       1.0000000 1.0000000
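
One way to apply a cutoff is to take the closest V2 string for each V1 string and keep it only if its distance is small enough. The sketch below uses base R only and hard-codes the q = 1 distances printed earlier so that it runs on its own; the cutoff of 0.3 is an arbitrary value picked for illustration, not a recommendation.

```r
# Distances copied from the q = 1 output above (rows: V2, columns: V1).
result <- matrix(
  c(0.6797437, 0.6150998, 0.1339746,   # column "pen"
    0.2132042, 0.7817821, 0.6726732,   # column "document folder"
    0.8613250, 0.1666667, 0.7500000),  # column "warn"
  nrow = 3,
  dimnames = list(c("copy folder", "warning", "pens"),
                  c("pen", "document folder", "warn"))
)

cutoff <- 0.3  # hypothetical threshold; tune it on your own data

best_row  <- apply(result, 2, which.min)  # row index of the closest V2 string
best_dist <- apply(result, 2, min)        # the corresponding distance

matches <- data.frame(
  V1 = colnames(result),
  # Keep the closest candidate only when it is within the cutoff
  V2 = ifelse(best_dist <= cutoff, rownames(result)[best_row], NA_character_),
  distance = best_dist,
  row.names = NULL
)
matches
```

With these numbers, pen pairs with pens, document folder with copy folder, and warn with warning. With the q = 3 distances the same cutoff would reject document folder / copy folder (0.5377498) and warn / warning (0.3675445), so the cutoff and q have to be chosen together.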
