Calculating similarity between two vectors/strings in R


Question

A similar question may already have been asked on this forum, but I think my requirement is unusual. I have a data frame df1 with a variable "WrittenTerms" containing 40,000 observations, and another data frame df2 with a variable "suggestedterms" containing 17,000 observations.

I need to calculate the similarity between each "WrittenTerms" entry and the "suggestedterms" entries.

df1$WrittenTerms

headache
lung cancer
abdominal pain

df2$suggestedterms

aerobic exercise
breast cancer
abdominal pain
headache
lung cancer

I need to get output like the following:

df1$WrittenTerms    df2$suggestedterms    Similarity_percentage
headache            headache              50%
lung cancer         lung cancer           100%
abdominal pain      abdominal pain        80%

I wrote the code below to do this, but it is slow because it uses a for loop. Is there a way to compute the similarity with TF-IDF, or any other approach that takes less time?

library(stringdist)  # provides stringdist()

df_list <- data.frame(check.names = FALSE)  # empty data frame to collect the best matches

# For each written term, score every suggested term and keep the best one.
for (i in df1$WrittenTerms) {
  cand <- df2  # work on a copy so df2 is not overwritten inside the loop
  cand$oldsim <- stringdist(i, cand$suggestedterms, method = "lv")
  cand$oldsim <- 1 - cand$oldsim / nchar(as.character(cand$suggestedterms))
  df_list <- rbind(df_list, head(cand[order(cand$oldsim, decreasing = TRUE), ], 1))
}

df1 <- cbind(df1, df_list)

Recommended answer

The adist function (in the utils package that ships with R) computes Levenshtein distances between two character vectors and returns a matrix with one distance for each pair of entries. You could write a function that converts the Levenshtein distance into your similarity score:

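# adist() returns a length(x) by length(y) matrix; sweep() divides column j by nchar(y[j]).
my_dist <- function(x, y) 1 - sweep(adist(x, y), 2, nchar(as.character(y)), "/")
x <- my_dist(df1$WrittenTerms, df2$suggestedterms)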
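To make the shape of the result concrete, here is a minimal sketch on toy data. The two vectors `written` and `suggested` are made up for illustration and are not from the question. It uses sweep() so that each column j of the distance matrix is divided by the length of the j-th suggested term; plain matrix-by-vector division in R recycles down the columns instead, which would not line up with the suggested terms.

```r
# Toy data (hypothetical values, chosen only for illustration)
written   <- c("head ache", "lung cancr")
suggested <- c("headache", "lung cancer", "breast cancer")

# Levenshtein distances: one row per written term, one column per suggested term
d <- adist(written, suggested)

# Divide column j by nchar(suggested[j]) to turn distances into similarity scores
sim <- 1 - sweep(d, 2, nchar(suggested), "/")

dim(sim)   # number of written terms by number of suggested terms
sim[1, 1]  # "head ache" vs "headache": one deletion over 8 characters
```

A score of 1 means the two strings are identical; the score can drop below 0 when the distance exceeds the length of the suggested term.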

Now take the maximum of the metric in each row of x; its position identifies the best suggestedterm for each WrittenTerms entry:

mx <- apply(x, 1, function(y) {mx <- which.max(y); c(y[mx], mx)})
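The layout of `mx` is easy to misread, so here is a small sketch on a hand-written similarity matrix (the numbers are invented for illustration). Because apply() binds the per-row results as columns, the result has two rows: the first holds the best score for each written term, the second holds the column index of the matching suggested term.

```r
# Hypothetical 2 x 3 similarity matrix: 2 written terms, 3 suggested terms
x <- matrix(c(0.2, 0.9, 0.5,
              1.0, 0.1, 0.3), nrow = 2, byrow = TRUE)

# For each row, return the maximum score and the column where it occurs
mx <- apply(x, 1, function(y) { i <- which.max(y); c(y[i], i) })

mx[1, ]  # best score per written term
mx[2, ]  # index of the best suggested term per written term
```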

Your final desired data frame could then be constructed as follows:

data.frame(Written.Terms = df1$WrittenTerms, 
           suggestedterms = df2$suggestedterms[mx[2, ]], 
           Similarity_percentage = mx[1, ])
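One practical caveat at the sizes mentioned in the question: a full 40,000 by 17,000 matrix holds 680 million values, roughly 5.4 GB as doubles, so it may not fit in memory. A possible workaround, sketched here under assumptions, is to process the written terms in chunks and keep only the best match per row. The helper name `best_match`, its `chunk_size` parameter, and the toy input are hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): best match per written term,
# computed chunk by chunk so that only a small similarity matrix exists at a time.
best_match <- function(written, suggested, chunk_size = 1000) {
  written   <- as.character(written)
  suggested <- as.character(suggested)
  len       <- nchar(suggested)
  # Split the indices of `written` into consecutive chunks
  chunks <- split(seq_along(written), ceiling(seq_along(written) / chunk_size))
  res <- lapply(chunks, function(idx) {
    # Similarity matrix for this chunk: length(idx) rows, length(suggested) columns
    sim  <- 1 - sweep(adist(written[idx], suggested), 2, len, "/")
    best <- max.col(sim, ties.method = "first")  # column index of each row maximum
    data.frame(Written.Terms = written[idx],
               suggestedterms = suggested[best],
               Similarity_percentage = 100 * sim[cbind(seq_along(idx), best)])
  })
  out <- do.call(rbind, res)
  rownames(out) <- NULL
  out
}

# Toy input (made up for illustration); chunk_size = 1 forces two chunks
res <- best_match(c("head ache", "lung cancr"),
                  c("headache", "lung cancer", "breast cancer"),
                  chunk_size = 1)
print(res)
```

The chunk size trades memory for the number of adist() calls; the total amount of distance computation is unchanged, so this addresses memory use rather than run time.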
