String matching to estimate similarity


Problem description

I want to analyse a text field up to 100 characters long and estimate the percent similarity between responses. For example, for the same question, "What's your opinion on smartphones?", the answers are:

Person A: "Best way to waste money"

Person B: "Amazing stuff. Lets you stay connected all the time"

Person C: "Instrument to waste money and time"

Of these, just by matching individual words, A and C sound similar. I am trying to do something like this in R to start with, and later extend it to match combinations of words like "Best", "Best way", "Best way waste", etc. I am new to text analysis and R, and I don't know the proper names of these methods, so I can't search for them effectively.

Please guide me with your inputs and references. Thanks in advance.

Recommended answer

Here is a potential solution for computing percent similarity by hand.

a <- "Best way to waste money"
b <- "Amazing stuff. lets you stay connected all the time"
c <- "Instrument to waste money and time"

# Drop information that presumably isn't important (capital letters,
# punctuation), then split the string into separate words.
clean.split <- function(string1){
  lower <- tolower(string1)
  no.punct <- gsub("[[:punct:]]", "", lower)
  split <- strsplit(no.punct, split=" ")
  return(split)
}

a <- clean.split(a)
b <- clean.split(b)
c <- clean.split(c)

# How similar is string 1 to string 2?
# NOTE: the order is important, i.e. sim.per(b, c) is different from
# sim.per(c, b), because the denominator is the word count of str1 only.
sim.per <- function(str1, str2){
  sim <- length(intersect(str1[[1]], str2[[1]])) # number of distinct words found in both
  total <- length(str1[[1]])
  per <- sim/total
  return(per)
}

# test
sim.per(b, c)
sim.per(a, c)
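
Since the score above depends on argument order, a symmetric alternative may be easier to compare across pairs. The sketch below uses Jaccard similarity (shared distinct words divided by all distinct words in either answer), which is a different measure from the one above, not a correction of it. The names `tokenize_words`, `jaccard_sim`, and `answers` are hypothetical and chosen only for this illustration.

```r
# Hypothetical helper: lowercase, strip punctuation, split on whitespace.
tokenize_words <- function(text) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words[nzchar(words)]  # drop empty strings left by leading whitespace
}

# Jaccard similarity: distinct shared words / distinct words in either text.
# Symmetric, so jaccard_sim(x, y) equals jaccard_sim(y, x).
jaccard_sim <- function(text1, text2) {
  w1 <- unique(tokenize_words(text1))
  w2 <- unique(tokenize_words(text2))
  all_words <- union(w1, w2)
  if (length(all_words) == 0) return(0)
  length(intersect(w1, w2)) / length(all_words)
}

answers <- c(
  A = "Best way to waste money",
  B = "Amazing stuff. lets you stay connected all the time",
  C = "Instrument to waste money and time"
)

# Pairwise similarity matrix over all answers.
sim_matrix <- sapply(answers, function(x) {
  sapply(answers, function(y) jaccard_sim(x, y))
})
round(sim_matrix, 2)
```

With these three answers, A and C share "to", "waste", and "money" out of 8 distinct words, giving 0.375, while A and B share nothing.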

I hope that helps! To match combinations of words, you would need to do some more work. Try editing the question to show exactly what you're looking for, and you might have more luck getting an answer.
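
One common name for "combinations of words" is word n-grams: single words are unigrams, adjacent pairs are bigrams, and so on. The following is a minimal base R sketch, assuming that only contiguous word sequences count (so "best way" qualifies, but "best way waste" would not unless "to" is removed first). The functions `word_ngrams` and `ngram_sim` are hypothetical names for this illustration, and the Jaccard ratio is just one possible scoring choice.

```r
# Hypothetical helper: all contiguous word n-grams of length 1..max_n.
word_ngrams <- function(text, max_n = 2) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words <- words[nzchar(words)]
  grams <- character(0)
  for (n in seq_len(max_n)) {
    if (length(words) < n) break  # text too short for n-grams of this size
    starts <- seq_len(length(words) - n + 1)
    grams <- c(grams, vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  unique(grams)
}

# Share of distinct n-grams common to both texts (Jaccard over n-gram sets).
ngram_sim <- function(text1, text2, max_n = 2) {
  g1 <- word_ngrams(text1, max_n)
  g2 <- word_ngrams(text2, max_n)
  all_grams <- union(g1, g2)
  if (length(all_grams) == 0) return(0)
  length(intersect(g1, g2)) / length(all_grams)
}

text_a <- "Best way to waste money"
text_c <- "Instrument to waste money and time"

# N-grams the two answers have in common.
shared <- intersect(word_ngrams(text_a), word_ngrams(text_c))
shared

ngram_sim(text_a, text_c)
```

Here the shared n-grams are "to", "waste", "money", "to waste", and "waste money". Raising `max_n` rewards longer shared phrases, at the cost of lower scores overall for short answers.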

As for references, check out "Handling and Processing Strings in R" by Gaston Sanchez; it's great.
