Simple comparison of two texts in R


Problem description


I want to compare two texts for similarity, so I need a simple function that lists, clearly and in order of appearance, the words and phrases that occur in both texts. These words/sentences should be highlighted or underlined for better visualization.

Building on @joris Meys' ideas, I added a vector of separators to split the text into sentences and subordinate clauses.

This is what it looks like:

textparts <- function(text) {
  textparts <- c("\\,", "\\.")
  i <- 1
  while (i <= length(textparts)) {
    text <- unlist(strsplit(text, textparts[i]))
    i <- i + 1
  }
  return(text)
}

textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")

commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(", commonWords, ")\\>", sep = "")

for (x in commonWords) {
  textparts1 <- gsub(x, "\\1*", textparts1, ignore.case = TRUE)
  textparts2 <- gsub(x, "\\1*", textparts2, ignore.case = TRUE)
}
return(list(textparts1, textparts2))

However, sometimes it works, sometimes it doesn't.

I would like to get results like these:

>   return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence"         " whereas this is a dependent clause*" " This thing works*"                  

[[2]]
[1] "This could be a sentence"            " whereas this is a dependent clause*" " Plagiarism is not cool"             " This thing works*"           

whereas I get no results at all.

Solution

There are some problems with @Chase's answer:

  • differences in capitalization are not taken into account
  • punctuation can mess up the results
  • if more than one word matches, you get a lot of warnings due to the gsub call.
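
To make the first two points concrete, here is a small illustration. @Chase's code is not reproduced in this thread, so the "naive" variant below is only an assumed stand-in (a plain split on spaces), not his actual implementation. The variable names are made up for the example, and "\\W+" is used here instead of "\\W" so that no empty strings are produced.

```r
text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine."

# Naive approach (assumed for illustration): split on single spaces only.
# "Weather" vs "weather" and "fine" vs "fine." are not recognised as common.
naive_common <- intersect(unlist(strsplit(text1, " ")),
                          unlist(strsplit(text2, " ")))

# Normalised approach: split on runs of non-word characters and lowercase.
words1 <- tolower(unlist(strsplit(text1, "\\W+")))
words2 <- tolower(unlist(strsplit(text2, "\\W+")))
normalised_common <- intersect(words1, words2)

print(naive_common)
print(normalised_common)
```

With the naive split, only "This", "is", "a" and "test." are found in both texts. After lowercasing and splitting on non-word characters, "weather" and "fine" are picked up as well.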

Based on @Chase's idea, the following solution makes use of tolower() and some nice features of regular expressions:

compareSentences <- function(sentence1, sentence2) {
  # split everything on "not a word" and put all to lowercase
  x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
  x2 <- tolower(unlist(strsplit(sentence2, "\\W")))

  commonWords <- intersect(x1, x2)
  #add word beginning and ending and put words between ()
  # to allow for match referencing in gsub
  commonWords <- paste("\\<(",commonWords,")\\>",sep="")


  for(x in commonWords){ 
    # replace the match by the match with star added
    sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
    sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
  }
  return(list(sentence1,sentence2))      
}
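
As a side note on the pattern built inside the function: the sketch below runs a single gsub call on a made-up sentence to show what the word anchors "\\<" and "\\>" and the back-reference "\\1" contribute. The example strings and variable names are illustrative only.

```r
sentence <- "This is a test"

# Without word anchors, the "is" inside "This" is starred as well.
without_anchors <- gsub("(is)", "\\1*", sentence)

# With \\< and \\>, only the standalone word "is" matches.
pattern <- paste("\\<(", "is", ")\\>", sep = "")
with_anchors <- gsub(pattern, "\\1*", sentence, ignore.case = TRUE)

# The back-reference \\1 re-inserts the text as it was matched, so the
# original capitalisation survives even though the pattern is lowercase.
keeps_case <- gsub("\\<(this)\\>", "\\1*", sentence, ignore.case = TRUE)

print(without_anchors)
print(with_anchors)
print(keeps_case)
```

This is why the function can lowercase the words for the comparison and still return the sentences with their original capitalisation.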

Calling compareSentences() on two example texts gives the following result:

text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "

compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"

[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
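
The function above works word by word, while the question asked for whole clauses to be marked. A minimal sketch of how the same idea might be carried over to clauses follows. The helpers splitClauses and markCommonClauses are hypothetical names introduced here, not part of the original answer. The sketch also assumes that surrounding whitespace should be trimmed, so its output differs from the question's desired output by the leading spaces.

```r
# Hypothetical helper: split a text into clauses on commas and periods,
# strip surrounding whitespace and drop empty pieces.
splitClauses <- function(text) {
  parts <- trimws(unlist(strsplit(text, "[,.]")))
  parts[nzchar(parts)]
}

# Hypothetical helper: append "*" to every clause that also occurs
# (ignoring case) in the other text.
markCommonClauses <- function(text1, text2) {
  parts1 <- splitClauses(text1)
  parts2 <- splitClauses(text2)
  common <- intersect(tolower(parts1), tolower(parts2))
  mark <- function(parts) {
    ifelse(tolower(parts) %in% common, paste0(parts, "*"), parts)
  }
  list(mark(parts1), mark(parts2))
}

res <- markCommonClauses(
  "This is a complete sentence, whereas this is a dependent clause. This thing works.",
  "This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works."
)
print(res)
```

Comparing the trimmed, lowercased clauses directly avoids building a regular expression from each clause, so leading spaces and regex metacharacters inside a clause cannot prevent a match.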
