Find similar texts based on paraphrase detection


Problem description

I am interested in finding similar content (text) based on paraphrasing. How do I do this? Are there any specific tools which can do this? In Python, preferably.

Recommended answer

I believe the tool you are looking for is Latent Semantic Analysis (LSA).

Given that my post is going to be quite lengthy, I'm not going to go into much detail explaining the theory behind it; if you think it is indeed what you are looking for, I recommend you look it up. The best place to start would be here:

http://staff.scm.uws.edu.au/~lapark/lt.pdf

In summary, LSA attempts to uncover the underlying/latent meaning of words and phrases, based on the assumption that similar words appear in similar documents. I'll be using R to demonstrate how it works.
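
Before the full function, here is a rough base-R sketch of the underlying idea (my own illustration, not part of the original answer): LSA boils down to a truncated SVD of a weighted term-document matrix, after which documents are compared in the reduced space.

# Illustrative only: LSA as a truncated SVD of a term-document matrix.
# "td.weighted" is assumed to be a weighted terms x documents matrix.
s = svd(td.weighted)
k = 50                                        # latent dimensions to keep
doc.space = diag(s$d[1:k]) %*% t(s$v[, 1:k])  # documents in the latent space
# documents whose columns are close (e.g. by cosine) share latent meaning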

I'm going to set up a function that retrieves similar documents based on their latent meaning:

# Setting up all the needed functions:

SemanticLink = function(text,expression,LSAS,n=length(text),Out="Text"){ 

  # Query Vector
  LookupPhrase = function(phrase,LSAS){ 
    lsatm = as.textmatrix(LSAS) 
    QV = function(phrase){ 
      q = query(phrase,rownames(lsatm)) 
      t(q)%*%LSAS$tk%*%diag(LSAS$sk) 
    } 

    q = QV(phrase) 
    qd = 0 

    for (i in 1:nrow(LSAS$dk)){ 
      qd[i] <- cosine(as.vector(q),as.vector(LSAS$dk[i,])) 
    }  
    qd  
  } 

  # Handling Synonyms
  Syns = function(word){   
    wl    =   gsub("(.*[[:space:]].*)","", 
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","", 
                        unlist(strsplit(PlainTextDocument(synonyms(word)),",")))) 
    wl = wl[wl!=""] 
    return(wl)  
  } 

  ex = unlist(strsplit(expression," "))
  for(i in seq(ex)){ex = c(ex,Syns(ex[i]))}
  ex = unique(wordStem(ex))

  cache = LookupPhrase(paste(ex,collapse=" "),LSAS) 

  if(Out=="Text"){return(text[which(match(cache,sort(cache,decreasing=T)[1:n])!="NA")])} 
  if(Out=="ValuesSorted"){return(sort(cache,decreasing=T)[1:n]) } 
  if(Out=="Index"){return(which(match(cache,sort(cache,decreasing=T)[1:n])!="NA"))} 
  if(Out=="ValuesUnsorted"){return(cache)} 

} 

Note that we make use of synonyms here when assembling our query vector. This approach isn't perfect, because some of the synonyms in the qdap library are remote at best... This may interfere with your search query, so to achieve more accurate but less generalizable results, you can simply get rid of the synonyms step and manually select all the relevant terms that make up your query vector, as in the sketch below.
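
For instance, a minimal sketch of that manual variant (the term list here is purely illustrative and not from the original answer) would replace the synonym-expansion lines inside SemanticLink with a hand-picked set of stemmed terms:

# Inside SemanticLink(), instead of expanding "expression" via Syns(),
# hand-pick the query terms (terms below are just an example):
ex = unique(wordStem(c("support", "trade", "tariff", "export", "commerce")))
cache = LookupPhrase(paste(ex, collapse = " "), LSAS)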

Let's try it out. I'll also be using the US Congress dataset from the RTextTools package:

library(tm)
library(RTextTools)
library(lsa)
library(data.table)
library(stringr)
library(qdap)

data(USCongress)

text = as.character(USCongress$text)

corp = Corpus(VectorSource(text)) 

parameters = list(minDocFreq        = 1, 
                  wordLengths       = c(2,Inf), 
                  tolower           = TRUE, 
                  stripWhitespace   = TRUE, 
                  removeNumbers     = TRUE, 
                  removePunctuation = TRUE, 
                  stemming          = TRUE, 
                  stopwords         = TRUE, 
                  tokenize          = NULL, 
                  weighting         = function(x) weightSMART(x,spec="ltn"))

tdm = TermDocumentMatrix(corp,control=parameters)
tdm.reduced = removeSparseTerms(tdm,0.999)

# setting up LSA space - this may take a little while...
td.mat = as.matrix(tdm.reduced) 
td.mat.lsa = lw_bintf(td.mat)*gw_idf(td.mat) # you can experiment with weightings here
lsaSpace = lsa(td.mat.lsa,dims=dimcalc_raw()) # you don't have to keep all dimensions
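# (Aside, my suggestion rather than the original answer's:) to keep only
# enough dimensions to cover a share of the singular values, you could use
# lsaSpace = lsa(td.mat.lsa, dims = dimcalc_share(share = 0.8))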
lsa.tm = as.textmatrix(lsaSpace)

l = 50 
exp = "support trade" 
SemanticLink(text,exp,n=5,lsaSpace,Out="Text") 

[1] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small businesses, and for other purposes."                                                                       
[2] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel AJ."           
[3] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the yacht EXCELLENCE III."
[4] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel M/V Adios."    
[5] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small business, and for other purposes." 

As you can see, while "support trade" may not appear as such in the documents above, the function has retrieved a set of documents that are relevant to the query. The function is designed to retrieve documents with semantic linkages rather than exact matches.
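
For contrast, an illustrative aside (not from the original answer): a purely literal search would only return documents containing the exact phrase:

# Literal matching for comparison: only documents containing the exact
# string "support trade" are returned here.
grep("support trade", text, ignore.case = TRUE)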

We can also see how "close" these documents are to the query vector by plotting the cosine distances:

plot(1:l,SemanticLink(text,exp,lsaSpace,n=l,Out="ValuesSorted") 
     ,type="b",pch=16,col="blue",main=paste("Query Vector Proximity",exp,sep=" "), 
     xlab="observations",ylab="Cosine") 

I don't have enough reputation yet to include the plot, sorry.

As you can see, the first 2 entries appear to be more closely associated with the query vector than the rest (although about 5 of them are particularly relevant), even though reading through them you might not have thought so. I would say that this is the effect of using synonyms to build the query vector. Ignoring that, however, the graph shows how many other documents are at least remotely similar to the query vector.

Just recently, I had to solve the same problem you are trying to solve, but the above function just wouldn't work well, simply because the data was atrocious: the texts were short, there were very few of them, and not many topics were explored. So to find relevant entries in such situations, I developed another function that is based purely on regular expressions.

Here it is:

HLS.Extract = function(pattern,text=active.text){


  require(qdap)
  require(tm)
  require(RTextTools)

  p = unlist(strsplit(pattern," "))
  p = unique(wordStem(p))
  p = gsub("(.*)i$","\\1y",p)

  Syns = function(word){   
    wl    =   gsub("(.*[[:space:]].*)","",      
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","",  
                        unlist(strsplit(PlainTextDocument(synonyms(word)),",")))) 
    wl = wl[wl!=""] 
    return(wl)     
  } 

  trim = function(x){

    temp_L  = nchar(x)
    if(temp_L < 5)                {N = 0}
    if(temp_L > 4 && temp_L < 8)  {N = 1}
    if(temp_L > 7 && temp_L < 10) {N = 2}
    if(temp_L > 9)                {N = 3}
    x = substr(x,0,nchar(x)-N)
    x = gsub("(.*)","\\1\\\\\\w\\*",x)

    return(x)
  }

  # SINGLE WORD SCENARIO

  if(length(p)<2){

    # EXACT
    p = trim(p)
    ndx_exact  = grep(p,text,ignore.case=T)
    text_exact = text[ndx_exact]

    # SEMANTIC
    p = unlist(strsplit(pattern," "))

    express  = new.exp = list()
    express  = c(p,Syns(p))
    p        = unique(wordStem(express))

    temp_exp = unlist(strsplit(express," "))
    temp.p = double(length(seq(temp_exp)))

    for(j in seq(temp_exp)){
      temp_exp[j] = trim(temp_exp[j])
    }

    rgxp   = paste(temp_exp,collapse="|")
    ndx_s  = grep(paste(temp_exp,collapse="|"),text,ignore.case=T,perl=T)
    text_s = as.character(text[ndx_s])

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s)
  }

  # MORE THAN 2 WORDS

  if(length(p)>1){

    require(combinat)

    # EXACT
    for(j in seq(p)){p[j] = trim(p[j])}

    fp     = factorial(length(p))
    pmns   = permn(length(p))
    tmat   = matrix(0,fp,length(p))
    permut = double(fp)
    temp   = double(length(p))
    for(i in 1:fp){
      tmat[i,] = pmns[[i]]
    }

    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(p[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=" ")
    }

    permut = gsub("[[:space:]]",
                  "[[:space:]]+([[:space:]]*\\\\w{,3}[[:space:]]+)*(\\\\w*[[:space:]]+)?([[:space:]]*\\\\w{,3}[[:space:]]+)*",permut)

    ndx_exact  = grep(paste(permut,collapse="|"),text)
    text_exact = as.character(text[ndx_exact])


    # SEMANTIC

    p = unlist(strsplit(pattern," "))
    express = list()
    charexp = permut = double(length(p))
    for(i in seq(p)){
      express[[i]] = c(p[i],Syns(p[i]))
      express[[i]] = unique(wordStem(express[[i]]))
      express[[i]] = gsub("(.*)i$","\\1y",express[[i]])
      for(j in seq(express[[i]])){
        express[[i]][j] = trim(express[[i]][j])
      }
      charexp[i] = paste(express[[i]],collapse="|")
    }

    charexp  = gsub("(.*)","\\(\\1\\)",charexp)
    charexpX = double(length(p))
    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(charexp[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=
                          "[[:space:]]+([[:space:]]*\\w{,3}[[:space:]]+)*(\\w*[[:space:]]+)?([[:space:]]*\\w{,3}[[:space:]]+)*")
    }
    rgxp   = paste(permut,collapse="|")
    ndx_s  = grep(rgxp,text,ignore.case=T)
    text_s = as.character(text[ndx_s])

    temp.f = function(x){
      if(length(x)==0){x=0}
    }

    temp.f(ndx_exact);  temp.f(ndx_s)
    temp.f(text_exact); temp.f(text_s)

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s,
                    "Synset"        = express)

  }
  return(f.object)
  # NB: the cat() calls below are never reached, because they come after
  # return(); move them above return(f.object) to print the match counts.
  cat(paste("Exact Matches:",length(ndx_exact),sep=""))
  cat(paste("\n"))
  cat(paste("Semantic Matches:",length(ndx_s),sep=""))
}

Give it a try:

HLS.Extract("buy house",
            c("we bought a new house",
              "I'm thinking about buying a new home",
              "purchasing a brand new house"))[["SemanticText"]]

$SemanticText
[1] "I'm thinking about buying a new home" "purchasing a brand new house"

As you can see, the function is quite flexible. It would also pick up "home buying". It didn't pick up "we bought a new house", though, because "bought" is an irregular verb; that's the kind of thing LSA would have picked up.
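
If irregular forms matter for the regex approach, one possible workaround (my suggestion, not part of the original answer) is to lemmatise the text before matching, for example with the textstem package, so that forms like "bought" reduce to "buy":

# Sketch (assumes the textstem package): lemmatise so irregular verb forms
# line up with their base form before the regex matching step.
library(textstem)
lemmatize_strings("we bought a new house")   # "bought" should become "buy"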

So you may like to try both approaches and see which one works better. The SemanticLink function also requires a lot of memory, and when you have a particularly large corpus you won't be able to use it.
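
One rough way to keep memory in check (my assumption; the original answer doesn't spell this out) is to prune the vocabulary more aggressively and keep fewer LSA dimensions when building the space:

# Sketch: shrink the term-document matrix and the LSA space to save memory.
tdm.small = removeSparseTerms(tdm, 0.99)                  # drop very rare terms
td.mat.small = as.matrix(tdm.small)
lsaSpace.small = lsa(lw_bintf(td.mat.small) * gw_idf(td.mat.small),
                     dims = dimcalc_share())              # keep fewer dimensions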

Cheers
