Finding cosine similarity of documents and removing similar ones from an R data frame

Views: 124 · Published: 2020/5/18 1:07:45 · Tags: r, xml, nlp, cosine-similarity


Problem description

I am working with a data frame in which each row holds only a document number and its text. The data was exported from an XML file and is stored as a data frame in the variable text_df:

line  text
1     when uploading objective file bugzilla se
2     spelling mistake docs section searching fo…
3     editparams cgi won save updates iis instal…
4     editparams cgi won save updates
5     rfe unsubscribe from bug you reported
6     unsubscribe from bug you reported

I am using the following code to identify and remove the duplicates.

library(text2vec)
library(magrittr)  # provides %>%

doc_set_1 = text_df
it1 = itoken(doc_set_1$text, progressbar = FALSE)

# specially take different number of docs in second set
doc_set_2 = text_df
it2 = itoken(doc_set_2$text, progressbar = FALSE)

# Build the vocabulary on the full corpus and prune it
it = itoken(text_df$text, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)

# Pairwise cosine similarity between the two document sets
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
mat <- d1_d2_cos_sim
# Zero the lower triangle and the diagonal so each pair is looked at once
mat[lower.tri(mat, diag = TRUE)] <- 0
## for converting a sparse matrix into dataframe
mdf <- as.data.frame(as.matrix(mat))
datalist = list()
for (i in 1:nrow(mat)) {
  t <- which(mat[i, ] > 0.8)
  # note: this only records rows with at least two matches above 0.8
  if (length(t) > 1) {
    datalist[[i]] <- t  # add it to your list
  }
}

# Number of duplicates found
length(unique(unlist(datalist)))

# Drop the columns of the documents flagged as duplicates
tmdf <- subset(mdf, select = -c(unique(unlist(datalist))))

# Removing the similar documents
text_df <- text_df[names(tmdf), ]
nrow(text_df)

This code takes a long time to run. Any suggestions for making it faster are welcome.

Recommended answer

The quanteda library works quite well in this case. Here is an example:

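library(tibble)
library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here since quanteda 3.0

# tibble() replaces the deprecated data_frame()
df <- tibble(text = c("when uploading objective file bugzilla se",
       "spelling mistake docs section searching fo",
       "editparams cgi won save updates iis instal",
       "editparams cgi won save updates",
       "rfe unsubscribe from bug you reported",
       "unsubscribe from bug you reported"))

# Tokenise first, then build the document-feature matrix
DocTerm <- dfm(tokens(df$text))

# Pairwise cosine similarity between documents
textstat_simil(DocTerm, margin = "documents", method = "cosine")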
          text1     text2     text3     text4     text5
text2 0.0000000                                        
text3 0.0000000 0.0000000                              
text4 0.0000000 0.0000000 0.8451543                    
text5 0.0000000 0.0000000 0.0000000 0.0000000          
text6 0.0000000 0.0000000 0.0000000 0.0000000 0.9128709
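As a cross-check on the numbers above, here is a minimal base-R sketch of the same computation. It assumes plain whitespace tokenisation and raw term counts, and the names tokens, vocab, dtm and cos_sim are illustrative only, not part of quanteda. Cosine similarity is the dot product of two count vectors divided by the product of their lengths.

```r
# Sample texts from the answer
texts <- c("when uploading objective file bugzilla se",
           "spelling mistake docs section searching fo",
           "editparams cgi won save updates iis instal",
           "editparams cgi won save updates",
           "rfe unsubscribe from bug you reported",
           "unsubscribe from bug you reported")

# Whitespace tokenisation (an assumption; quanteda's tokeniser does more)
tokens <- strsplit(tolower(texts), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per document, one column per term
dtm <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
dimnames(dtm) <- list(paste0("text", seq_along(texts)), vocab)

# Cosine similarity: dot products divided by the product of the row norms
norms   <- sqrt(rowSums(dtm^2))
cos_sim <- tcrossprod(dtm) / outer(norms, norms)

round(cos_sim["text4", "text3"], 7)  # 0.8451543
round(cos_sim["text6", "text5"], 7)  # 0.9128709
```

For text3 and text4 this is 5 shared terms over sqrt(7) * sqrt(5), which is why the shorter text scores high against the longer one that contains it.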

To keep only the document pairs whose similarity exceeds a given threshold (0.9 here), you can do the following:

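# Pairwise cosine similarities as a plain data frame
mycosinesim <- textstat_simil(DocTerm, margin = "documents", method = "cosine")
myMatcosine <- as.data.frame(as.matrix(mycosinesim))
# Row/column positions of every entry above 0.9
higherthan90 <- as.data.frame(which(myMatcosine > 0.9, arr.ind = TRUE, useNames = TRUE))
# Drop the diagonal (each document compared with itself)
higherthan90[which(higherthan90$row != higherthan90$col), ]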

        row col
text6     6   5
text5.1   5   6

Now you can decide whether to remove text5 or text6, since they are very similar.
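One way to automate that decision is sketched below in base R. It is only an illustration under stated assumptions: the matrix sim is filled by hand with the two non-zero values from the output shown earlier, and the rule "keep the earlier document, drop the later one" is an arbitrary choice, not something the original answer prescribes. The names sim, threshold, drop_idx and df_dedup are hypothetical.

```r
# Similarity matrix filled with the values from the output shown earlier
doc_names <- paste0("text", 1:6)
sim <- diag(6)
dimnames(sim) <- list(doc_names, doc_names)
sim["text4", "text3"] <- sim["text3", "text4"] <- 0.8451543
sim["text6", "text5"] <- sim["text5", "text6"] <- 0.9128709

threshold <- 0.9  # same cut-off as in the answer

# Keep only entries above the diagonal so each pair is counted once
sim[lower.tri(sim, diag = TRUE)] <- 0
pairs <- which(sim > threshold, arr.ind = TRUE)

# Assumed rule: keep the earlier document of each pair, drop the later one
drop_idx <- as.integer(unique(pairs[, "col"]))

df <- data.frame(text = c("when uploading objective file bugzilla se",
                          "spelling mistake docs section searching fo",
                          "editparams cgi won save updates iis instal",
                          "editparams cgi won save updates",
                          "rfe unsubscribe from bug you reported",
                          "unsubscribe from bug you reported"),
                 stringsAsFactors = FALSE)

# Guard against the case where nothing exceeds the threshold
df_dedup <- if (length(drop_idx) > 0) df[-drop_idx, , drop = FALSE] else df
nrow(df_dedup)  # 5
```

Because which() with arr.ind = TRUE works on the whole matrix at once, no row-by-row loop is needed. With the threshold at 0.9 only text6 is dropped; lowering it to 0.8 would also drop text4.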
