Treat words separated by space in the same manner


Question

I am trying to find words that occur in multiple documents at the same time.

Let us take an example.

doc1: "this is a document about milkyway"
doc2: "milky way is huge"

As you can see in the two documents above, the word "milkyway" occurs in both, but in the second document it is split by a space ("milky way"), while in the first document it is not.

I am doing the following to get the term-document matrix in R.

library(tm)
# Build a corpus from the two example documents (doc1, doc2)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
# Term-document matrix: lowercase, drop numbers, punctuation and stopwords
tmpDTM <- TermDocumentMatrix(tmp.corpus,
                             control = list(tolower = TRUE, removeNumbers = TRUE,
                                            removePunctuation = TRUE, stopwords = TRUE,
                                            wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df

         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1

As per the above matrix, the term milkyway is present only in the first document.

I want to get a 1 in both documents for the term "milkyway" in the above matrix. This is just an example; I need to do this for a lot of documents. Ultimately I want to be able to treat such words ("milkyway" and "milky way") in the same manner.

Edit 1:

Can't I force the term-document matrix to be computed so that, for whatever word it is looking for, it does not only look for that word as a separate word in the string but also within strings? For example, one term is milky and there is a document "this is milkyway". Currently milky does not occur in this document, but if the algorithm also looked for the word inside strings, it would find milky within the string milkyway. That way the words milky and way would be counted in both of my documents (earlier example).
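The substring idea described in this edit can be sketched in base R without tm. This is only an illustration under assumptions: the `docs` vector re-creates the two example documents, and the `terms` vector is a hand-picked list that does not come from the original post.

```r
# Hypothetical example data: the two documents from the question
docs <- c(doc1 = "this is a document about milkyway",
          doc2 = "milky way is huge")
# Terms to look for (chosen by hand for this illustration)
terms <- c("milky", "way", "milkyway")

# Remove all whitespace so that "milky way" and "milkyway" look identical
squashed <- gsub("\\s+", "", tolower(docs))
names(squashed) <- names(docs)

# For each document, test whether each term occurs anywhere inside it
hits <- sapply(squashed, function(d) {
  vapply(terms, function(t) grepl(t, d, fixed = TRUE), logical(1))
})
hits <- hits * 1L  # convert TRUE/FALSE to 1/0
hits
```

With these two documents every term is found in both columns. Note that matching inside strings can also produce false positives (for example "way" inside "always", or matches that span two unrelated neighbouring words once the spaces are removed), so this is probably only reasonable for a controlled list of terms.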

Edit 2:

Ultimately I want to be able to calculate the cosine similarity between documents.

Recommended answer

You will first need to convert the documents to a bag of primitive words representation, where each primitive word is matched with a set of words. The primitive word itself can also appear in the corpus.

For example:

milkyway -> {milky, milky way, milkyway} 
economy -> {economics, economy}
sport -> {soccer, football, basket ball, basket, NFL, NBA}
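One possible way to apply such a mapping is to rewrite each variant to its primitive word before tokenising, and then count terms. The sketch below uses base R only; the `variants` dictionary, the `normalise` and `cosine` helpers, and the whitespace tokeniser are all assumptions made for illustration, not part of the original answer or of the tm package.

```r
# Hypothetical example data: the two documents from the question
docs <- c(doc1 = "this is a document about milkyway",
          doc2 = "milky way is huge")

# Hand-written variant -> primitive-word dictionary (illustrative only).
# Multi-word variants must be replaced before the text is tokenised.
variants <- c("milky way" = "milkyway")

normalise <- function(text, dict) {
  text <- tolower(text)
  for (v in names(dict)) {
    # \\b restricts the replacement to whole-word boundaries
    text <- gsub(paste0("\\b", v, "\\b"), dict[[v]], text, perl = TRUE)
  }
  text
}

norm_docs <- normalise(docs, variants)
names(norm_docs) <- names(docs)

# Simple whitespace tokeniser and term-document count matrix
tokens <- strsplit(norm_docs, "\\s+")
vocab <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = vocab)))
tdm["milkyway", ]

# Cosine similarity between the two document columns
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(tdm[, "doc1"], tdm[, "doc2"])
```

After normalisation both documents have a count of 1 for milkyway, so the term contributes to the cosine similarity. The same normalised text could instead be passed to tm to build the term-document matrix with stopword removal.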

You can build such a dictionary before computing the cosine distance, using both a synonym dictionary and an edit distance such as Levenshtein, which will complete the synonym dictionary.

Computing the "sport" key is more complicated.
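As a rough sketch of the edit-distance part, base R's `adist()` (in the utils package) computes Levenshtein distances between strings. The `terms` list and the threshold of 1 below are arbitrary assumptions for illustration.

```r
# Candidate terms, e.g. collected from the corpus vocabulary (hypothetical list)
terms <- c("milkyway", "milky way", "milky", "economy", "economics")

# Levenshtein distances between all pairs of terms
d <- adist(terms)
dimnames(d) <- list(terms, terms)

# Treat pairs within a small distance as spelling variants of each other.
# The threshold of 1 is an arbitrary choice for this illustration.
threshold <- 1
pairs <- which(d <= threshold & upper.tri(d), arr.ind = TRUE)
data.frame(term_a = terms[pairs[, "row"]],
           term_b = terms[pairs[, "col"]],
           distance = d[pairs])
```

With this threshold only "milkyway" and "milky way" are paired. Edit distance catches spacing and spelling variants, but pairs such as economy and economics are further apart, and true synonyms are not close as strings at all, which is where the synonym dictionary is needed.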
