Keeping Track of Word Proximity
Question
I am working on a small project that involves dictionary-based text searching within a collection of documents. My dictionary contains positive signal words (a.k.a. good words), but simply finding one of them in a document does not guarantee a positive result, because negative words (for example "not" or "not significant") may occur close to the positive words. I want to construct a matrix that contains the document number, the positive word, and its proximity to negative words.
Can anyone suggest a way to do that? My project is at a very early stage, so I am giving only a basic example of my text.
No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.
This is my example document, in which candesartan cilexetil, glyburide, nifedipine, digoxin, warfarin, and hydrochlorothiazide are my positive words and "no significant" is my negative word. I want to do a word-based proximity mapping between my positive and negative words.
Can anyone give some helpful pointers?
Recommended answer
First of all, I would suggest not using R for this task. R is great for many things, but text manipulation is not one of them. Python could be a good alternative.
That said, if I were to implement this in R, I would probably do something like this (very, very rough):
# You will probably read these from an external file or a database
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide")
badWords <- c("no significant", "other drugs")
mytext <- "no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide."
mytext <- tolower(mytext) # Let's make life a little bit easier...
goodPos <- NULL
badPos <- NULL
# First we find the good words
for (w in goodWords)
{
  # fixed = TRUE matches w literally rather than as a regular expression
  pos <- regexpr(w, mytext, fixed = TRUE)
  if (pos != -1)
  {
    cat(paste(w, "found at position", pos, "\n"))
  }
  else
  {
    pos <- NA
    cat(paste(w, "not found\n"))
  }
  goodPos <- c(goodPos, pos)
}
# And then the bad words
for (w in badWords)
{
  # fixed = TRUE matches w literally rather than as a regular expression
  pos <- regexpr(w, mytext, fixed = TRUE)
  if (pos != -1)
  {
    cat(paste(w, "found at position", pos, "\n"))
  }
  else
  {
    pos <- NA
    cat(paste(w, "not found\n"))
  }
  badPos <- c(badPos, pos)
}
# Note that we use -badPos so that we can calculate the distance with rowSums.
# regexpr returns character positions, so these distances are in characters, not words.
comb <- expand.grid(goodPos, -badPos)
wordcomb <- expand.grid(goodWords, badWords)
dst <- cbind(wordcomb, abs(rowSums(comb)))
mn <- which.min(dst[,3])
cat(paste("The closest good-bad word pair is: ", dst[mn, 1],"-", dst[mn, 2],"\n"))
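The R code above measures distance in characters and only reports the single closest pair. Since the question asks for a word-based matrix per document, and Python was suggested as an alternative, here is a rough sketch of that variant. The function names (find_phrase_positions, proximity_rows), the tokenization rule (lowercase runs of letters and digits), and the choice to measure distance between the first tokens of each phrase are all my own assumptions, not something from the question; adjust them to your data.

```python
import re


def find_phrase_positions(tokens, phrase):
    """Return every token index at which the (possibly multi-word) phrase starts."""
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def proximity_rows(doc_id, text, good_words, bad_words):
    """Build rows of (doc_id, good word, nearest bad word, distance in words).

    Assumption: distance is the number of tokens between the start of the
    good phrase and the start of the bad phrase.
    """
    # Assumption: a token is a run of lowercase letters/digits; punctuation is dropped.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bad_hits = [(b, p) for b in bad_words for p in find_phrase_positions(tokens, b)]
    rows = []
    for g in good_words:
        for gp in find_phrase_positions(tokens, g):
            if not bad_hits:
                # No negative word in this document at all
                rows.append((doc_id, g, None, None))
                continue
            bad, bp = min(bad_hits, key=lambda hit: abs(hit[1] - gp))
            rows.append((doc_id, g, bad, abs(bp - gp)))
    return rows


good_words = ["candesartan cilexetil", "glyburide", "nifedipine", "digoxin",
              "blabla", "warfarin", "hydrochlorothiazide"]
bad_words = ["no significant", "other drugs"]
text = ("No significant drug interactions have been reported in studies of "
        "candesartan cilexetil given with other drugs such as glyburide, "
        "nifedipine, digoxin, warfarin, hydrochlorothiazide.")

for row in proximity_rows(1, text, good_words, bad_words):
    print(row)
```

Calling proximity_rows once per document and concatenating the results gives the document number / positive word / proximity table the question describes. Good words that do not occur in a document (such as "blabla" here) simply produce no row.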