Keeping Track of Word Proximity


Question

I am working on a small project that involves dictionary-based text searching within a collection of documents. My dictionary has positive signal words (a.k.a. good words), but in the document collection, just finding a word does not guarantee a positive result, because negative words (for example "not" or "not significant") may appear in the proximity of these positive words. I want to construct a matrix that contains the document number, the positive word, and its proximity to negative words.

Can anyone please suggest a way to do that? My project is at a very early stage, so I am giving only a basic example of my text.

No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.   

This is my example document, in which "candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "warfarin", and "hydrochlorothiazide" are my positive words and "no significant" is my negative word. I want to do a proximity (word-based) mapping between my positive and negative words.

Can anyone give some helpful pointers?

Recommended answer

First of all, I would suggest not using R for this task. R is great for many things, but text manipulation is not one of them. Python could be a good alternative.

That said, if I were to implement this in R, I would probably do something like this (very, very rough):

# You will probably read these from an external file or a database
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide")
badWords <- c("no significant", "other drugs")

mytext <- "no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide."
mytext <- tolower(mytext) # Let's make life a little bit easier...

goodPos <- NULL
badPos <- NULL

# First we find the good words
for (w in goodWords)
    {
    pos <- regexpr(w, mytext, fixed = TRUE) # match the word literally, not as a regex
    if (pos != -1)
        {
        cat(paste(w, "found at position", pos, "\n"))
        }
    else    
        {
        pos <- NA
        cat(paste(w, "not found\n"))
        }

    goodPos <- c(goodPos, pos)
    }

# And then the bad words
for (w in badWords)
    {
    pos <- regexpr(w, mytext, fixed = TRUE) # match the word literally, not as a regex
    if (pos != -1)
        {
        cat(paste(w, "found at position", pos, "\n"))
        }
    else    
        {
        pos <- NA
        cat(paste(w, "not found\n"))
        }

    badPos <- c(badPos, pos)
    }

# Note that we use -badPos so that we can calculate the distance with rowSums.
# Positions are character offsets, so the distance is in characters, not words.
comb <- expand.grid(goodPos, -badPos)
wordcomb <- expand.grid(goodWords, badWords)
dst <- cbind(wordcomb, abs(rowSums(comb)))

mn <- which.min(dst[,3])
cat(paste("The closest good-bad word pair is: ", dst[mn, 1],"-", dst[mn, 2],"\n"))
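Since the question asks for word-based proximity and a table keyed by document number, here is a minimal Python sketch of that idea. It is only an illustration under some assumptions, not a definitive implementation: the helper names (`tokenize`, `phrase_positions`, `proximity_rows`) are hypothetical, tokens are taken to be runs of letters and digits, and the distance between two phrases is taken to be the gap between their starting word indices.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_positions(tokens, phrase):
    """Return every word index at which the (possibly multi-word) phrase starts."""
    target = tokenize(phrase)
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]


def proximity_rows(docs, good_words, bad_words):
    """Build rows of (document number, good word, bad word, distance in words).

    Assumption: the distance for a pair is the smallest gap between the
    starting word indices of any occurrence of the two phrases.
    """
    rows = []
    for doc_no, text in enumerate(docs, start=1):
        tokens = tokenize(text)
        for good, bad in product(good_words, bad_words):
            good_pos = phrase_positions(tokens, good)
            bad_pos = phrase_positions(tokens, bad)
            if not good_pos or not bad_pos:
                continue  # skip pairs where either phrase is absent
            dist = min(abs(g - b) for g in good_pos for b in bad_pos)
            rows.append((doc_no, good, bad, dist))
    return rows


docs = [
    "No significant drug interactions have been reported in studies of "
    "candesartan cilexetil given with other drugs such as glyburide, "
    "nifedipine, digoxin, warfarin, hydrochlorothiazide."
]
good_words = ["candesartan cilexetil", "glyburide", "nifedipine",
              "digoxin", "warfarin", "hydrochlorothiazide"]
bad_words = ["no significant"]

rows = proximity_rows(docs, good_words, bad_words)
for row in rows:
    print(row)
```

For the example sentence, "candesartan cilexetil" starts 10 words after "no significant" and "hydrochlorothiazide" starts 22 words after it. Unlike the character offsets returned by `regexpr` in the R code, these distances do not depend on how long the intervening words are.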
