How to do fuzzy pattern matching with quanteda and kwic?

Question

I have texts written by doctors and I want to be able to highlight specific words in their context (5 words before and 5 words after the word I search for in their text). Say I want to search for the word 'suicidal'. I would then use the kwic function in the quanteda package:

kwic(dataset, pattern = "suicidal", window = 5)

So far, so good, but say I want to allow for the possibility of typos. In this case I want to allow for three deviating characters, with no restriction on where in the word these are made.

Is it possible to do this using quanteda's kwic-function?

Example:

dataset <- data.frame("patient" = 1:9, "text" = c("On his first appointment, the patient was suicidal when he showed up in my office", 
                                  "On his first appointment, the patient was suicidaa when he showed up in my office",
                                  "On his first appointment, the patient was suiciaaa when he showed up in my office",
                                  "On his first appointment, the patient was suicaaal when he showed up in my office",
                                  "On his first appointment, the patient was suiaaaal when he showed up in my office",
                                  "On his first appointment, the patient was saacidal when he showed up in my office",
                                  "On his first appointment, the patient was suaaadal when he showed up in my office",
                                  "On his first appointment, the patient was icidal when he showed up in my office",
                                  "On his first appointment, the patient was uicida when he showed up in my office"))

dataset$text <- as.character(dataset$text)
kwic(dataset$text, pattern = "suicidal", window = 5)

This would only give me the first, correctly spelled, sentence.

Recommended answer

Great question. We don't have approximate matching as a "valuetype" but that's an interesting idea for future development. In the meantime, I'd suggest generating a list of fixed fuzzy matches using base::agrep() and then matching on those. So this would look like:

library("quanteda")
## Package version: 1.5.2

dataset <- data.frame(
  "patient" = 1:9, "text" = c(
    "On his first appointment, the patient was suicidal when he showed up in my office",
    "On his first appointment, the patient was suicidaa when he showed up in my office",
    "On his first appointment, the patient was suiciaaa when he showed up in my office",
    "On his first appointment, the patient was suicaaal when he showed up in my office",
    "On his first appointment, the patient was suiaaaal when he showed up in my office",
    "On his first appointment, the patient was saacidal when he showed up in my office",
    "On his first appointment, the patient was suaaadal when he showed up in my office",
    "On his first appointment, the patient was icidal when he showed up in my office",
    "On his first appointment, the patient was uicida when he showed up in my office"
  ),
  stringsAsFactors = FALSE
)
corp <- corpus(dataset)

# get unique words
vocab <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE) %>%
  types()

Then use agrep() to generate the closest fuzzy matches. Here I ran this a few times, increasing max.distance slightly each time from the default of 0.1.

# get closest matches to "suicidal"
near_matches <- agrep("suicidal", vocab,
  max.distance = 0.3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
## [1] "suicidal" "suicidaa" "suiciaaa" "suicaaal" "suiaaaal" "saacidal" "suaaadal"
## [8] "icidal"   "uicida"
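
A note on how this relates to the "three deviating characters" in the question: when max.distance is given as a fraction, agrep() scales it by the pattern length and rounds up, so 0.3 on the 8-character pattern "suicidal" allows up to 3 edits. agrep() also matches approximately within each string, so it can pick up longer tokens that merely contain a near match. If a whole-token limit of three edits is what you want, one possible alternative is utils::adist(), sketched below. The names vocab_demo, dists and within_three are made up for illustration, and vocab_demo is a hand-written stand-in for the vocab object above.

```r
# Hypothetical vocabulary standing in for types(tokens(corp)) from above.
vocab_demo <- c(
  "suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
  "saacidal", "suaaadal", "icidal", "uicida",
  "patient", "office", "appointment", "showed"
)

# Whole-token generalized Levenshtein distance to the target word.
# adist() returns a 1 x length(vocab_demo) matrix; drop() turns it into a vector.
dists <- drop(utils::adist("suicidal", vocab_demo, ignore.case = TRUE))

# Keep only tokens that differ from "suicidal" by at most three edits.
within_three <- vocab_demo[dists <= 3]
within_three
```

The resulting character vector can be passed to kwic() as the pattern in the same way as near_matches.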

Then, use this as the pattern argument to kwic():

# use these for fuzzy matching
kwic(corp, near_matches, window = 3)
##                                                        
##  [text1, 9] the patient was | suicidal | when he showed
##  [text2, 9] the patient was | suicidaa | when he showed
##  [text3, 9] the patient was | suiciaaa | when he showed
##  [text4, 9] the patient was | suicaaal | when he showed
##  [text5, 9] the patient was | suiaaaal | when he showed
##  [text6, 9] the patient was | saacidal | when he showed
##  [text7, 9] the patient was | suaaadal | when he showed
##  [text8, 9] the patient was |  icidal  | when he showed
##  [text9, 9] the patient was |  uicida  | when he showed
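
The call above was run with quanteda 1.5.2. As far as I know, later releases (version 3 and up) expect kwic() to be given a tokens object rather than a corpus, so on a current installation the same idea would look roughly like the sketch below. Treat the version boundary as an assumption and check your installed version. The texts and the shortened near_matches vector are illustrative only, and the quanteda calls are guarded so the snippet does nothing if the package is not installed.

```r
# Shortened, illustrative list of fuzzy matches (see the agrep() step above).
near_matches <- c("suicidal", "suicidaa", "icidal")

if (requireNamespace("quanteda", quietly = TRUE)) {
  # Small illustrative texts, not the full example data.
  txt <- c(
    text1 = "the patient was suicidal when he showed up",
    text2 = "the patient was suicidaa when he showed up",
    text3 = "the patient was icidal when he showed up"
  )
  # Tokenize first, then run kwic() on the tokens object.
  toks <- quanteda::tokens(txt, remove_punct = TRUE)
  # valuetype = "fixed" treats each pattern as a literal token.
  hits <- quanteda::kwic(toks, pattern = near_matches,
                         valuetype = "fixed", window = 3)
  print(hits)
}
```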

There are other possibilities based on similar solutions, for instance the fuzzyjoin or stringdist packages, but this is a simple solution from the base package that should work pretty well.
