How to calculate proximity of words to a specific term in a document


Problem Description


I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So let's say I have the following text:

song <- "Far over the misty mountains cold To dungeons deep and caverns old We 
must away ere break of day To seek the pale enchanted gold. The dwarves of 
yore made mighty spells, While hammers fell like ringing bells In places deep, 
where dark things sleep, In hollow halls beneath the fells. For ancient king 
and elvish lord There many a gleaming golden hoard They shaped and wrought, 
and light they caught To hide in gems on hilt of sword. On silver necklaces 
they strung The flowering stars, on crowns they hung The dragon-fire, in 
twisted wire They meshed the light of moon and sun. Far over the misty 
mountains cold To dungeons deep and caverns old We must away, ere break of 
day, To claim our long-forgotten gold. Goblets they carved there for 
themselves And harps of gold; where no man delves There lay they long, and 
many a song Was sung unheard by men or elves. The pines were roaring on the 
height, The winds were moaning in the night. The fire was red, it flaming 
spread; The trees like torches blazed with light. The bells were ringing in 
the dale And men they looked up with faces pale; The dragon’s ire more fierce 
than fire Laid low their towers and houses frail. The mountain smoked beneath 
the moon; The dwarves they heard the tramp of doom. They fled their hall to 
dying fall Beneath his feet, beneath the moon. Far over the misty mountains 
grim To dungeons deep and caverns dim We must away, ere break of day,
To win our harps and gold from him!"


I want to be able to see what words appear within 15 (I would like this number to be interchangeable) words on either side (15 to the left and 15 to the right) of the word "fire" (also interchangeable) every time it appears. I want to see each word and the number of times it appears in this 15 word span for each instance of "fire." So, for example, "fire" is used 3 times. Of those 3 times the word "light" falls within 15 words on either side twice. I would want a table that shows the word, the number of times it appears within the specified proximity of 15, the maximum distance (which in this case is 12), the minimum distance (which is 7), and the average distance (which is 9.5).


I figured I would need several steps and packages to make this work. My first thought was to use the "kwic" function from quanteda since it allows you to choose a "window" around a specific term. Then a frequency count of terms based on the kwic results is not that hard (with stopwords removed for the frequency, but not for the word proximity measure). My real problem is finding the maximum, minimum, and average distances from the focus term and then getting the results into a nice neat table with the terms as rows in descending order by frequency and the columns giving me the frequency count, max distance, minimum distance, and average distance.

Here's what I have so far:

library(quanteda)
library(tm)

# lowercase the text
mysong <- char_tolower(song)

# tokenize, dropping hyphens, punctuation, numbers, and symbols
toks <- tokens(mysong, remove_hyphens = TRUE, remove_punct = TRUE, 
remove_numbers = TRUE, remove_symbols = TRUE)

# keywords-in-context: a 15-word window on either side of "fire"
mykwic <- kwic(toks, "fire", window = 15, valuetype = "fixed")
thekwic <- as.character(mykwic)

# clean the kwic windows before counting term frequencies
thekwic <- removePunctuation(thekwic)
thekwic <- removeNumbers(thekwic)
thekwic <- removeWords(thekwic, stopwords("en"))

kwicFreq <- termFreq(thekwic)
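
As an aside to the code above, termFreq() returns a named term-frequency vector, so the most frequent neighbors can already be inspected with a one-liner (this peek is just illustrative, not part of the proximity calculation):

# top ten co-occurring terms by count
head(sort(kwicFreq, decreasing = TRUE), 10)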

Any help is greatly appreciated.

Answer


I'd suggest solving this with a combination of my tidytext and fuzzyjoin packages.


You can start by tokenizing it into a one-row-per-word data frame, adding a position column, and removing stopwords:

library(tidytext)
library(dplyr)

# one row per word, with its position in the text, stopwords removed
all_words <- data_frame(text = song) %>%
  unnest_tokens(word, text) %>%
  mutate(position = row_number()) %>%
  filter(!word %in% tm::stopwords("en"))
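
As a quick sanity check (purely illustrative, not part of the pipeline), the first rows should show each word alongside its position in the song:

head(all_words)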


You can then find just the word fire, and use difference_inner_join() from fuzzyjoin to find all rows within 15 words of those rows. You can then use group_by() and summarize() to get your desired statistics for each word.

library(fuzzyjoin)

# each occurrence of "fire", joined to every word within 15 positions of it
nearby_words <- all_words %>%
  filter(word == "fire") %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

# per-word frequency and distance statistics, most frequent first
words_summarized <- nearby_words %>%
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))

The output in this case:

# A tibble: 49 × 5
       word number maximum_distance minimum_distance average_distance
      <chr>  <int>            <dbl>            <dbl>            <dbl>
 1     fire      3                0                0              0.0
 2    light      2               12                7              9.5
 3     moon      2               13                9             11.0
 4    bells      1               14               14             14.0
 5  beneath      1               11               11             11.0
 6   blazed      1               10               10             10.0
 7   crowns      1                5                5              5.0
 8     dale      1               15               15             15.0
 9   dragon      1                1                1              1.0
10 dragon’s      1                5                5              5.0
# ... with 39 more rows


Note that this approach also lets you perform the analysis on multiple focus words at once. All you'd have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")), and change group_by(word) to group_by(focus_term, word), focus_term being the column created in the select() step; see the sketch below.
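
Putting those two changes together, here's a minimal sketch of the multi-word variant ("gold" is used purely as an illustrative second focus word):

# same pipeline, but with a set of focus words instead of just "fire"
multi_nearby <- all_words %>%
  filter(word %in% c("fire", "gold")) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

# statistics per focus word / neighbor pair
multi_nearby %>%
  group_by(focus_term, word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(focus_term, desc(number))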
