How to search for specific n-grams in a corpus using R


Problem description

I'm looking for specific n-grams in a corpus. Let's say I want to find 'asset management' and 'historical yield' in a collection of documents.

This is how I loaded the corpus:

my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"), 
                     readerControl = list(reader = readPDF))
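Note that readPDF relies on an external PDF extraction engine. A minimal variant naming the engine explicitly, in case the default is not available on your system (an assumption about the setup, not part of the original question):

# Assumption: the pdftotext utility (xpdf/poppler) is installed and on the PATH.
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
                     readerControl = list(reader = readPDF(engine = "xpdf")))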

I cleaned the corpus and did some basic calculations using document term matrices. Now I want to look for particular expressions and put them in a dataframe. This is what I use (thanks to phiver):

ngrams <- c('asset management', 'historical yield')
dtm_ngrams <- DocumentTermMatrix(my_corpus, control = list(dictionary = ngrams))
df_ngrams <- data.frame(Docs = dtm_ngrams$dimnames$Docs, as.matrix(dtm_ngrams), row.names = NULL)

This code runs, but the result is 0 for both n-grams. So I'm guessing the problem is that the dictionary is not defined correctly, because R doesn't pick up the space between the words. So far I have tried putting '' between the words, [:space:], and some other solutions.

Answer

A document term matrix without any further manipulation contains only single words (and, by default, only words of 3 or more characters). If you want bigrams, you need to create a term matrix of bigrams (or uni- and bigrams).
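To see why the counts come back as 0, here is a minimal illustration (using the built-in crude data set that also appears below): with the default tokenizer every term is a single word, so a dictionary entry containing a space can never match.

library(tm)

data("crude")

# Default tokenization yields unigrams only, so the two-word dictionary
# entry "crude oil" never appears as a term and every count is 0.
dtm_uni <- DocumentTermMatrix(crude, control = list(dictionary = c("crude oil")))
sum(as.matrix(dtm_uni))  # 0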

Based on your example, and using just tm and NLP (which is loaded as soon as you load tm), we can make a bigram tokenizer. For a multi-gram tokenizer, see the comment in the code.

Using the built-in crude data set again.

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

# This tokenizer is built on NLP and creates bigrams. 
# If you want multi-grams specify 1:2 for uni- and bi-gram, 
# 2:3 for bi- and trigram, 1:3 for uni-, bi- and tri-grams.
# etc. etc. ...(ngrams(words(x), 1:3)...

bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

my_words <- c("crude oil", "west texas")

dtm <- DocumentTermMatrix(crude, control=list(tokenizer = bigram_tokenizer, dictionary = my_words))

inspect(dtm)
<<DocumentTermMatrix (documents: 20, terms: 2)>>
Non-/sparse entries: 11/29
Sparsity           : 72%
Maximal term length: 10
Weighting          : term frequency (tf)
Sample             :
     Terms
Docs  crude oil west texas
  127         2          1
  144         0          0
  191         2          0
  194         1          2
  211         0          0
  273         2          0
  349         1          0
  353         1          0
  543         1          1
  708         1          0
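If you want unigrams and bigrams in the same matrix, here is a sketch of the 1:2 variant mentioned in the comment above, run on the same crude corpus (the names multigram_tokenizer and dtm_multi are mine, for illustration):

# Uni- and bigram tokenizer: ngrams(words(x), 1:2) as per the comment above.
multigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)
}

dtm_multi <- DocumentTermMatrix(crude,
                                control = list(tokenizer = multigram_tokenizer,
                                               dictionary = c("oil", "crude oil")))
inspect(dtm_multi)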

After this, you can follow your own code to put the results in a data frame.
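A minimal sketch of that last step, reusing the pattern from the question (check.names = FALSE keeps the space in column names such as "crude oil"):

# One row per document, one column per dictionary term.
df_ngrams <- data.frame(Docs = dtm$dimnames$Docs,
                        as.matrix(dtm),
                        row.names = NULL,
                        check.names = FALSE)
df_ngrams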
