Create Document Term Matrix with N-Grams in R
Question
I am using the "tm" package to create a DocumentTermMatrix in R. It works well for unigrams, but I am trying to create a DocumentTermMatrix of N-grams (N = 3 for now) using the tm package and the tokenize_ngrams function from the "tokenizers" package. However, I am not able to create it.
I searched for a possible solution but didn't get much help. For privacy reasons I cannot share the data. Here is what I have tried:
library(tm)
library(tokenizers)
data is a dataframe with around 4.5k rows and two columns, namely "doc_id" and "text".
data_corpus = Corpus(DataframeSource(data))
Custom function for n-gram tokenization:
ngram_tokenizer = function(x){
  # generate all n-grams from 1 up to 3 words, joined with "_"
  temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
  return(temp)
}
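For context, this is roughly what tokenize_ngrams does on its own (a toy sentence of my choosing, not data from the post); it returns a list with one character vector of n-grams per input string:

library(tokenizers)

tokenize_ngrams("the quick brown fox", n = 3, n_min = 1, ngram_delim = "_")
# a list with one element: a character vector containing all
# 1-, 2- and 3-grams, e.g. "the", "the_quick", "the_quick_brown", ...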
Control list for DTM creation:

1-gram:
control_list_unigram = list(tokenize = "words",
                            removePunctuation = FALSE,
                            removeNumbers = FALSE,
                            stopwords = stopwords("english"),
                            tolower = T,
                            stemming = T,
                            weighting = function(x)
                              weightTf(x)
                            )
For N-gram tokenization:
control_list_ngram = list(tokenize = ngram_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE,
                          stopwords = stopwords("english"),
                          tolower = T,
                          stemming = T,
                          weighting = function(x)
                            weightTf(x)
                          )
dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
dim(dtm_unigram)
dim(dtm_ngram)
The dimensions of both DTMs were the same. Please correct me!
Answer
Unfortunately, tm has some quirks that are annoying and not always clear. First of all, tokenizing doesn't seem to work on corpora created with Corpus(); you need to use VCorpus() for this. (Corpus() typically gives you a SimpleCorpus, whose DocumentTermMatrix processing uses a fixed internal tokenizer and ignores a custom tokenize function, which also explains why both of your DTMs came out the same.)
So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).
That is one issue tackled. Now the corpus will work for tokenizing, but you will run into an issue with tokenize_ngrams: tm hands the tokenizer a PlainTextDocument rather than a plain character vector, which tokenize_ngrams cannot handle. You will get the following error:
Input must be a character vector of any length or a list of character
vectors, each of which has a length of 1.
when you run this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
To solve this, and to avoid a dependency on the tokenizers package, you can use the following function to tokenize the data:
NLP_tokenizer <- function(x) {
  # all 1-, 2- and 3-grams of the document's words, joined with "_"
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}
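As a quick sanity check (my own toy example, not from the original post), you can apply the tokenizer to a single document. words() expects a TextDocument, so the string is wrapped in a PlainTextDocument:

library(tm)  # attaches NLP, which provides words() and ngrams()

doc <- PlainTextDocument("the quick brown fox")
NLP_tokenizer(doc)
# expected, roughly:
# "the" "quick" "brown" "fox"
# "the_quick" "quick_brown" "brown_fox"
# "the_quick_brown" "quick_brown_fox"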
This uses the ngrams function from the NLP package, which is loaded when you load the tm package. 1:3 tells it to create n-grams of 1 to 3 words. So your control_list_ngram should look like this:
control_list_ngram = list(tokenize = NLP_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE,
                          stopwords = stopwords("english"),
                          tolower = T,
                          stemming = T,
                          weighting = function(x)
                            weightTf(x)
                          )
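Putting the pieces together, the corrected flow would look like the sketch below (variable names follow the question; the actual dimensions depend on your data):

data_corpus = VCorpus(DataframeSource(data))

dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram   = DocumentTermMatrix(data_corpus, control_list_ngram)

dim(dtm_unigram)  # documents x unigram terms
dim(dtm_ngram)    # documents x 1-/2-/3-gram terms; now noticeably wider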
Personally, I would use the quanteda package for all of this work, but for now this should help you.
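For what it's worth, a rough quanteda equivalent might look like this sketch (my own addition, not part of the original answer; it assumes the same data frame with "doc_id" and "text" columns):

library(quanteda)

corp <- corpus(data, docid_field = "doc_id", text_field = "text")
toks <- tokens(corp)                              # punctuation and numbers kept, as above
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks, language = "english")
toks <- tokens_ngrams(toks, n = 1:3, concatenator = "_")
dfm_ngram <- dfm(toks)  # quanteda's counterpart of a DocumentTermMatrix
dim(dfm_ngram)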