Create Document Term Matrix with N-Grams in R


Problem Description

I am using the "tm" package to create a DocumentTermMatrix in R. It works well for one-grams, but I am trying to create a DocumentTermMatrix of N-grams (N = 3 for now) using the tm package together with the tokenize_ngrams function from the "tokenizers" package, and I am not able to create it.

I searched for possible solutions but did not get much help. For privacy reasons I cannot share the data. Here is what I have tried:

library(tm)  
library(tokenizers)

data is a data frame with around 4.5k rows and 2 columns, namely "doc_id" and "text".
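For reference, a minimal, hypothetical stand-in for the real data (which is not shared for privacy reasons) would look like the following; DataframeSource() expects the first two columns to be named "doc_id" and "text":

# Hypothetical sample data frame, only to illustrate the expected shape
data <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("first sample document text", "second sample document text"),
  stringsAsFactors = FALSE
)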

data_corpus = Corpus(DataframeSource(data))

Custom function for n-gram tokenization:

ngram_tokenizer = function(x){
  temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
  return(temp)
}

Control list for DTM creation:

1-gram:

control_list_unigram = list(tokenize = "words",
                          removePunctuation = FALSE,
                          removeNumbers = FALSE, 
                          stopwords = stopwords("english"), 
                          tolower = T, 
                          stemming = T, 
                          weighting = function(x)
                            weightTf(x)
)

For N-gram tokenization:

control_list_ngram = list(tokenize = ngram_tokenizer,
                    removePunctuation = FALSE,
                    removeNumbers = FALSE, 
                    stopwords = stopwords("english"), 
                    tolower = T, 
                    stemming = T, 
                    weighting = function(x)
                      weightTf(x)
                    )


dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

dim(dtm_unigram)
dim(dtm_ngram)

The dimensions of both DTMs were the same.
Please correct me!

Recommended Answer

Unfortunately, tm has some quirks that are annoying and not always clear. First of all, tokenizing does not seem to work on corpora created with Corpus(); you need to use VCorpus() for this.

So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).

That is one issue tackled. Now the corpus will work for tokenizing, but you will run into an issue with tokenize_ngrams. You will get the following error:

Input must be a character vector of any length or a list of character
  vectors, each of which has a length of 1. 

when you run this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

To solve this, and to drop the dependency on the tokenizers package, you can use the following function to tokenize the data:

NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}
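As a quick sanity check, running the tokenizer on a hypothetical one-line document (not from the question's data) shows the kind of terms it produces: unigrams, bigrams, and trigrams joined with "_":

sample_doc <- PlainTextDocument("the quick brown fox")   # hypothetical sample text
NLP_tokenizer(sample_doc)
# returns terms such as "the", "quick", "brown", "fox",
# "the_quick", "quick_brown", "brown_fox",
# "the_quick_brown", "quick_brown_fox"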

This uses the ngrams function from the NLP package, which is loaded when you load the tm package. 1:3 tells it to create n-grams of 1 to 3 words. So your control_list_ngram should look like this:

control_list_ngram = list(tokenize = NLP_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE, 
                          stopwords = stopwords("english"), 
                          tolower = T, 
                          stemming = T, 
                          weighting = function(x)
                            weightTf(x)
                          )
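With the corpus built via VCorpus() and this control list, rebuilding the matrix should now produce a DTM with considerably more columns than the unigram one, the extra columns being the bigram and trigram terms:

dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

dim(dtm_unigram)
dim(dtm_ngram)   # should now be much wider than dtm_unigram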

Personally, I would use the quanteda package for all of this work, but for now this should help you.
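For completeness, here is a rough sketch of what an equivalent quanteda pipeline could look like; it is untested against your data and assumes the same `data` frame with "doc_id" and "text" columns and the same preprocessing choices as the control lists above:

library(quanteda)

# build a corpus straight from the data frame
corp <- corpus(data, docid_field = "doc_id", text_field = "text")

# tokenize, lowercase, drop English stopwords, stem
toks <- tokens(corp, remove_punct = FALSE, remove_numbers = FALSE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)

# unigrams through trigrams joined with "_", then the document-feature matrix
toks_ngram <- tokens_ngrams(toks, n = 1:3, concatenator = "_")
dfm_ngram  <- dfm(toks_ngram)
dim(dfm_ngram)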

