Compute ngrams for each row of text data in R
Question
I have a data column of the following format:
Text
Hello world
Hello
How are you today
I love stackoverflow
blah blah blahdy
I would like to compute the 3-grams for each row in this dataset, perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the ngrams for the entire column. How can I apply this function to each observation in my data separately?
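(For the textcnt() route asked about here, one way to get a separate result per row is to loop over the rows with lapply(). This is a sketch, assuming the tau package is installed and that method = "string" is the mode that counts word n-grams rather than character n-grams:)

```r
library(tau)

# The 'Text' column from the question
Text <- c("Hello world", "Hello", "How are you today",
          "I love stackoverflow", "blah blah blahdy")

# Apply textcnt() to each row separately instead of to the whole column;
# n = 3L requests trigrams, method = "string" works on words
trigrams_per_row <- lapply(Text, textcnt, method = "string", n = 3L)

# One element per row of the data; rows shorter than three words
# simply yield no trigrams
length(trigrams_per_row)
```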
Answer
Is this what you're after?
library("RWeka")
library("tm")

# Using Tyler's method of making the 'Text' object here,
# i.e. a character vector with one element per row of the data:
Text <- c("Hello world", "Hello", "How are you today",
          "I love stackoverflow", "blah blah blahdy")

# Tokenizer that produces word trigrams only (min = max = 3)
TrigramTokenizer <- function(x) NGramTokenizer(x,
                                               Weka_control(min = 3, max = 3))

# Each row becomes its own document, so each gets its own column of counts
tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)),
                          control = list(tokenize = TrigramTokenizer))
inspect(tdm)
A term-document matrix (4 terms, 5 documents)

Non-/sparse entries: 4/16
Sparsity           : 80%
Maximal term length: 20
Weighting          : term frequency (tf)

                      Docs
Terms                  1 2 3 4 5
  are you today        0 0 1 0 0
  blah blah blahdy     0 0 0 0 1
  how are you          0 0 1 0 0
  i love stackoverflow 0 0 0 1 0
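To work with the per-row counts as plain numbers, the term-document matrix can be converted with tm's standard as.matrix() method. A self-contained sketch of the same pipeline, ending with the conversion:

```r
library("RWeka")
library("tm")

Text <- c("Hello world", "Hello", "How are you today",
          "I love stackoverflow", "blah blah blahdy")

TrigramTokenizer <- function(x) NGramTokenizer(x,
                                               Weka_control(min = 3, max = 3))

tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)),
                          control = list(tokenize = TrigramTokenizer))

# Plain matrix: one row per trigram, one column per row of the data
m <- as.matrix(tdm)
m["how are you", ]  # count of this trigram in each document
```

Indexing a single term's row, as in the last line, gives its count in every document (i.e. every row of the original data), which makes it easy to feed the counts back into a data frame alongside the text.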