Compute ngrams for each row of text data in R


Problem description

I have a data column of the following format:

Text

Hello world  
Hello  
How are you today  
I love stackoverflow  
blah blah blahdy  

I would like to compute the 3-grams for each row in this dataset by perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the ngrams for the entire column. How can I apply this function to each observation in my data separately?
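For the `tau` approach specifically, the trick is to apply `textcnt()` to each row rather than to the whole column. A minimal sketch (assuming the `tau` package is installed; `method = "string"` counts word n-grams, and the variable names are illustrative):

```r
library(tau)

# The 'Text' column from the question
Text <- c("Hello world",
          "Hello",
          "How are you today",
          "I love stackoverflow",
          "blah blah blahdy")

# Apply textcnt() to each row separately: each list element
# holds the 3-gram counts for one row (empty if the row has
# fewer than three words)
trigrams_per_row <- lapply(Text, function(row)
  textcnt(row, method = "string", n = 3L, tolower = TRUE))

trigrams_per_row
```

Calling `textcnt(Text, ...)` on the whole vector pools all rows into one set of counts, which is why the question's attempt returned a single numeric vector; wrapping the call in `lapply()` keeps each observation separate.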

Answer

Is this what you want?

library("RWeka")
library("tm")

# Rebuild the 'Text' column from the question as a character vector
Text <- c("Hello world",
          "Hello",
          "How are you today",
          "I love stackoverflow",
          "blah blah blahdy")

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                Weka_control(min = 3, max = 3))
# One document per row of 'Text', so each column of the
# term-document matrix holds the trigram counts for one row
tdm <- TermDocumentMatrix(Corpus(VectorSource(Text)), 
                          control = list(tokenize = TrigramTokenizer))

inspect(tdm)

A term-document matrix (4 terms, 5 documents)

Non-/sparse entries: 4/16
Sparsity           : 80%
Maximal term length: 20 
Weighting          : term frequency (tf)

                      Docs
Terms                  1 2 3 4 5
  are you today        0 0 1 0 0
  blah blah blahdy     0 0 0 0 1
  how are you          0 0 1 0 0
  i love stackoverflow 0 0 0 1 0
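If RWeka (which needs a Java runtime) or tau is unavailable, the same per-row trigrams can be produced in base R by sliding a three-word window over each row. A dependency-free sketch; the helper name `trigrams` is illustrative:

```r
# The 'Text' column from the question
Text <- c("Hello world",
          "Hello",
          "How are you today",
          "I love stackoverflow",
          "blah blah blahdy")

# Return all word 3-grams of one row; rows with fewer than
# three words yield an empty character vector
trigrams <- function(row) {
  words <- strsplit(tolower(row), "[[:space:]]+")[[1]]
  if (length(words) < 3) return(character(0))
  sapply(seq_len(length(words) - 2),
         function(i) paste(words[i:(i + 2)], collapse = " "))
}

# One list element of trigrams per row
lapply(Text, trigrams)
```

This keeps each observation's n-grams separate by construction, matching what the question asked for, at the cost of tm's corpus bookkeeping.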

