How to select only a subset of corpus terms for TermDocumentMatrix creation in tm
Question
I have a huge corpus, and I'm interested only in the occurrences of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package so that only the terms I specify up front are counted and included?
I know I can subset the resulting TermDocumentMatrix of the corpus, but I want to avoid building the full term-document matrix in the first place because of memory constraints.
Recommended answer
You can modify a corpus to keep only the terms you want by building a custom transformation function. See the vignette for the tm package and the help for the content_transformer function for more information:
library(tm)
# Create a corpus from the text listed below
corp = VCorpus(VectorSource(doc))
# Custom function to keep only the terms in "pattern" and remove everything else
(f <- content_transformer(function(x, pattern)
regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))
(FYI, the second line of code just above is adapted from this SO answer.)
# The pattern we'll search for
keep = "sleep|dream|die"
# Run the transformation function using the pattern above
tm_map(corp, f, keep)[[1]]
Here's the result of running the transformation function:
<<PlainTextDocument (metadata: 7)>>
c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")
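The extraction step inside the transformer is plain base R, so it can be tried on a bare string without tm. The sketch below uses a made-up sentence (not part of the original answer) to show what regmatches() and gregexpr() return. It also shows one thing to watch for: the pattern matches substrings, so wrapping it in \b word boundaries is an optional tweak if only whole words should count.

```r
# Base R only; no tm needed to see what the transformer does to one string.
# The sentence below is a hypothetical example, not the answer's data.
x <- "To die, to sleep, perchance to Dream; the dreamer died"
keep <- "sleep|dream|die"

# gregexpr() finds every match position; regmatches() pulls out the matched text
hits <- regmatches(x, gregexpr(keep, x, perl = TRUE, ignore.case = TRUE))[[1]]
print(hits)

# Substrings match too ("dream" inside "dreamer", "die" inside "died").
# Wrapping the alternation in \\b...\\b restricts matches to whole words.
whole <- paste0("\\b(", keep, ")\\b")
hits_whole <- regmatches(x, gregexpr(whole, x, perl = TRUE, ignore.case = TRUE))[[1]]
print(hits_whole)
```

Whether the word-boundary version is preferable depends on whether inflected forms such as "died" should be counted under "die".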
Here's the original text I used to create the corpus:
doc = "To be, or not to be, that is the question—
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing, end them? To die, to sleep—
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub"
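The answer above stops at the filtered corpus. As a possible alternative for getting straight to the matrix, the control list accepted by TermDocumentMatrix has a dictionary option that restricts tabulation to a given character vector of terms. The sketch below is an assumption-laden illustration, not the answerer's method: the two-document corpus is hypothetical, and removePunctuation = TRUE is added on the assumption that tokens such as "die," would otherwise not match the dictionary entry "die".

```r
library(tm)

# Hypothetical two-document corpus, for illustration only
docs <- c("To die, to sleep, perchance to dream",
          "To be, or not to be, that is the question")
corp <- VCorpus(VectorSource(docs))

# Terms known up front; tokens are lowercased by default, so list them in lower case
terms <- c("sleep", "dream", "die")

# `dictionary` limits the tabulated terms to the ones listed;
# `removePunctuation` strips trailing commas and the like before counting
tdm <- TermDocumentMatrix(
  corp,
  control = list(dictionary = terms, removePunctuation = TRUE)
)
inspect(tdm)
```

Unlike the regular-expression approach, the dictionary matches whole tokens only, so "dreamer" would not be counted under "dream".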