Removing rows from Corpus with multiple documents

Question

I have 4000 text documents in a corpus. As part of data cleanup, I want to remove the row(s) containing a specific word from each document.

For example:

library(tm)
doc.corpus<-  VCorpus(DirSource("C:\\TextMining\\Prototype",pattern="*.txt",encoding= "UTF8",mode = "text"),readerControl=list(language="en"))

doc.corpus<- tm_map(doc.corpus, PlainTextDocument)

doc.corpus[[1]]

#PlainTextDocument
Metadata:  7
Content:  chars: 16542

    as.character(doc.corpus)[[1]]


$content


"Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities."
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation."

My problem is to remove the 2nd row, which contains the word "trademark", from this and all other documents. So far I have used grepl() to identify the rows and tried to exclude them with the kind of subsetting normally used on a data frame, which did not work:

corpus.copy <- doc.corpus
doc.corpus[[1]] <- corpus.copy[[1]][!grepl("trademark", as.character(corpus.copy[[1]]), ignore.case = TRUE), ]

As long as it works for the first document, I could easily use a for loop to apply it to all documents in the corpus.

Any hints or solutions are appreciated. As an alternative, I could convert the corpus to a data frame, remove the undesirable rows, and convert it back to a corpus. Thanks.

System info:
[1] "x86_64-w64-mingw32"
[1] "R version 3.1.0 (2014-04-10)"
[1] tm_0.6-2

Recommended answer

No need for a for loop, although it has long been a frustrating feature of tm that the texts are hard to access once they are in a corpus object.

I've interpreted what you mean by "row" as a document, so the example above is two "rows". If this is not the case, this answer needs to be (but can easily be) adjusted.

Try this:

txt <- c("Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities.",
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation.")

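library(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
# Use !grepl() rather than -grep(): negative indexing with an empty
# match set would drop every document instead of keeping them all.
keep <- !grepl("trademark", textVector, ignore.case = TRUE)
newCorp <- VCorpus(VectorSource(textVector[keep]))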

newCorp now excludes documents containing "trademark". Note that if you do not need plurals of this (e.g. "trademark")
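If "row" instead means a line inside each document, the same idea could be adapted roughly as follows. This is only a minimal sketch, not code from the original answer: drop_lines is a hypothetical helper name, and the sample text is made up for illustration. It assumes each document's content is either a vector of lines or a string with embedded newlines. The commented-out tm_map() call further assumes tm 0.6 or later, where content_transformer() is available.

```r
# Hypothetical helper: remove every line that matches `pattern`.
drop_lines <- function(x, pattern) {
  # Split on embedded newlines so each element is a single line.
  # (Empty strings are dropped by this step.)
  lines <- unlist(strsplit(x, "\n", fixed = TRUE))
  # Logical indexing keeps all lines when nothing matches.
  lines[!grepl(pattern, lines, ignore.case = TRUE)]
}

# Made-up sample document, one element per line.
doc <- c("Quick to deploy, easy to use.",
         "Microsoft is a U.S. registered trademark of Microsoft Corporation.",
         "Support for industry standards.")

cleaned <- drop_lines(doc, "trademark")
print(cleaned)

# With tm installed, the helper could be applied to every document
# (assumption: doc.corpus is the VCorpus from the question):
# library(tm)
# doc.corpus <- tm_map(doc.corpus, content_transformer(drop_lines),
#                      pattern = "trademark")
```

Working on the document content through content_transformer() keeps the corpus structure and metadata intact, so there is no need to round-trip through a data frame.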
