Remove ngrams with leading and trailing stopwords

Views: 74 | Published: 2021/9/6 19:42:18 | Tags: r, text-mining, tm, quanteda


Question

I want to identify the major n-grams in a collection of academic papers. N-grams with nested stopwords should be included, but n-grams with leading or trailing stopwords should not.

I have about 100 PDF files. I converted them to plain-text files with an Adobe batch command and collected them in a single directory. From there I use R. (The code is a patchwork because I'm just getting started with text mining.)

My code:

library(tm)
# Make path for sub-dir which contains corpus files 
path <- file.path(getwd(), "txt")
# Load corpus files
docs <- Corpus(DirSource(path), readerControl=list(reader=readPlain, language="en"))

#Cleaning
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)

# Merge corpus (Corpus class to character vector)
txt <- c(docs, recursive=T)

# Find trigrams (but I might look for other ngrams as well)
library(quanteda)
myDfm <- dfm(txt, ngrams = 3)
# Remove sparse features
myDfm <- dfm_trim(myDfm, min_count = 5)
# Display top features
topfeatures(myDfm)
#                   as_well_as             of_the_ecosystem                  in_order_to
#                          603                          543                          458
#         a_business_ecosystem       the_business_ecosystem strategic_management_journal
#                          431                          431                          359
#             in_the_ecosystem        academy_of_management                  the_role_of
#                          336                          311                          289
#                the_number_of
#                          276

For example, from the top n-grams shown here, I'd want to keep "academy_of_management", but not "as_well_as" or "the_role_of". I'd like the code to work for any n-gram (preferably including n-grams shorter than 3, although I understand that in that case it's simpler to just remove stopwords first).

Recommended answer

Here's how in quanteda: use dfm_remove() with regular-expression patterns built from the stopword list. For leading stopwords, the pattern is the stopword followed by the concatenator character, anchored at the beginning of the feature ("^the_"); for trailing stopwords, it is the concatenator followed by the stopword, anchored at the end ("_the$"). (Note that for reproducibility, I have used a built-in text object.)
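To see what these anchored patterns match before applying them to a dfm, here is a minimal base-R sketch. It uses the ten features from the question's topfeatures() output and a deliberately tiny, hypothetical stopword list (`sw`) standing in for stopwords("english"); it is only an illustration of the pattern logic, not the quanteda call itself.

```r
# Sample features taken from the question's topfeatures() output.
feats <- c("as_well_as", "of_the_ecosystem", "in_order_to",
           "a_business_ecosystem", "the_business_ecosystem",
           "strategic_management_journal", "in_the_ecosystem",
           "academy_of_management", "the_role_of", "the_number_of")

# Hypothetical, deliberately tiny stopword list, for illustration only.
sw <- c("a", "as", "in", "of", "the", "to")

# One pattern per stopword and position:
# leading (e.g. "^the_") and trailing (e.g. "_the$").
patterns <- c(paste0("^", sw, "_"), paste0("_", sw, "$"))

# An n-gram is dropped if any of the patterns matches it.
drop <- Reduce(`|`, lapply(patterns, grepl, x = feats))
kept <- feats[!drop]
print(kept)
```

Because the patterns are anchored at the start and end only, a stopword in the middle of an n-gram (the "of" in "academy_of_management") does not trigger removal.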

library("quanteda")

# drop this line to use your own txt
txt <- data_char_ukimmig2010

(myDfm <- dfm(txt, remove_numbers = TRUE, remove_punct = TRUE, ngrams = 3))
## Document-feature matrix of: 9 documents, 5,518 features (88.5% sparse).

(myDfm2 <- dfm_remove(myDfm, 
                     pattern = c(paste0("^", stopwords("english"), "_"), 
                                 paste0("_", stopwords("english"), "$")), 
                     valuetype = "regex"))
## Document-feature matrix of: 9 documents, 1,763 features (88.6% sparse).
head(featnames(myDfm2))
## [1] "immigration_an_unparalleled" "bnp_can_solve"               "solve_at_current"           
## [4] "immigration_and_birth"       "birth_rates_indigenous"      "rates_indigenous_british"
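The code above relies on the `ngrams` argument of dfm(), which, as far as I know, is not available in quanteda version 3 and later. A hedged sketch of the same idea for newer versions, assuming the tokens() / tokens_ngrams() / dfm() pipeline, is shown here. The helper name `dfm_ngrams_trimmed` and the one-sentence sample text are made up for illustration and are not part of quanteda.

```r
library("quanteda")

# Hypothetical helper (not part of quanteda): build a dfm of n-grams and
# drop every n-gram that starts or ends with a stopword.
dfm_ngrams_trimmed <- function(txt, n = 3, sw = stopwords("english")) {
  toks <- tokens(txt, remove_numbers = TRUE, remove_punct = TRUE)
  toks <- tokens_ngrams(toks, n = n, concatenator = "_")
  # Leading pattern: "^<stopword>_"; trailing pattern: "_<stopword>$".
  edge_patterns <- c(paste0("^", sw, "_"), paste0("_", sw, "$"))
  dfm_remove(dfm(toks), pattern = edge_patterns, valuetype = "regex")
}

# Made-up sample sentence, for illustration only.
txt <- c(d1 = "The role of the academy of management in the business ecosystem.")
feats <- featnames(dfm_ngrams_trimmed(txt, n = 3))
print(feats)
```

Passing a vector such as `n = 2:3` should cover several n-gram sizes at once. Unigrams contain no concatenator, so these patterns never match them; for `n = 1`, removing stopwords directly is the simpler route, as the question already notes.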

Bonus answer:

You can read your PDFs directly using the readtext package, whose output works with quanteda and the code above.

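library("readtext")
library("quanteda")
# readtext() returns a data.frame that corpus() accepts directly;
# nesting the calls avoids needing a package that provides %>%
txt <- corpus(readtext("yourpdfolder/*.pdf"))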
