Finding key phrases using the tm package in R


Question

I have a project that requires me to search the annual reports of various companies and find key phrases in them. I have converted the reports to text files, then created and cleaned a corpus. From that I created a document-term matrix. The tm_term_score function only seems to work for single words, not phrases. Is it possible to search the corpus for key phrases (not necessarily the most frequent ones)?

For example:

I want to see how many times the phrase "supply chain finance" occurs in each document in the corpus. However, when I run the code using tm_term_score, it reports that no documents contain the phrase, when in fact they do.

My progress so far is as follows:

library(tm)
library(stringr)

setwd("C:/Users/Desktop/Annual Reports")

dest<-"C:/Users/Desktop/Annual Reports"

# the reports are in English, so the language is "en" (the original had "lat", i.e. Latin)
a<-Corpus(DirSource("C:/Users/Desktop/Annual Reports"), readerControl = list(language = "en"))

a<-tm_map(a, removeNumbers)
a<-tm_map(a, removeWords, stopwords("english"))
a<-tm_map(a, removePunctuation)
a<-tm_map(a, stripWhitespace)

tokenizing.phrases<-c("supply growth","import revenues", "financing projects") 






I am quite new to R and cannot work out how to search my corpus for these key phrases.

Recommended answer

Perhaps something like the following will help you.

First, create an object with your key phrases, such as

tokenizing.phrases <- c("general counsel", "chief legal officer", "inside counsel", "in-house counsel",
                        "law department", "law dept", "legal department", "legal function",
                        "law firm", "law firms", "external counsel", "outside counsel",
                        "law suit", "law suits", # can be hyphenated, eg.
                        "accounts payable", "matter management")

Then use this function (perhaps with tweaks for your needs).

phraseTokenizer <- function(x) {
  require(stringr)

  # extract the plain text from the tm TextDocument object; join multi-line
  # documents into one string so the checks below see a single value
  x <- as.character(x)
  x <- str_trim(paste(x[!is.na(x)], collapse = " "))
  if (x == "") return("")
  # fixed(..., ignore_case = TRUE) replaces the deprecated ignore.case() helper
  phrase.hits <- str_detect(x, fixed(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # only split once on the first hit, so not to worry about multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    temp <- unlist(str_split(x, fixed(split.phrase, ignore_case = TRUE), 2))
    # recursive: the function calls itself on the text before and after the phrase
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    # no phrase found: fall back to tm's ordinary word tokenizer
    out <- MC_tokenizer(x)
  }

  # get rid of any extraneous empty strings, which can happen if a phrase occurs just before a punctuation
  out[out != ""]
}
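
To make the recursive splitting easier to follow, here is a minimal base-R sketch of the same idea. It is not the answer's function: `split_on_phrases`, the `phrases` vector, and the sample sentence are hypothetical, and plain whitespace splitting stands in for `MC_tokenizer`.

```r
# Hypothetical phrase list, for illustration only
phrases <- c("supply chain finance", "law firm")

# Split text into tokens while keeping any listed phrase as a single token
split_on_phrases <- function(x, phrases) {
  x <- trimws(x)
  if (!nzchar(x)) return(character(0))
  lower <- tolower(x)
  # start position of each phrase in the text (-1 when the phrase is absent)
  pos <- vapply(phrases, function(p) regexpr(p, lower, fixed = TRUE)[1], integer(1))
  if (all(pos < 0)) {
    # no phrase left in this piece: fall back to whitespace tokenization
    return(strsplit(x, "[[:space:]]+")[[1]])
  }
  hit <- which(pos > 0)[1]                 # first phrase in the list that occurs
  start <- pos[hit]
  end <- start + nchar(phrases[hit]) - 1
  # recurse on the text before and after the phrase
  c(split_on_phrases(substr(x, 1, start - 1), phrases),
    phrases[hit],
    split_on_phrases(substr(x, end + 1, nchar(x)), phrases))
}

tokens <- split_on_phrases("Our Supply Chain Finance program grew", phrases)
print(tokens)
```

The phrase comes back in the form it has in the phrase list, so differently cased occurrences in the text all map to one term.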

Then create your term document matrix with the phrases included in it.

# `corpus` is your cleaned corpus (named `a` in the question)
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
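
If all you need is how many times each phrase occurs per document, a direct count can also serve as a cross-check on the matrix. The following is a minimal sketch in base R and a different technique from the tokenizer above. It assumes the document texts are already available as a named character vector; `docs`, `count_phrase`, and the sample texts are hypothetical.

```r
# Hypothetical sample documents; in practice these would be the report texts
docs <- c(
  report_a = "Supply chain finance grew. We expanded supply chain finance abroad.",
  report_b = "No relevant programme was offered this year."
)

# Count non-overlapping, case-insensitive occurrences of a literal phrase
count_phrase <- function(texts, phrase) {
  hits <- gregexpr(tolower(phrase), tolower(texts), fixed = TRUE)
  # gregexpr() yields -1 when there is no match, so count positive positions only
  counts <- vapply(hits, function(m) sum(m > 0), integer(1))
  names(counts) <- names(texts)
  counts
}

result <- count_phrase(docs, "supply chain finance")
print(result)
```

Because the match is literal, any cleaning applied to the corpus (stop-word or punctuation removal, for instance) can change whether a phrase is still found, so it matters whether the count is run on the raw or the cleaned text.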
