R: Convert a "Term Document Matrix" to a "Corpus"

Problem Description

I am using the R programming language. I am trying to follow the instructions from this tutorial over here (https://cran.r-project.org/web/packages/tidytext/vignettes/tidying_casting.html) and learn how to convert a "term document matrix" into a "corpus". However, the explanations provided in this tutorial are unclear to me, and I am not sure how to do this.

Using publicly available Shakespeare Plays, I created the term document matrix as follows:

#load libraries
library(dplyr)
library(pdftools)
library(tidytext)
library(textrank)
library(tm)

#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_1 <- article_words %>%
  anti_join(stop_words, by = "word")

#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_2<- article_words %>%
  anti_join(stop_words, by = "word")


#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_3 <- article_words %>%
  anti_join(stop_words, by = "word")

From here, I create the actual "term document matrix":

library(tm)

#create term document matrix
tdm <- TermDocumentMatrix(Corpus(VectorSource(rbind(article_words_1, article_words_2, article_words_3))))

#inspect the "term document matrix" (I don't know why this is producing an error)
inspect(tdm)

Now, I am unsure how to use the instructions from this tutorial (https://cran.r-project.org/web/packages/tidytext/vignettes/tidying_casting.html) and convert the "Term Document Matrix" into a "Corpus".

Is this how it should be done?

library(quanteda)
d <- quanteda::dfm(tdm, verbose = FALSE)

Can someone please show me how to solve this problem?

Thanks

Answer

To answer the question you have in the comments of Ronak's answer.

You can't transform a tdm into a corpus, because a tdm has already aggregated the word counts per document and the order of the sentences is lost. Using quanteda you can perform several actions on the dfm, such as replacing words or removing stopwords. See the example below, based on the first Shakespeare text, which shows the process.

library(dplyr)
library(tidytext)
library(pdftools)

library(quanteda)

url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)

# added count to simulate a dtm
article_words <- article_sentences %>%
  unnest_tokens(word, sentence) %>% 
  group_by(sentence_id, word) %>% 
  summarise(count = n())

# cast into a quanteda dfm
my_dfm <- cast_dfm(article_words, sentence_id, word, count)

# using quanteda's dfm_remove to remove stopwords from a dfm.
my_dfm <- dfm_remove(my_dfm, stopwords())

With dfm_replace you can replace certain words that might contain mistakes or punctuation you do not want to keep. Afterwards you can use dfm_compress to combine features that end up with the same name into one feature.
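As a minimal sketch of that step (the replacement pairs below are made-up illustrations, assuming the my_dfm object created above):

```r
# Hypothetical example: map two archaic spellings onto a modern form.
# After the replacement, "thou" and "thee" both become "you", so the
# dfm temporarily contains duplicate feature names.
my_dfm <- dfm_replace(my_dfm,
                      pattern = c("thou", "thee"),
                      replacement = c("you", "you"))

# dfm_compress sums the counts of features that share the same name,
# collapsing the duplicates into a single "you" column.
my_dfm <- dfm_compress(my_dfm, margin = "features")
```

The same pattern works for stripping unwanted punctuation variants: replace them with a canonical form first, then compress.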

But you are better off trying to get the original data instead of starting from a tdm.
