R - 维基百科文章的自动分类 [英] R - Automatic categorization of Wikipedia articles

查看:96
本文介绍了R - 维基百科文章的自动分类的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在努力遵循这个示例 来自 Norbert Ryciak,我无法与他取得联系.

I have been trying to follow this example by Norbert Ryciak, whom I havent been able to get in touch with.

自从这篇文章写于 2014 年以来,R 中的一些内容发生了变化,所以我能够更新代码中的一些内容,但我卡在了最后一部分.

Since this article was written in 2014, some things in R have changed so I have been able to update some of those things in the code, but I got stuck in the last part.

这是我目前的工作代码:

Here is my Working code so far:

 library(tm)
 library(stringi)
 library(proxy)

 wiki <- "https://en.wikipedia.org/wiki/"

 titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral",  "Derivative",
  "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
  "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")

 articles <- character(length(titles))

 for (i in 1:length(titles)) {
   articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ")
  }

 docs <- Corpus(VectorSource(articles))

 docs[[1]]
 docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
 docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))
 docs4 <- tm_map(docs3, PlainTextDocument)
 docs5 <- tm_map(docs4, stripWhitespace)
 docs6 <- tm_map(docs5, removeWords, stopwords("english"))
 docs7 <- tm_map(docs6, removePunctuation)
 docs8 <- tm_map(docs7, content_transformer(tolower))
 docs8[[1]]

 docsTDM <- TermDocumentMatrix(docs8)
 docsTDM2 <- as.matrix(docsTDM)
 docsdissim <- dist(docsTDM2, method = "cosine")

但是我没能通过这部分:

But I havent been able to get pass this part:

 docsdissim2 <- as.matrix(docsdissim)
 rownames(docsdissim2) <- titles
 colnames(docsdissim2) <- titles
 docsdissim2
 h <- hclust(docsdissim, method = "ward.D")
 plot(h, labels = titles, sub = "")

我尝试直接运行hclust",然后我能够进行绘图,但没有任何可读的结果.

I tried to run the "hclust" directly, and then I was able to Plot, but nothing readable came out of it.

这是我得到的错误:

 rownames(docsdissim2) <- titles
 Error in `rownames<-`(`*tmp*`, value = c("Integral", "Riemann_integral",  : 
   length of 'dimnames' [1] not equal to array extent

另一个:

 plot(h, labels = titles, sub = "")
 Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : 
   invalid dendrogram input

有谁能帮我完成这个例子吗?

Is there anyone that could give me a hand to finish this example?

此致,

推荐答案

感谢 Norbert Ryciak(教程的作者),我能够解决这个问题.

I was able to solve this problem thanks to Norbert Ryciak (the author of the tutorial).

由于他使用了旧版本的tm"(当时可能是最新版本),因此与我使用的版本不兼容.

Since he used an older version of "tm" (which was probably the latest at the time) it was not compatible with the one I used.

解决方案是将docsTDM <- TermDocumentMatrix(docs8)"替换为docsTDM <- DocumentTermMatrix(docs8)".

The solution was to replace "docsTDM <- TermDocumentMatrix(docs8)" with "docsTDM <- DocumentTermMatrix(docs8)".

最后的代码:

 library(tm)
 library(stringi)
 library(proxy)

 wiki <- "https://en.wikipedia.org/wiki/"

 titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral",  "Derivative",
  "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
  "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")

 articles <- character(length(titles))

 for (i in 1:length(titles)) {
   articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col =     " ")
  }

 docs <- Corpus(VectorSource(articles))

 docs[[1]]
 docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
 docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))
 docs4 <- tm_map(docs3, PlainTextDocument)
 docs5 <- tm_map(docs4, stripWhitespace)
 docs6 <- tm_map(docs5, removeWords, stopwords("english"))
 docs7 <- tm_map(docs6, removePunctuation)
 docs8 <- tm_map(docs7, content_transformer(tolower))
 docs8[[1]]

 docsTDM <- DocumentTermMatrix(docs8)
 docsTDM2 <- as.matrix(docsTDM)
 docsdissim <- dist(docsTDM2, method = "cosine")

 docsdissim2 <- as.matrix(docsdissim)
 rownames(docsdissim2) <- titles
 colnames(docsdissim2) <- titles
 docsdissim2
 h <- hclust(docsdissim, method = "ward")
 plot(h, labels = titles, sub = "")

这篇关于R - 维基百科文章的自动分类的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆