Unable to pull text out of a scraped HTML page with R XML package


Question

I am trying to extract the body of new york times movie reviews in order to do some semantic analysis on them. Unfortunately my HTML+R+XML package skills are not enough to get the job done. I can use the XML output from the NYT movies API to get movie details, but I can't work out how to use either the article API or a straight webpage scrape, in order to get to the body of the review.

WORKING code to get the movie details:

library(RCurl)
nyt.x.url<-'http://api.nytimes.com/svc/movies/v2/reviews/search.xml?query=The+Hangover&api-key=YOUR-OWN-FREE-API-KEY-GOES-HERE'
nyt.x.out<-getURLContent(nyt.x.url,curl=getCurlHandle())
library(XML)
a <- xmlTreeParse(nyt.x.out, asText=TRUE)  # parse the fetched content, not the URL string
r <- xmlRoot(a)
# need to put the separate list items together into a matrix, before they can be turned into a dataframe
nyt.df <- as.data.frame(stringsAsFactors=FALSE,
                    matrix(c(as.character(r[[4]][[1]][[1]][[1]])[6],  # display name
                             as.character(r[[4]][[1]][[3]][[1]])[6],  # rating - agrees with rotten tomatoes, but not imdb
                             as.character(r[[4]][[1]][[4]][[1]])[6],  # is it a critics pick
                             as.character(r[[4]][[1]][[5]][[1]])[6],  # is it a thousand best
                             as.character(r[[4]][[1]][[11]][[1]])[6],  # opening date
                             as.character(r[[4]][[1]][[15]][[1]][[1]])[6]),  # this is really the URL....
                           nrow=1,
                           ncol=6))

# now apply the right names
colnames(nyt.df) <- c("Title","MPAA-Rating", "Critics.Pick", "Thousand.Best", "Release.Date", "Article.URL")

I would then use this dataframe of movie details to grab the review web page and try to grab the review text:

nyt.review.out<-getURLContent(as.character(nyt.df$Article.URL),curl=getCurlHandle())
a2 <- htmlTreeParse(nyt.review.out, asText=TRUE)

But I can't figure out how to get to the full text of the review. I run into the same issue when I try to use the JSON API for articles (the URL call to the API is below).

nyt.review.url <- 'http://api.nytimes.com/svc/search/v1/article?format=json&query=review+the+Hangover&begin_date=20090605&end_date=20090606&api-key=YOUR-OTHER-FREE-API-KEY-GOES-HERE'

Any help is greatly appreciated, but you will need to register for your own API keys (I have removed mine from the code)
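For what it's worth, a minimal sketch of parsing a v1-style JSON response with the rjson package (an assumption; RJSONIO would also work). Since a live call needs an API key, the inline string below is a made-up stand-in for the assumed response shape; note the v1 search API returns at most a snippet in the body field, never the full review, which is part of the problem here.

```r
# Hedged sketch: parse a v1-style JSON response with rjson (assumed package).
# 'sample.json' is a made-up stand-in for the real response shape.
library(rjson)

sample.json <- '{"results":[{"title":"The Hangover","body":"Snippet of the review..."}]}'
parsed <- fromJSON(sample.json)

# drill into the first result's (truncated) body field
parsed$results[[1]]$body  # "Snippet of the review..."
```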

Answer

I think this does what you want. There may be a way to do what you want directly from the API but I didn't investigate that.

# load package
library(XML)

# grab the review text from a New York Times movie page
grab_nyt_text <- function(u) {
  doc <- htmlParse(u)
  txt <- xpathSApply(doc, '//div[@class="articleBody"]//p', xmlValue)
  txt <- paste(txt, collapse = "\n")
  free(doc)
  return(txt)
}
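To see the XPath logic in isolation (no network or API key needed), the same expression can be run against an inline HTML string. The markup below is a minimal made-up example mimicking the structure the XPath targets, not a real NYT page:

```r
library(XML)

# made-up HTML with the same div class the function's XPath expects
html <- '<html><body><div class="articleBody"><p>First paragraph.</p><p>Second paragraph.</p></div></body></html>'
doc <- htmlParse(html, asText = TRUE)
txt <- xpathSApply(doc, '//div[@class="articleBody"]//p', xmlValue)
free(doc)
paste(txt, collapse = "\n")  # "First paragraph.\nSecond paragraph."
```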


###--- Main ---###

# Step 1: api URL
nyt.x.url <- 'http://api.nytimes.com/svc/movies/v2/reviews/search.xml?query=The+Hangover&api-key=YOUR-OWN-FREE-API-KEY-GOES-HERE'

# Step 2: Parse XML of webpage pointed to by URL
doc <- xmlParse(nyt.x.url)

# Step 3: Parse XML and extract some values using XPath expressions
df <- data.frame(display.title = xpathSApply(doc, "//results//display_title", xmlValue), 
                 critics.pick = xpathSApply(doc, "//results//critics_pick", xmlValue),
                 thousand.best = xpathSApply(doc, "//results//thousand_best", xmlValue),
                 opening.date = xpathSApply(doc, "//results//opening_date", xmlValue),
                 url = xpathSApply(doc, "//results//link[@type='article']/url", xmlValue),
                 stringsAsFactors=FALSE)

df
#         display.title critics.pick thousand.best opening.date                                                                                           url
#1         The Hangover            0             0   2009-06-05                                       http://movies.nytimes.com/2009/06/05/movies/05hang.html
#2 The Hangover Part II            0             0   2011-05-26 http://movies.nytimes.com/2011/05/26/movies/the-hangover-part-ii-3-men-and-a-monkey-baby.html

# Step 4: clean up - remove doc from memory
free(doc)

# Step 5: crawl article links and grab text
df$text <- sapply(df$url, grab_nyt_text)

# Step 6: inspect txt
cat(df$text[1])
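As a first step toward the semantic analysis the question mentions, a rough word-frequency pass in base R might look like the sketch below. The sample string is a placeholder; in practice you would pass df$text[1], and a real analysis would more likely use a package such as tm.

```r
# rough word-frequency sketch; 'review.text' is a placeholder string
review.text <- "What a wild ride this wild comedy is"
words <- tolower(unlist(strsplit(review.text, "[^A-Za-z']+")))
words <- words[nzchar(words)]          # drop empty tokens
freq <- sort(table(words), decreasing = TRUE)
freq[["wild"]]  # 2
```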

HTH

Tony Breyal

P.S. There's also an R package http://www.omegahat.org/RNYTimes but the website is down at the moment so I don't know what it's capable of.
