Scraping related pages in R


Question

I am trying to scrape data from several sister URLs for analysis. A previous thread Scraping a web page, links on a page, and forming a table with R was helpful in getting me on the right path with the following script:

rm(list = ls())
library(XML)
library(RCurl)

#=======2013========================================================================
url2013 = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url2013)

# dates and hrefs come from the archive list's <a> nodes;
# titles from the link_info nodes on the same page
dummy2013 <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  title = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue)
)

# follow each relative href to the full story page and pull the body text
dummy2013$text <- unlist(lapply(dummy2013$hrefs, function(x) {
  url.story <- gsub('/entity', 'http://www.who.int', x)
  xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
}))

# expand the relative hrefs into absolute links
dummy2013$link <- gsub('/entity', 'http://www.who.int', dummy2013$hrefs)

write.csv(dummy2013, "whoDON2013.csv")

However, applied to sister URLs, things break. Trying

#=======2011========================================================================
url2011 = 'http://www.who.int/csr/don/archive/year/2011/en/index.html'
doc <- htmlParse(url2011)
dummy2011 <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr,'href'),
  title = xpathSApply(doc, '//*[@class="link_info"]/text()',  xmlValue)
)

produces, for example:

## Error in data.frame(dates = xpathSApply(doc, "//*[@class=\"auto_archive\"]/li/a",  : 
  arguments imply differing number of rows: 59, 60
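
A quick way to see the mismatch is to count the two node sets separately. Continuing from the session above, a minimal check using the same XPath expressions:

doc <- htmlParse(url2011)

# count each node set on its own; unequal lengths are exactly what
# the data.frame() call is complaining about
length(xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue))
length(xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue))

On the 2011 page one query returns 59 nodes and the other 60, so the two lists fall out of step somewhere on the page.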

Similar errors occur for http://www.who.int/csr/don/archive/year/2008/en/index.html and http://www.who.int/csr/don/archive/year/2006/en/index.html. I'm not handy with HTML or XML; any ideas appreciated.

Answer

You can select the titles first, then find the href associated with each one. Pairing every title with the link that precedes it keeps the two columns aligned, which avoids the row-count mismatch:

require(XML)
url2011 = 'http://www.who.int/csr/don/archive/year/2011/en/index.html'
doc <- htmlParse(url2011)

# take the title nodes first, then step back to the <a> sibling that
# precedes each one, so every title comes with its own date and href
titleNodes <- getNodeSet(doc, '//*[@class="link_info"]')
hrefNodes <- sapply(titleNodes, getNodeSet, path = './preceding-sibling::a')

dummy2011 <- data.frame(
    dates = sapply(hrefNodes, xmlValue),
    hrefs = sapply(hrefNodes, xmlAttrs),
    title = sapply(titleNodes, xmlValue),
    stringsAsFactors = FALSE
)

Update:

To remove the duplicated values you can use

dummy2011 <- dummy2011[!duplicated(dummy2011$hrefs),]
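
Since the original goal was several sister URLs, the same approach can be wrapped in a function and applied year by year. A minimal sketch under that assumption (the scrapeYear helper and the year vector are illustrative, not part of the original answer):

require(XML)

# hypothetical helper: scrape one WHO archive year, pairing each title
# with its preceding link as above, then dropping duplicated hrefs
scrapeYear <- function(year) {
  url <- sprintf('http://www.who.int/csr/don/archive/year/%d/en/index.html', year)
  doc <- htmlParse(url)
  titleNodes <- getNodeSet(doc, '//*[@class="link_info"]')
  hrefNodes  <- sapply(titleNodes, getNodeSet, path = './preceding-sibling::a')
  out <- data.frame(
    dates = sapply(hrefNodes, xmlValue),
    hrefs = sapply(hrefNodes, xmlAttrs),
    title = sapply(titleNodes, xmlValue),
    stringsAsFactors = FALSE
  )
  out[!duplicated(out$hrefs), ]
}

# one data.frame per year, covering the pages mentioned in the question
results <- lapply(c(2006, 2008, 2011, 2013), scrapeYear)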
