Scraping a web page, links on a page, and forming a table with R


Problem description

Hello, I'm new to using R to scrape data from the Internet and, sadly, know little about HTML and XML. I'm trying to scrape each story link on the following parent page: http://www.who.int/csr/don/archive/year/2013/en/index.html. I don't care about any of the other links on the parent page, but I need to create a table with a row for each story URL and columns for the corresponding URL, the title of the story, the date (which always appears at the beginning of the first sentence after the story title), and then the rest of the text of the page (which can be several paragraphs).

I've tried to adapt the code from Scraping a wiki page for the "Periodic table" and all the links (and several related threads), but have run into difficulties. Any advice or pointers would be greatly appreciated. Here's what I've tried so far ("?????" marks where I get stuck):

rm(list=ls())
library(XML)
library(plyr) 

url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)

links = getNodeSet(doc, ?????)

df = ldply(doc, function(x) {
  text = xmlValue(x)
  if (text=='') text=NULL

  symbol = xmlGetAttr(x, '?????')
  link = xmlGetAttr(x, 'href')
  if (!is.null(text) & !is.null(symbol) & !is.null(link))
    data.frame(symbol, text, link)
} )

df = head(df, ?????)

Recommended answer

You can use xpathSApply (the sapply-style counterpart of xpathApply), which searches the document for nodes matching a given XPath expression.

library(XML)

url <- 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)

## each archive entry is an <a> inside the "auto_archive" list;
## the story title sits in the matching "link_info" text node
dat <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  story = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue),
  stringsAsFactors = FALSE)  ## assign to `dat` so the story text can be added below
dat

##               dates                                                hrefs
## 1      26 June 2013             /entity/csr/don/2013_06_26/en/index.html
## 2      23 June 2013             /entity/csr/don/2013_06_23/en/index.html
## 3      22 June 2013             /entity/csr/don/2013_06_22/en/index.html
## 4      17 June 2013             /entity/csr/don/2013_06_17/en/index.html

##                                                                                    story
## 1                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 2                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 3                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 4                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
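
The question also asks for a date column; the date strings above can be converted to Date class. This is a small aside beyond the original answer, and it assumes an English locale so month names such as "June" parse correctly:

dat$date <- as.Date(dat$dates, format = '%d %B %Y')  ## "26 June 2013" -> 2013-06-26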

Add the text of each story

dat$text <- unlist(lapply(dat$hrefs, function(x)
  {
    ## the archive hrefs are relative ('/entity/...'); rebuild the full URL
    url.story <- gsub('/entity', 'http://www.who.int', x)
    ## the story body sits in the element with id="primary"
    xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
  }))
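
As a hedged variation on the loop above (not part of the original answer), you can pause between requests and catch fetch errors, so a single failed page does not abort the whole run:

dat$text <- unlist(lapply(dat$hrefs, function(x)
  {
    url.story <- gsub('/entity', 'http://www.who.int', x)
    Sys.sleep(1)  ## be polite: wait a second between requests
    tryCatch(
      paste(xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue),
            collapse = '\n'),            ## collapse to one string per story
      error = function(e) NA_character_) ## record NA instead of stopping on failure
  }))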
