rvest table scraping including links

Question

I would like to scrape some table data from Wikipedia. Some of the table columns include links to other articles, which I'd like to preserve. I've tried the approach below, but it doesn't preserve the URLs. Looking at the description of the html_table() function, I didn't find any option for including them. Is there another package or way to do this?

library("rvest")

url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

simp <- url %>%
        read_html() %>%   # html() is deprecated in rvest; read_html() replaces it
        html_nodes(xpath='//*[@id="mw-content-text"]/table[3]') %>%
        html_table()

simp <- simp[[1]]

Recommended answer

Try this:

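library(XML)
library(httr)

url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
# Parse with the XML package so that readHTMLTable() accepts the document
# (recent httr versions return an xml2 document from content() by default).
doc <- htmlParse(content(GET(url), as = "text", encoding = "UTF-8"), asText = TRUE)

# Cell handler: a cell with a link becomes "URL | text"; any other cell keeps its plain text.
getHrefs <- function(node, encoding) {
  x <- xmlChildren(node)$a
  if (!is.null(x)) {
    paste0("http://", parseURI(url)$server, xmlGetAttr(x, "href"), " | ", xmlValue(x))
  } else {
    xmlValue(xmlChildren(node)$text)
  }
}
tab <- readHTMLTable(doc, which = 3, elFun = getHrefs)
head(tab[, 1:4])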
# No. in\nseries No. in\nseason                                                                                              Title                                                               Directed by
# 1              1              1 http://en.wikipedia.org/wiki/Simpsons_Roasting_on_an_Open_Fire | Simpsons Roasting on an Open Fire http://en.wikipedia.org/wiki/David_Silverman_(animator) | David Silverman
# 2              2              2                                     http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius                                                           David Silverman
# 3              3              3                    http://en.wikipedia.org/wiki/Homer%27s_Odyssey_(The_Simpsons) | Homer's Odyssey                      http://en.wikipedia.org/wiki/Wes_Archer | Wes Archer
# 4              4              4       http://en.wikipedia.org/wiki/There%27s_No_Disgrace_Like_Home | There's No Disgrace Like Home                    http://en.wikipedia.org/wiki/Gregg_Vanzo | Gregg Vanzo
# 5              5              5                                   http://en.wikipedia.org/wiki/Bart_the_General | Bart the General                                                           David Silverman
# 6              6              6                                           http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa                                                                Wes Archer

The URLs are preserved and separated from the link text by a pipe (|). You can then split the two apart, for example with strsplit(as.character(tab[, 3]), split = " | ", fixed = TRUE).
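As a minimal sketch of that splitting step, the snippet below turns a few "URL | text" strings into separate url and text columns. The sample values are copied from the output above. The names titles, parts, has_link and links are hypothetical and used only for illustration. It assumes the link text itself never contains " | ".

```r
# Sample cell values in the "URL | text" format shown above
# (`titles` stands in for as.character(tab[, 3]); the name is hypothetical).
titles <- c(
  "http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius",
  "http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa",
  "David Silverman"  # a cell without a link has no separator
)

# fixed = TRUE treats " | " literally rather than as a regular expression.
parts <- strsplit(titles, split = " | ", fixed = TRUE)

# Cells with a link split into c(url, text); cells without one stay as a single piece.
has_link <- lengths(parts) == 2

links <- data.frame(
  url  = ifelse(has_link, vapply(parts, `[`, character(1), 1), NA_character_),
  text = vapply(parts, function(p) p[length(p)], character(1)),
  stringsAsFactors = FALSE
)
links
```

Cells that carry no link end up with NA in the url column, so the plain-text entries in the "Directed by" column would still be handled.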
