rvest table scraping including links
Question
I would like to scrape some table data from Wikipedia. Some of the table columns contain links to other articles, and I'd like to preserve those links. I tried the approach below, but it did not keep the URLs. Looking at the documentation for html_table(), I didn't find any option for including them. Is there another package or way to do this?
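library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
simp <- url %>%
  read_html() %>%  # originally html(), which newer rvest releases replaced with read_html()
  html_nodes(xpath = '//*[@id="mw-content-text"]/table[3]') %>%
  html_table()     # returns the cell text only, so the link targets are lost
simp <- simp[[1]]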
Recommended answer
Try this:
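library(XML)
library(httr)
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
# Originally doc <- content(GET(url)). Recent httr versions return an xml2 document from content(),
# which readHTMLTable() does not accept, so the page text is parsed with XML::htmlParse() instead.
doc <- htmlParse(content(GET(url), as = "text", encoding = "UTF-8"), asText = TRUE)
# Cell handler: return "URL | text" for a cell that contains a link, otherwise the plain cell text.
getHrefs <- function(node, encoding) {
  x <- xmlChildren(node)$a
  if (!is.null(x)) {
    paste0("http://", parseURI(url)$server, xmlGetAttr(x, "href"), " | ", xmlValue(x))
  } else {
    xmlValue(xmlChildren(node)$text)
  }
}
tab <- readHTMLTable(doc, which = 3, elFun = getHrefs)
head(tab[, 1:4])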
# No. in\nseries No. in\nseason Title Directed by
# 1 1 1 http://en.wikipedia.org/wiki/Simpsons_Roasting_on_an_Open_Fire | Simpsons Roasting on an Open Fire http://en.wikipedia.org/wiki/David_Silverman_(animator) | David Silverman
# 2 2 2 http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius David Silverman
# 3 3 3 http://en.wikipedia.org/wiki/Homer%27s_Odyssey_(The_Simpsons) | Homer's Odyssey http://en.wikipedia.org/wiki/Wes_Archer | Wes Archer
# 4 4 4 http://en.wikipedia.org/wiki/There%27s_No_Disgrace_Like_Home | There's No Disgrace Like Home http://en.wikipedia.org/wiki/Gregg_Vanzo | Gregg Vanzo
# 5 5 5 http://en.wikipedia.org/wiki/Bart_the_General | Bart the General David Silverman
# 6 6 6 http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa Wes Archer
The URLs are preserved and separated from the link text by a pipe (|). You can then split each cell into its two parts, for example with strsplit(as.character(tab[, 3]), split = " | ", fixed = TRUE).
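The split can be sketched on a small stand-in for the Title column. The vector `title` and the data frame `links` below are hypothetical names with made-up rows in the "URL | text" format, not output from the real page, and the sketch assumes the link text itself never contains " | ".

```r
# Hypothetical column in the "URL | text" format produced by getHrefs.
title <- c(
  "http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius",
  "http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa",
  "Some Unlinked Title"
)

# fixed = TRUE treats " | " literally instead of as a regular expression.
parts <- strsplit(title, split = " | ", fixed = TRUE)

# Cells with a link split into two pieces; cells without one stay as one piece.
has_link <- lengths(parts) == 2

links <- data.frame(
  url  = ifelse(has_link, vapply(parts, `[`, character(1), 1), NA_character_),
  text = vapply(parts, function(p) p[length(p)], character(1)),
  stringsAsFactors = FALSE
)
```

With the real table, `title` would be as.character(tab[, 3]); the conversion matters because readHTMLTable() may return the column as a factor.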
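If staying within rvest is preferred, one possible alternative (not part of the answer above) is to read the text with html_table() and pull the href attributes separately with html_attr(). The sketch below runs on a tiny inline table that is a hypothetical miniature, not the real Wikipedia markup; the id `episodes`, the column position, and the `title_url` column name are assumptions for illustration.

```r
library(rvest)

# Hypothetical miniature of an episode table (not the real page markup).
page <- read_html('
<table id="episodes">
  <tr><th>No.</th><th>Title</th></tr>
  <tr><td>1</td><td><a href="/wiki/Bart_the_Genius">Bart the Genius</a></td></tr>
  <tr><td>2</td><td>Untitled</td></tr>
</table>')

tbl_node <- html_node(page, "table#episodes")

# html_table() keeps only the cell text.
episodes <- html_table(tbl_node)

# Second cell of every row that has <td> cells, which skips the header row.
title_cells <- html_nodes(tbl_node, xpath = ".//tr[td]/td[2]")

# First link in each cell; cells without an <a> yield NA.
hrefs <- html_attr(html_node(title_cells, "a"), "href")

# Wikipedia hrefs are site-relative, so prefix the host where a link exists.
episodes$title_url <- ifelse(is.na(hrefs), NA_character_,
                             paste0("http://en.wikipedia.org", hrefs))
```

Because html_node() returns one result per input cell, `hrefs` stays aligned with the table rows even when some cells have no link. On the real page the XPath would need adjusting to the actual table and column.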