Scraping Table From Sports Page - AdBlock Interfering
Question
I am trying to scrape tables from http://www.sports-reference.com/cbb/schools/duke/2010.html.
Using htmltab or XML, I have been able to scrape tables 1 through 3, either by integer reference (i.e. 1 for the first table, 2 for the second, etc.) or by XPath. However, I can't scrape tables 4, 5, or 6 using the same methods.
library(htmltab)
url <- "http://www.sports-reference.com/cbb/schools/duke/2010.html"
duketable1 <- htmltab(doc = url, which = 1) #Using number
duketable1 <- htmltab(doc = url, which = "//*[@id='all_roster']") #Using XPath
I cannot scrape table 6 (or 4 and 5) using the same framework.
duketable6 <- htmltab(doc = url, which = 6)
duketable6 <- htmltab(doc = url, which = "//*[@id='all_advanced']")
The same happens with XML (it only reads the first three tables):
library(XML)
url <- "http://www.sports-reference.com/cbb/schools/duke/2010.html"
tables <- readHTMLTable(url)
names(tables)
My best guess is that <div class="adblock"> is affecting something, but I have no idea how to get around it. Thanks in advance for any tips.
Recommended answer
If you look at the page source (in Chrome, for example, via view-source:http://www.sports-reference.com/cbb/schools/duke/2010.html), you will see that the later tables are commented out with <!--. HTML parsers skip comments, which is why only the first three tables are found.
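A quick way to confirm this kind of diagnosis without a browser is to count how many <table tags sit inside comment blocks. The following is only a minimal base R sketch: the `html` string is a made-up stand-in for the page source (not the real markup), and `count_tables` is a hypothetical helper introduced for illustration.

```r
# Hypothetical stand-in for the page source: one visible table, one inside a comment.
html <- paste0(
  "<div id='all_roster'><table id='roster'><tr><td>a</td></tr></table></div>",
  "<div id='all_advanced'><!--\n",
  "<table id='advanced'><tr><td>b</td></tr></table>\n",
  "--></div>"
)

# Extract every <!-- ... --> block; (?s) lets "." match newlines, ".*?" is non-greedy.
comments <- regmatches(html, gregexpr("(?s)<!--.*?-->", html, perl = TRUE))[[1]]

# Hypothetical helper: count "<table" occurrences across a character vector.
count_tables <- function(x) {
  sum(lengths(regmatches(x, gregexpr("<table", x, fixed = TRUE))))
}

total_tables     <- count_tables(html)       # all table openers in the source
commented_tables <- count_tables(comments)   # table openers hidden inside comments
visible_tables   <- total_tables - commented_tables
```

On the real page, `html` could be replaced by the text returned from the request; if `commented_tables` is greater than zero, the missing tables are hidden in comments rather than blocked by the adblock div.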
Just strip the comment opener and the tables can be read. Using rvest:
library(httr)
library(rvest)
# Fetch the raw page source
doc <- GET("http://www.sports-reference.com/cbb/schools/duke/2010.html")
content(doc, "text") %>%
  # Remove the comment openers so the hidden tables become regular markup
  gsub(pattern = "<!--\n", replacement = "", x = ., fixed = TRUE) %>%
  read_html() %>%
  html_nodes(".table_outer_container table") %>%
  html_table()
Appending %>% str(max.level = 1) gives:
List of 6
$ :'data.frame': 13 obs. of 5 variables:
$ :'data.frame': 4 obs. of 25 variables:
$ :'data.frame': 13 obs. of 23 variables:
$ :'data.frame': 13 obs. of 25 variables:
$ :'data.frame': 13 obs. of 23 variables:
$ :'data.frame': 13 obs. of 27 variables:
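An alternative to editing the raw text, not part of the answer above, is to let the parser find the comment nodes and then parse their contents as HTML. The sketch below assumes the xml2 and rvest packages are installed; the inline `html` string, its table ids, and its column names are invented for illustration and do not reproduce the real page.

```r
library(xml2)
library(rvest)

# Hypothetical page: one visible table and one table wrapped in a comment.
html <- "<html><body>
<div id='all_roster'><table id='roster'><tr><th>x</th></tr><tr><td>1</td></tr></table></div>
<div id='all_advanced'><!--
<table id='advanced'><tr><th>x</th><th>y</th></tr><tr><td>1</td><td>2</td></tr></table>
--></div>
</body></html>"

doc <- read_html(html)

# Tables the parser sees directly (the commented one is skipped).
visible <- html_table(html_nodes(doc, "table"))

# Collect the text of every comment node and re-parse it as HTML.
# Note: this assumes at least one comment exists; read_html() fails on an empty string.
comment_text <- xml_text(xml_find_all(doc, "//comment()"))
hidden_doc   <- read_html(paste(comment_text, collapse = "\n"))
hidden       <- html_table(html_nodes(hidden_doc, "table"))

length(visible)  # number of directly visible tables
length(hidden)   # number of tables recovered from comments
```

This leaves the original document untouched and does not depend on the comment opener being followed by a newline, at the cost of parsing the commented markup separately from the rest of the page.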