Scraping Table From Sports Page - AdBlock Interfering

Question

I am trying to scrape tables from http://www.sports-reference.com/cbb/schools/duke/2010.html.

Using htmltab or XML, I have been able to scrape tables 1 through 3 using the integer reference (i.e. 1 for the first table, 2 for the second, etc.) or the XPath. I can't scrape tables 4, 5, or 6 using the same methods, though.

library(htmltab)
url <- "http://www.sports-reference.com/cbb/schools/duke/2010.html"
duketable1 <- htmltab(doc = url, which = 1) #Using number
duketable1 <- htmltab(doc = url, which = "//*[@id='all_roster']") #Using XPath

I cannot scrape table 6 (or 4 and 5) using the same framework.

duketable6 <- htmltab(doc = url, which = 6)
duketable6 <- htmltab(doc = url, which = "//*[@id='all_advanced']")

The same happens with XML (it only reads the first three tables):

library(XML)
url <- "http://www.sports-reference.com/cbb/schools/duke/2010.html"
tables <- readHTMLTable(url)
names(tables)

My best guess is that <div class="adblock"> is affecting something, but I have no idea how to get around it. Thanks in advance for any tips.

Recommended answer

If you look at the source code (in Chrome, i.e. view-source:http://www.sports-reference.com/cbb/schools/duke/2010.html), you can see that the later tables are commented out with <!--. An HTML parser treats everything inside a comment as plain text rather than as elements, which is why htmltab and readHTMLTable only find the first three tables.
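To check this without reading the source by hand, you can count the tables the parser sees and compare that with the comment nodes that contain table markup. This is only a sketch on a made-up miniature page (the `html` string, the player name and the PER value are invented for illustration, not taken from the real site), using xml2's `xml_find_all()` and `xml_text()`:

```r
library(xml2)

# Hypothetical miniature page: one live table and one table wrapped in a comment
html <- '<html><body>
<div id="all_roster">
<table><tr><th>Player</th></tr><tr><td>A</td></tr></table>
</div>
<div id="all_advanced">
<!--
<table><tr><th>Player</th><th>PER</th></tr><tr><td>A</td><td>20.1</td></tr></table>
-->
</div>
</body></html>'

doc <- read_html(html)

# Tables that exist as real elements in the parsed document
n_visible <- length(xml_find_all(doc, "//table"))

# Comment nodes whose text contains table markup
comments <- xml_find_all(doc, "//comment()")
n_hidden <- sum(grepl("<table", xml_text(comments), fixed = TRUE))

cat("visible tables:", n_visible, "\n")
cat("comments holding a table:", n_hidden, "\n")
```

If the second count is above zero on the real page, the missing tables are sitting inside comments rather than being blocked by the adblock div.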

Just remove the comment openers from the raw page text before parsing it, and you can read the tables. Using rvest:

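library(httr)
library(rvest)

# Download the raw page source
doc <- GET("http://www.sports-reference.com/cbb/schools/duke/2010.html")

# Strip the comment openers so the hidden tables become regular markup,
# then parse the result and extract every table
content(doc, "text") %>% 
  gsub(pattern = "<!--\n", replacement = "", x = ., fixed = TRUE) %>% 
  read_html() %>% 
  html_nodes(".table_outer_container table") %>% 
  html_table()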

Appending %>% str(max.level = 1) results in:

List of 6
 $ :'data.frame':   13 obs. of  5 variables:
 $ :'data.frame':   4 obs. of  25 variables:
 $ :'data.frame':   13 obs. of  23 variables:
 $ :'data.frame':   13 obs. of  25 variables:
 $ :'data.frame':   13 obs. of  23 variables:
 $ :'data.frame':   13 obs. of  27 variables:
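A possible alternative to the string replacement, not part of the original answer, is to select the comment nodes themselves and parse their text as HTML. The sketch below assumes the same kind of markup and runs on an invented miniature page (the `html` string and its values are hypothetical), so it works offline:

```r
library(rvest)

# Hypothetical miniature page: one live table and one table wrapped in a comment
html <- '<html><body>
<div id="all_roster">
<table><tr><th>Player</th></tr><tr><td>A</td></tr></table>
</div>
<div id="all_advanced">
<!--
<table><tr><th>Player</th><th>PER</th></tr><tr><td>A</td><td>20.1</td></tr></table>
-->
</div>
</body></html>'

doc <- read_html(html)

# Take the text of every comment node, glue it together,
# and parse that text as an HTML document of its own
hidden <- doc %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()

str(hidden, max.level = 1)
```

This returns only the tables that were inside comments, so on a real page you would combine it with the tables read the normal way. It also avoids relying on the exact "<!--\n" pattern, at the cost of parsing any unrelated comments as well.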
