R Web Scraping with rvest and V8


Problem description

I am trying to use R to scrape the various tables on https://www.rotowire.com/football/player.php?id=4307. Because the page builds them with JavaScript, I have hit a few snags. I have installed the rvest and V8 libraries and tried to find the right nodes, but I am fairly sure I am not specifying the table nodes correctly. I checked with the website owners, and they are OK with people scraping their data.

The V8 webpage includes a snippet of example code to scrape email addresses. I tried to modify that code to suit my purposes.

#Loading both the required libraries
library(rvest)
library(V8)

link <- 'https://www.rotowire.com/football/player.php?id=4307'
emailjs <- read_html(link) %>% html_nodes('div') %>% html_nodes('basicStats') %>% html_text()

ct <- v8()
read_html(ct$eval(gsub('document.write','',emailjs))) %>% 
  html_text()

This did not work.

I have also tried:

emailjs <- read_html(link) %>% html_nodes('div') %>% html_nodes('script') %>% html_text()
ct <- v8()
read_html(ct$eval(gsub('document.write','',emailjs))) %>% 
   html_text()

As well as:

emailjs <- read_html(link) %>% html_nodes('div') %>% html_nodes('basicStats') %>% html_text()

The first chunk of code fails because I am incorrectly specifying the node, or at least that is what I think is the reason.

The second set of code pulls back everything; however, it gives the error below:

Error in context_eval(join(src), private$context) : 
  ReferenceError: window is not defined

If you look at the page source, the table starts on line 289 with:

<div id="basicStats" class="">

The HTML:

            <div class="p-page__middle-box">

<div id="basicStats-header" class="p-page__section-head is-stats">NFL Stats</div>
<div id="basicStats">
    <div class="table-load"><div class="table-load__inner"><div class="loader"></div>Loading NFL Stats...</div></div>    </div>
    <script async>
document.addEventListener('rw:pp-data-available', function(e){
    var defaultData = { 'basic': { 'body': [], 'footer': [] }};
    var data = (e.detail) ? e.detail : defaultData;
    var tableID = "basicStats";
    var playerID = "4307";
    var primaryStatCat = "Pass";

    var stats = {
    'pass': [
        { id: 'passComp', startOfGroup: true, header: [{ text: 'Passing', colspan: 6, }, 'COMP'], },
        { id: 'passAtt', header: ['', 'ATT'], },
        { id: 'passPct', header: ['', 'PCT'], },
        { id: 'passYds', header: ['', 'YDS'], },
        { id: 'passTD', header: ['', 'TD'], },
        { id: 'passInt', header: ['', 'INT'], },
    ],

Solution

The data is available if you use the same endpoint the page calls to update its content. It returns JSON with all the info.

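library(httr)
# Request the same AJAX endpoint the page calls to fill in its tables
r <- GET("https://www.rotowire.com/football/ajax/player-page-data.php?id=4307&pos=QB&team=GB&opp=")
# Parse the JSON body into nested R lists
json <- content(r, as = "parsed")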
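If you need more than one player, the request above could be wrapped in a small helper. This is only a sketch: the function name `fetch_player_data` and its arguments are hypothetical, and it assumes the endpoint keeps accepting the same `id`, `pos`, `team` and `opp` query parameters seen in the URL. It also assumes the jsonlite package is installed.

```r
library(httr)
library(jsonlite)

# Endpoint taken from the request above; the helper below is hypothetical
# and exists only for this sketch.
endpoint <- "https://www.rotowire.com/football/ajax/player-page-data.php"

fetch_player_data <- function(id, pos, team, opp = "") {
  # Pass the query string as a named list instead of pasting it by hand
  resp <- GET(endpoint, query = list(id = id, pos = pos, team = team, opp = opp))
  stop_for_status(resp)  # turn HTTP 4xx/5xx responses into an R error
  txt <- content(resp, as = "text", encoding = "UTF-8")
  fromJSON(txt)          # arrays of objects become data frames where possible
}

# Usage (needs network access):
# player <- fetch_player_data("4307", "QB", "GB")
# str(player, max.level = 2)
```

Using `fromJSON()` on the raw text, rather than `content(r, as = "parsed")`, tends to give data frames instead of deeply nested lists, which is usually easier to work with.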

Do what you want with the JSON. You can explore it in a JSON viewer or paste the URL into Firefox, which shows JSON responses as a browsable tree.
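As one possible way to work with the result, the sketch below turns a stats array into a data frame. The sample JSON is made up: it only assumes the `basic` / `body` / `footer` shape suggested by `defaultData` in the page script and reuses the column ids `passComp`, `passAtt` and `passYds` from the question. The `season` field and all numbers are placeholders, and the real payload may be laid out differently.

```r
library(jsonlite)

# Made-up sample in the shape hinted at by the page script:
# { basic: { body: [...], footer: [...] } }. Values are placeholders.
sample_json <- '{
  "basic": {
    "body": [
      {"season": "2016", "passComp": 10, "passAtt": 20, "passYds": 100},
      {"season": "2017", "passComp": 30, "passAtt": 40, "passYds": 300}
    ],
    "footer": []
  }
}'

parsed <- fromJSON(sample_json)

# An array of objects is simplified to a data frame, one row per object
stats <- parsed$basic$body

# Derive a completion percentage column from the two count columns
stats$passPct <- round(100 * stats$passComp / stats$passAtt, 1)
print(stats)
```

Check the actual structure first with `str(json, max.level = 2)` and adjust the field names to whatever the endpoint really returns.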


You can find that URL in the Network tab of your browser's developer tools.
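For context on why the rvest attempts in the question return nothing useful, here is a small sketch run against a fragment copied from the HTML quoted in the question. It illustrates two separate points: `'basicStats'` without a `#` is a tag selector rather than an id selector, and even the correct selector only finds the loading placeholder, because the table is built later by JavaScript.

```r
library(rvest)

# Fragment copied from the HTML quoted in the question
html <- '<div id="basicStats-header" class="p-page__section-head is-stats">NFL Stats</div>
<div id="basicStats">
  <div class="table-load"><div class="table-load__inner"><div class="loader"></div>Loading NFL Stats...</div></div>
</div>'

page <- read_html(html)

# "basicStats" looks for <basicStats> elements, so nothing matches
by_tag <- html_nodes(page, "basicStats")
print(length(by_tag))

# "#basicStats" selects by id and finds the div
by_id <- html_nodes(page, "#basicStats")
print(html_text(by_id, trim = TRUE))

# The static HTML holds only the placeholder, no <table> yet
print(length(html_nodes(by_id, "table")))
```

The `window is not defined` error appears to have a related cause: the V8 package provides a JavaScript engine without a browser environment, so scripts that expect `window` or `document` cannot run there as-is. Requesting the JSON endpoint sidesteps both problems.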
