How to use Google-apps-script to scrape data from a web page that is lazy loaded (via an API)?


Problem description

I'm trying to create an automated process using Google-apps-script for scraping price data from pages like this one:

https://www.barchart.com/stocks/quotes/$AVVN/price-history/historical

The challenging part is that the data on the web page is 'lazy loaded', so the 'traditional' scraping methods that I have used on other web pages don't work here.

I have considered other ways of solving this problem - but:

  • Barchart does not provide data for e.g. $AVVN via http://marketdata.websol.barchart.com/getHistory
  • I don't want to use the 'Download'-button - as this requires automated login.
  • ImportXML() does not work (it works for other tables on the web page, but not for the one I want).

I found a similar problem in the following post - that received a very detailed and informative reply from omegastripes: Open webpage, select all, copy into sheet

-but when I run my code:

function test(){
  var url = 'https://www.barchart.com/proxies/core-api/v1/historical/get?symbol=%24AVVN&fields=tradeTime.format(m%2Fd%2Fy)%2CopenPrice%2ChighPrice%2ClowPrice%2ClastPrice%2CpriceChange%2Cvolume%2CsymbolCode%2CsymbolType&startDate=2019-04-15&endDate=2019-07-15&type=eod&orderBy=tradeTime&orderDir=desc&limit=2000&meta=field.shortName%2Cfield.type%2Cfield.description&raw=1'; 
  var options = {
     "muteHttpExceptions": false
  };
  var response   = UrlFetchApp.fetch(url, options);   
  Logger.log(response);
}

-then I get the following error:

Request failed for https://www.barchart.com/proxies/core-api/v1/historical/get?symbol=%24AVVN&fields=tradeTime.format(m%2Fd%2Fy)%2CopenPrice%2ChighPrice%2ClowPrice%2ClastPrice%2CpriceChange%2Cvolume%2CsymbolCode%2CsymbolType&startDate=2019-04-15&endDate=2019-07-15&type=eod&orderBy=tradeTime&orderDir=desc&limit=2000&meta=field.shortName%2Cfield.type%2Cfield.description&raw=1 returned code 500. Truncated server response: <!doctype html> <html itemscope itemtype="http://schema.org/WebPage" lang="en"> <head> <meta charset="UTF-8" /> <meta name="viewport" content="wi... (use muteHttpExceptions option to examine full response) (line 57, file "DS#1")

This is basically the "Oops, something's wrong. Our apologies ... there seems to be a problem with this page." page that you also get if you paste the address into your browser.
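The error message itself suggests using the muteHttpExceptions option to examine the full response. A minimal sketch of that check follows. The helper name `inspectResponse` and the optional `fetcher` argument are assumptions made for illustration (the argument only exists so the logic can be exercised outside Apps Script); they are not part of the original code.

```javascript
// Hypothetical helper: fetch a URL without throwing on HTTP errors and
// return the status code plus the start of the body for inspection.
// `fetcher` is an illustrative assumption; in Apps Script it falls back to UrlFetchApp.
function inspectResponse(url, fetcher) {
  fetcher = fetcher || UrlFetchApp;
  // With muteHttpExceptions set to true, a 4xx/5xx reply is returned instead of thrown.
  var response = fetcher.fetch(url, { muteHttpExceptions: true });
  return {
    code: response.getResponseCode(),
    bodyStart: response.getContentText().slice(0, 500)
  };
}
```

Calling `Logger.log(JSON.stringify(inspectResponse(url)))` from the script editor should show the status code and the first part of the returned HTML.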

So my question is: how can data be scraped from this page, or has Barchart now successfully blocked this scraping option?

Solution

The only way I found to obtain the data was your workaround of getting the request URL from the console, but you additionally have to add the "x-xsrf-token" and "cookie" headers to the options when using the fetch() method [1].

You can get the "x-xsrf-token" and "cookie" request headers from the console as well. The only problem is that the cookies and xsrf-token are valid for up to 2 hours, because the site implements cross-site request forgery protection [2].
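Because the token and cookie expire, it may help to keep them out of the code so they can be replaced without editing the script. A minimal sketch, assuming the values are stored as script properties under the hypothetical keys `BARCHART_XSRF_TOKEN` and `BARCHART_COOKIE` (the helper name and the optional `props` argument are also assumptions for illustration):

```javascript
// Hypothetical helper: build the request headers from stored values.
// The property keys are illustrative assumptions, not names used by Barchart.
// `props` is only for testing; in Apps Script it falls back to the script properties.
function buildAuthHeaders(props) {
  props = props || PropertiesService.getScriptProperties();
  var token = props.getProperty('BARCHART_XSRF_TOKEN');
  var cookie = props.getProperty('BARCHART_COOKIE');
  // getProperty returns null for a missing key, so fail early with a clear message.
  if (!token || !cookie) {
    throw new Error('Missing token or cookie; copy fresh values from the browser console.');
  }
  return { 'x-xsrf-token': token, 'cookie': cookie };
}
```

The returned object can be passed as the "headers" option of fetch(). When the request starts failing again, only the two stored values need to be refreshed.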

Here is the code I tested, which worked:

function testFunction() {
  // Request URL copied from the browser console.
  var url = 'https://www.barchart.com/proxies/core-api/v1/historical/get?symbol=%24AVVN&fields=tradeTime.format(m%2Fd%2Fy)%2CopenPrice%2ChighPrice%2ClowPrice%2ClastPrice%2CpriceChange%2Cvolume%2CsymbolCode%2CsymbolType&startDate=2019-04-16&endDate=2019-07-16&type=eod&orderBy=tradeTime&orderDir=desc&limit=2000&meta=field.shortName%2Cfield.type%2Cfield.description&raw=1';

  // Request headers copied from the console as well; replace XXXXX with current values.
  var map = {
    "x-xsrf-token": "XXXXX",
    "cookie": "XXXXX"
  };

  var options = {
    "method": "get",
    "muteHttpExceptions": false,
    "headers": map
  };
  var response = UrlFetchApp.fetch(url, options);
  Logger.log(response);

  // Parse the response body as JSON and log the first data row.
  var json = JSON.parse(response.getContentText());
  Logger.log(json.data[0]);
}
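If the symbol or date range needs to change between runs, the request URL can be assembled from its parts instead of being pasted in. The sketch below uses a hypothetical helper name, `buildHistoricalUrl`, and simply reproduces the query string of the URL above with `encodeURIComponent`; it assumes the endpoint keeps accepting the same parameters.

```javascript
// Hypothetical helper: rebuild the request URL used above from a symbol
// and a date range (dates as 'YYYY-MM-DD' strings).
function buildHistoricalUrl(symbol, startDate, endDate) {
  var base = 'https://www.barchart.com/proxies/core-api/v1/historical/get';
  // Parameter names and values are taken from the URL in the code above.
  var params = [
    ['symbol', symbol],
    ['fields', 'tradeTime.format(m/d/y),openPrice,highPrice,lowPrice,lastPrice,priceChange,volume,symbolCode,symbolType'],
    ['startDate', startDate],
    ['endDate', endDate],
    ['type', 'eod'],
    ['orderBy', 'tradeTime'],
    ['orderDir', 'desc'],
    ['limit', '2000'],
    ['meta', 'field.shortName,field.type,field.description'],
    ['raw', '1']
  ];
  // encodeURIComponent turns '$' into %24, ',' into %2C and '/' into %2F.
  var query = params.map(function (pair) {
    return pair[0] + '=' + encodeURIComponent(pair[1]);
  }).join('&');
  return base + '?' + query;
}
```

For example, `buildHistoricalUrl('$AVVN', '2019-04-16', '2019-07-16')` yields the same URL as the one hard-coded in testFunction().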

[1] https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app

[2] Difference between CSRF and X-CSRF-Token
