Web scraping with Google Apps Script

Question

I'm trying to pull data from the following sample web page using Google Apps Script:

url = http://www.premierleague.com/players/2064/Wayne-Rooney/stats?se=54

using UrlFetchApp.fetch(url)

The problem is that when I use UrlFetchApp.fetch(url), I don't get the page information defined by the 'se' parameter in the URL. Instead, I get the information from the following URL, apparently because the 'se=54' content is loaded asynchronously: http://www.premierleague.com/players/2064/Wayne-Rooney/stats

Is there any way to pass the 'se' parameter some other way? I was looking at the function, and it accepts 'options', as they are referred to, but the documentation on the topic is very limited.

Any help would be most appreciated. Many thanks,

Tommy

Recommended answer

Go to that website in your browser and open the developer tools (F12 or Ctrl+Shift+I). Click the Network tab and reload the page with F5. A list of requests will appear. At the bottom of the list you should see the asynchronous requests made to fetch the information. Those requests get the data as JSON from footballapi.pulselive.com. You can do the same thing in Apps Script, but you have to send a correct "Origin" header or your request gets rejected. Here is an example.

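function fetchData() {
  // JSON endpoint that the page itself calls asynchronously (found via the Network tab).
  var url = "http://footballapi.pulselive.com/football/stats/player/2064?comps=1";
  // The API rejects requests that do not carry the expected Origin header.
  var options = {
    "headers": {
      "Origin": "http://www.premierleague.com"
    }
  };
  var json = JSON.parse(UrlFetchApp.fetch(url, options).getContentText());
  // Log only the stats entry named "goals".
  for (var i = 0; i < json.stats.length; i++) {
    if (json.stats[i].name === "goals") Logger.log(json.stats[i]);
  }
}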
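
Since the question mentions that the 'options' argument is thinly documented, here is a slightly more defensive sketch of the same request. The helper names buildOptions, findStat and fetchStat are hypothetical (they are not part of Apps Script), and the only thing assumed about the response is the stats array with name fields used in the example above. The muteHttpExceptions option makes UrlFetchApp.fetch return the response on a 4xx/5xx status instead of throwing, so the status can be checked with getResponseCode().

```javascript
// Hypothetical helper: build the second argument for UrlFetchApp.fetch(url, params).
function buildOptions(origin) {
  return {
    method: "get",
    headers: { Origin: origin },
    // Return the response on HTTP errors instead of throwing an exception.
    muteHttpExceptions: true
  };
}

// Hypothetical helper: return the first entry in data.stats whose name matches,
// or null if there is no such entry (or no stats array at all).
function findStat(data, name) {
  var stats = (data && data.stats) || [];
  for (var i = 0; i < stats.length; i++) {
    if (stats[i].name === name) return stats[i];
  }
  return null;
}

// Fetch the JSON document at url and return the named stat entry, or null.
// Runs only inside Apps Script, where UrlFetchApp and Logger are defined.
function fetchStat(url, name) {
  var response = UrlFetchApp.fetch(url, buildOptions("http://www.premierleague.com"));
  if (response.getResponseCode() !== 200) {
    Logger.log("Request failed with status " + response.getResponseCode());
    return null;
  }
  return findStat(JSON.parse(response.getContentText()), name);
}
```

With this in place, fetchStat("http://footballapi.pulselive.com/football/stats/player/2064?comps=1", "goals") would perform the same lookup as fetchData above, returning the entry instead of logging it.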
