How to scrape data using Ruby which is generated by a JavaScript function?


Question

I am trying to scrape the data URL link for the latest date, which is the first row of the table, from this page. The content of the table seems to be generated by a JavaScript function.

I tried using Nokogiri to get it, but Nokogiri cannot execute JavaScript. Then I tried to get only the script part with Nokogiri:

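require 'open-uri'
require 'nokogiri'

url = "http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data"
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open(url))
js = doc.css("script").text  # concatenated text of every <script> element
puts js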

In the output I found the table I wanted, with the class name sgxTableGrid. The problem is that the JavaScript function gives no clue about the data URL link, and everything is generated dynamically.

Does someone know a better way of approaching this problem?

Recommended answer

Looking through the HTML for that page, the table is generated from JSON received as the result of a JavaScript request.

You can figure out what's going on by tracing backwards through the source code of the page. Here's some of what you'll need if you want to retrieve the JSON outside of their JavaScript. There'll still be work needed to actually do something with it:

  1. Starting with this code:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open('http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data'))
scripts = doc.css('script').map(&:text)

puts scripts.select{ |s| s['sgxTableGrid'] }

Look at the text output in an editor. Search for sgxTableGrid. You'll see a line like:

var tableHeader =  "<table width='100%' class='sgxTableGrid'>"

Look down a little farther and you'll see:

var totalRows = data.items.length - 1;

data comes from the parameter of the function being called, so that's where we start.

Take a unique part of the containing function's name, loadGridns_, and search for it. Each time you find it, look for the parameter data, then look to see where data is defined. If it's passed into that function, search to see what calls it. Repeat that process until you find that the variable isn't passed into the function. At that point you'll know you're in the function that creates it.
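The search described above can also be done in Ruby instead of an editor. This is only a minimal sketch: the helper names function_names_containing and lines_mentioning are made up for illustration, the regular expression only catches plain `function name(` declarations, and the sample script text is invented to stand in for the page's real JavaScript.

```ruby
# Hypothetical helper: names of declared JavaScript functions whose name
# contains the given fragment (for example 'loadGridns_').
def function_names_containing(js_source, fragment)
  pattern = /function\s+([A-Za-z_$][\w$]*)\s*\(/
  js_source.scan(pattern).flatten.select { |name| name.include?(fragment) }.uniq
end

# Hypothetical helper: every line that mentions a name, so that the
# definition and its call sites can be inspected by eye.
def lines_mentioning(js_source, name)
  js_source.each_line.map(&:strip).select { |line| line.include?(name) }
end

# Invented sample text; the real function names carry a longer suffix.
sample = <<~JS
  function loadGridns_ABC_(data) { var totalRows = data.items.length - 1; }
  function loadGridDatans_ABC_(url) { loadGridns_ABC_(response); }
JS

puts function_names_containing(sample, 'loadGridns_')
puts lines_mentioning(sample, 'loadGridns_ABC_')
```

On the real page, js_source would be the joined text of the script elements collected with Nokogiri in step 1.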

I found myself in a function whose name starts with loadGridDatans, where it's part of a block that does an xhrPost call to retrieve a URL. That URL is the target you're after, so grab the name of the containing function and trace through the calls where the URL is passed in, like you did in the step above.

That search ended up on a line that looks like:

var url = viewByDailyns_7_2AA4H0C090FIE0I1OH2JFH20K1_...

  • At that point you can start reconstructing the URL you need. Open a JavaScript debugger, like Firebug, and put a breakpoint on that line. Reload the page, and JavaScript should stop executing at that line. Single-step, or set further breakpoints, and watch the url variable being built until it's in its final form. You then have something you can use with OpenURI, which should retrieve the JSON you want.

    Note that their function names might be generated dynamically. I didn't check, so trying to use the full name of a function might fail.

    They might also be serializing a datetime stamp or a session key into the function names to make them unique or more opaque, which they could be doing for a number of reasons.

    Even though it's a pain to take this stuff apart, it's also a good lesson in how dynamic pages work.
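Once the final URL is known, the retrieval step might look roughly like the sketch below. It rests on several assumptions: fetch_body and first_item are made-up helper names, the only thing known about the payload is the items array (from data.items in the page's JavaScript), the sample payload and its date field are invented, and treating the first item as the latest date is a guess. Since the page uses xhrPost, the server may also expect a POST rather than the GET that OpenURI sends.

```ruby
require 'json'
require 'open-uri'

# Hypothetical helper: fetch the raw response body with OpenURI (a GET request).
# The URL has to be the one reconstructed in the debugger.
def fetch_body(url)
  URI.open(url, &:read)
end

# Hypothetical helper: parse the JSON text and return the first entry of "items".
# "items" comes from `data.items` in the page's JavaScript; that the first
# entry is the latest date is an assumption.
def first_item(json_text)
  data = JSON.parse(json_text)
  data.fetch('items').first
end

# Invented payload for illustration only; the real field names are unknown.
sample = '{"items": [{"date": "latest"}, {"date": "older"}]}'
p first_item(sample)

# With a real URL (reconstructed_url is a placeholder):
#   p first_item(fetch_body(reconstructed_url))
```

If the response turns out not to be strict JSON, JSON.parse raises JSON::ParserError, and the body would need cleaning up first.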
