How can Perl's WWW::Mechanize expand HTML pages that add to themselves with JavaScript?


Question


As mentioned in a previous question, I'm coding a crawler for the QuakeLive website.
I've been using WWW::Mechanize to get the web content, and this worked fine for every page except the one listing matches. The problem is that I need to get all the IDs of this kind:

<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">

These are used to build the URLs of specific matches, but I simply can't get at them.
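For reference, once the rendered HTML is somehow in hand (which is exactly the hard part here), pulling these IDs out is straightforward. A minimal sketch; the sample HTML is a stand-in for what a real browser would return, and a proper HTML parser would be more robust than a regex:

```perl
use strict;
use warnings;

# Stand-in for the fully rendered page, i.e. what the DOM looks like
# *after* the site's JavaScript has run.
my $html = <<'HTML';
<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">
<div id="unrelated" class="somethingElse">
HTML

# Collect the ids of all divs carrying the "areaMapC" class.
my @ids;
while ($html =~ /<div\s+id="([^"]+)"\s+class="areaMapC"/g) {
    push @ids, $1;
}
print "$_\n" for @ids;
```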

I managed to see those IDs only via FireBug, and no page downloader, parser, or getter I tried was able to help here. All I can get is a simpler version of the page, whose code is what you see via "View Source" in Firefox.

Since FireBug shows the IDs I can safely assume they are already loaded, but then I can't understand why nothing else gets them. It might have something to do with JavaScript.

You can find a page example HERE

Solution

To get at the DOM containing those IDs, you'll probably have to execute the JavaScript code on that site. I'm not aware of any library that would let you do that and then introspect the resulting DOM from within Perl, so controlling an actual browser and later asking it for the DOM, or only the parts of it you need, seems like a good way to go about this.

Various browsers provide ways to be controlled programmatically. With a Mozilla-based browser such as Firefox, this could be as easy as loading mozrepl into the browser, opening a socket from Perl space, sending a few lines of JavaScript code over to actually load that page, and then some more JavaScript code to hand you back the parts of the DOM you're interested in. You could then parse the result with one of the many JSON modules on CPAN.
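A minimal sketch of that route, assuming mozrepl is already loaded in Firefox and listening on its default port (localhost:4242), and that the match page is open in the active tab. To keep the parsing simple this sketch has the browser join the ids with commas instead of returning JSON; the exact prompt and echo format of mozrepl may differ, so the reply handling here is a simplification:

```perl
use strict;
use warnings;
use IO::Socket::INET;

# JavaScript to run inside the browser: collect the ids of all
# divs with class "areaMapC" into one comma-separated string.
my $js = q{Array.prototype.map.call(}
       . q{content.document.querySelectorAll('div.areaMapC'),}
       . q{function (d) { return d.id; }).join(',');};

# mozrepl echoes the expression's value as a quoted string before the
# next prompt; strip the quotes and split the ids back out.
sub ids_from_repl_reply {
    my ($reply) = @_;
    my ($csv) = $reply =~ /"([^"]*)"/s;
    return defined $csv ? split(/,/, $csv) : ();
}

my $sock = IO::Socket::INET->new(
    PeerAddr => 'localhost',
    PeerPort => 4242,          # mozrepl's default port
);
if ($sock) {
    print $sock "$js\n";
    my $reply = '';
    while (my $line = <$sock>) {
        $reply .= $line;
        last if $reply =~ /repl\d*>\s*$/;   # stop at the next prompt
    }
    print "$_\n" for ids_from_repl_reply($reply);
}
else {
    warn "no mozrepl on localhost:4242 -- start Firefox with mozrepl first\n";
}
```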

Alternatively, you could work through the JavaScript code executed on your page, figure out what it actually does, and then mimic that in your crawler.
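That second route usually means watching FireBug's Net panel while the match list loads, grabbing the XHR URL the page's JavaScript requests, and fetching that URL directly. A sketch under stated assumptions: the match-URL scheme and the `public_id` field name below are hypothetical placeholders, and the real XHR URL (found via FireBug) is passed on the command line:

```perl
use strict;
use warnings;
use JSON::PP;

# ASSUMPTION: the per-match URL scheme is invented for illustration;
# check the links the site actually builds before relying on it.
sub match_url {
    my ($id) = @_;
    return "http://www.quakelive.com/#!match/$id";
}

# Pass the XHR URL you found in FireBug's Net panel as the first
# argument. The 'public_id' field name is likewise an assumption
# about the shape of the JSON the endpoint returns.
if (@ARGV) {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get( $ARGV[0] );
    my $data = decode_json( $mech->content );
    print match_url( $_->{public_id} ), "\n" for @$data;
}
```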

