How can Perl's WWW::Mechanize expand HTML pages that add to themselves with JavaScript?

Problem description

As mentioned in a previous question, I'm coding a crawler for the QuakeLive website.
I've been using WWW::Mechanize to get the web content, and this worked fine for all the pages except the one with matches. The problem is that I need to get all IDs of this kind:

<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">

These are used to build the URLs of specific matches, but I simply can't get at them.
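Presumably the failing attempt looked something like this minimal sketch (the page URL is passed on the command line; the regex just looks for the areaMapC divs). Against this page it comes back empty, because those divs are injected by JavaScript after the initial HTML is delivered:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical reconstruction of the fetch-and-grep approach described above.
my $url = shift @ARGV or die "usage: $0 <match-page-url>\n";

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get($url);

# This finds nothing here: the divs are added by JavaScript after the page
# loads, so they never appear in the HTML that Mechanize downloads.
my @ids = $mech->content =~ /<div id="([^"]+)" class="areaMapC"/g;
print "$_\n" for @ids;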

I managed to see those IDs only via FireBug; no page downloader, parser, or getter I tried was able to help here. All I can get is a simpler version of the page, whose code is what you see by viewing the source in Firefox.

Since FireBug shows the IDs, I can safely assume they are already loaded, but then I can't understand why nothing else gets them. It might have something to do with JavaScript.

You can find an example page HERE.

Recommended answer

To get at the DOM containing those IDs, you'll probably have to execute the JavaScript code on that site. I'm not aware of any library that would let you do that and then introspect the resulting DOM from within Perl, so controlling an actual browser and later asking it for the DOM, or just the parts of it you need, seems like a good way to go about this.

Various browsers provide ways to be controlled programmatically. With a Mozilla-based browser such as Firefox, this could be as easy as loading mozrepl into the browser, opening a socket from Perl space, sending over a few lines of JavaScript code to actually load that page, and then some more JavaScript code to hand back the parts of the DOM you're interested in. You could then parse the result with one of the many JSON modules on CPAN.
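As a concrete illustration of that workflow (not mentioned in the answer itself): the CPAN module WWW::Mechanize::Firefox drives a running Firefox through mozrepl and queries the rendered DOM, so a sketch along these lines should be able to see the JavaScript-generated divs. The fixed sleep and the property access are assumptions, not something tested against QuakeLive:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;   # talks to Firefox over the mozrepl extension

# Assumed setup: Firefox is already running with mozrepl listening on its default port.
my $url = shift @ARGV or die "usage: $0 <match-page-url>\n";

my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);

# Give the page's JavaScript a moment to build the match list;
# a fixed sleep is crude, but enough for a sketch.
sleep 5;

# selector() queries the DOM as the browser sees it, i.e. after the
# JavaScript has run, so the areaMapC divs are visible here.
for my $div ( $mech->selector('div.areaMapC') ) {
    print $div->{id}, "\n";   # e.g. ffa_c14065c8-d433-11df-a920-001a6433f796_50498929
}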

Alternatively, you could work through the JavaScript code executed on your page, figure out what it actually does, and then mimic that in your crawler.
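For example, if FireBug's Net panel shows that the page's JavaScript fetches the match data from some XHR endpoint (the URL below is purely illustrative and passed on the command line), the crawler could request that endpoint directly and decode the response itself:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use JSON;
use Data::Dumper;

# The endpoint is whatever URL FireBug's Net panel shows the page
# requesting for its match data; pass it on the command line.
my $endpoint = shift @ARGV or die "usage: $0 <json-endpoint-url>\n";

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get($endpoint);

# Decode the same JSON the page's JavaScript consumes, then inspect it
# to find where the match IDs live before writing the real extraction.
my $data = decode_json( $mech->content );
print Dumper($data);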
