How can I use Perl to grab text from a web page that is dynamically generated with JavaScript?


Problem description

There is a website I am trying to pull information from in Perl. However, the section of the page I need is generated by JavaScript, so all you see in the source is:

<div id="results"></div>

I need to somehow pull out the contents of that div and save it to a file using Perl, a proxy, or whatever else works. For example, the information I want to save would be

document.getElementById('results').innerHTML;

I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx source dump for other pages, but since I can't screen scrape this page in a straightforward way, I came here to ask about it!

If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about ConsumerOPR.

Solution

You'll need to reverse-engineer what the JavaScript is doing. Does it fire off an AJAX request to populate the <div>? If so, it should be pretty easy to sniff the request using Firebug and then duplicate it with LWP::UserAgent or WWW::Mechanize to get the information.
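As a rough sketch of duplicating such a request with LWP::UserAgent, assuming the sniffed request turns out to be a plain GET: the URL is passed on the command line because the real endpoint for the page in the question is not known here, and the X-Requested-With and Referer headers are assumptions that only matter if the server checks them. The function names and the script name are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;

# Build a GET request that resembles what the page's JavaScript sends.
# The extra headers are assumptions; copy whatever the real request used.
sub build_request {
    my ($url, $referer) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->header('X-Requested-With' => 'XMLHttpRequest');
    $req->header('Referer' => $referer) if defined $referer;
    return $req;
}

# Send the request and return the decoded response body, or die on failure.
sub fetch_fragment {
    my ($url, $referer) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->request(build_request($url, $referer));
    die 'Request failed: ' . $res->status_line . "\n" unless $res->is_success;
    return $res->decoded_content;
}

# Write the text to a file as UTF-8.
sub save_to_file {
    my ($path, $text) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!\n";
    print {$fh} $text;
    close $fh or die "Cannot close $path: $!\n";
    return;
}

# Usage: perl fetch_results.pl AJAX_URL OUTPUT_FILE [REFERER]
if (@ARGV >= 2) {
    my ($url, $out, $referer) = @ARGV;
    save_to_file($out, fetch_fragment($url, $referer));
}
```

If the sniffed request is a POST or carries cookies, the request-building step would need to change accordingly; WWW::Mechanize may be more convenient when a session has to be established first.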

If the JavaScript is just doing pure DOM manipulation, then the data must already exist somewhere else in the page or in the JavaScript itself. So figure out where it's coming from and grab it.
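For the embedded-data case, one possible sketch, assuming the page source assigns a strict-JSON array to a JavaScript variable: the variable name resultsData, the name and version fields, and the sample markup are all invented for illustration and say nothing about how the page in the question is actually built. The regex is deliberately simple and would need adjusting if the literal is not valid JSON or contains "];" inside a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Find "VAR = [ ... ];" in the page source and decode the array literal.
# Returns an array reference, or undef if the assignment is not found.
sub extract_js_array {
    my ($html, $var) = @_;
    my ($json) = $html =~ /\b\Q$var\E\s*=\s*(\[.*?\])\s*;/s
        or return;
    return decode_json($json);
}

# Return the first row whose "name" field equals $name, or undef.
sub find_row {
    my ($rows, $name) = @_;
    for my $row (@$rows) {
        return $row if defined $row->{name} && $row->{name} eq $name;
    }
    return;
}

# Made-up page source standing in for the real download.
my $sample = <<'HTML';
<div id="results"></div>
<script>
var resultsData = [{"name":"ConsumerOPR","version":"x.y.z"},
                   {"name":"Other","version":"a.b.c"}];
document.getElementById('results').innerHTML = render(resultsData);
</script>
HTML

my $rows = extract_js_array($sample, 'resultsData');
my $row  = $rows ? find_row($rows, 'ConsumerOPR') : undef;
print "$row->{name}: $row->{version}\n" if $row;
```

In practice the HTML would come from an ordinary LWP::UserAgent fetch of the page rather than from a heredoc.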

Finally, if none of those options are adequate, you may need to just use a real browser to do it. There are a few options for automating browser behavior, like WWW::Mechanize::Firefox or Win32::IE::Mechanize.
