How can I extract XML of a website and save in a file using Perl's LWP?


Question


How can I extract information from a website (http://tv.yahoo.com/listings) and then create an XML file out of it? I want to save it so that I can parse it later and display the information using JavaScript.


I am quite new to Perl and I have no idea how to do it.

Answer


Of course. The easiest way would be the Web::Scraper module. What it does is let you define scraper objects that consist of:


  1. hash key names,

  2. XPath expressions that locate the elements of interest,

  3. and code to extract data from them.


Scraper objects take a URL and return a hash of the extracted data. The extractor code for each key can itself be another scraper object, if necessary, so that you can define how to scrape repeated compound page elements: provide the XPath to find the compound element in an outer scraper, then provide a bunch more XPaths to pull out its individual bits in an inner scraper. The result is then automatically a nested data structure.
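As a sketch of what that nesting looks like in practice — note that the CSS selectors and field names here (`li.show`, `.title`, `.time`) are invented for illustration, since the actual markup of the Yahoo listings page isn't shown; the sample scrapes an inline HTML snippet, but for the live site you would pass `URI->new('http://tv.yahoo.com/listings')` instead:

```perl
use strict;
use warnings;
use Web::Scraper;

# Inner scraper: extracts the individual bits of one programme entry.
my $show = scraper {
    process '.title', title => 'TEXT';
    process '.time',  time  => 'TEXT';
};

# Outer scraper: locates each repeated compound element and hands it
# to the inner scraper; results are collected into the 'shows' array.
my $listings = scraper {
    process 'li.show', 'shows[]' => $show;
};

# Sample markup standing in for the real page.
my $html = <<'HTML';
<ul>
  <li class="show"><span class="time">8:00pm</span> <span class="title">News</span></li>
  <li class="show"><span class="time">9:00pm</span> <span class="title">Movie</span></li>
</ul>
HTML

my $data = $listings->scrape($html);
# $data is a nested structure of the form
# { shows => [ { time => '8:00pm', title => 'News' }, ... ] }
```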


In short, you can very elegantly suck data from all over a page into a Perl data structure. In doing so, the full power of XPath + Perl is available for use against any page. Since the page is parsed with HTML::TreeBuilder, it does not matter how nasty the tag soup is. The resulting scraper scripts are much easier to maintain and far more tolerant of minor markup variations than regex-based scrapers.


Bad news: as yet, its documentation is almost non-existent, so you have to get by with googling for something like [miyagawa web::scraper] to find example scripts posted by the module’s author.
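The answer doesn't cover the second half of the question, writing the extracted data out as an XML file. One possible sketch uses the XML::Simple module (one of several CPAN options; the data layout below mirrors the kind of hash a scraper returns, but is made up for illustration):

```perl
use strict;
use warnings;
use XML::Simple qw(XMLout);

# A stand-in for the hashref of extracted data a scraper would return.
my $data = {
    shows => [
        { time => '8:00pm', title => 'News'  },
        { time => '9:00pm', title => 'Movie' },
    ],
};

# Serialise the Perl data structure to XML and write it to a file,
# ready to be fetched and parsed later by the JavaScript front end.
my $xml = XMLout( $data, RootName => 'listings', XMLDecl => 1 );

open my $fh, '>', 'listings.xml' or die "Cannot write listings.xml: $!";
print {$fh} $xml;
close $fh;
```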
