WWW::Mechanize Extraction Help - PERL

Question

I'm trying to automate the extraction of a transcript found on a website. The site formats the interview as a description list, so the entire transcript sits between dl tags. The script below lets me fetch the page and extract the transcript as plain text, but what I actually want is everything between the dl tags, including the dd's, dt's, etc. This will allow us to develop our own CSS for the interview.

Something to note about the page is that there are break statements inserted at various points during the interview. Some tools we've tried that extract information from webpages using pairings struggle with this, since they only grab the information up to the break statement. Just something to keep in mind if you point me in a different direction. Here's what I have so far.

#!/usr/bin/perl -w

use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-albright");

# find all <dl> tags
my @list = $mech->find('dl');

foreach ( @list ) {
    print $_->as_text();
}

If there is a tool that essentially prints what I have, only this time as HTML, please let me know of it!

Recommended answer

Your code is fine; just change the as_text() call to as_HTML(), and the output will include the HTML tags (dl, dt, dd and so on) along with the text.
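
As a minimal sketch of that change, here is the script from the question with as_HTML() in place of as_text(). The three arguments passed to as_HTML() and the binmode call are additions that are not part of the original answer. They assume you want every closing tag written out, since HTML::Element omits optional end tags such as </dt> and </dd> by default, and that the page may contain non-ASCII characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

# Assumption: the transcript may contain non-ASCII text, so emit UTF-8.
binmode( STDOUT, ':encoding(UTF-8)' );

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-albright");

# Each match is an HTML::Element, so as_HTML() serializes the <dl>
# together with everything nested inside it (<dt>, <dd>, <br>, ...).
foreach my $dl ( $mech->find('dl') ) {
    # Arguments (an assumption about the desired output, not required):
    #   '<>&' - escape only these characters as entities
    #   "  "  - indent nested elements by two spaces
    #   {}    - treat no end tag as optional, so </dt> and </dd> are printed
    print $dl->as_HTML( '<>&', "  ", {} ), "\n";
}
```

Calling as_HTML() with no arguments, as the answer suggests, also works. The explicit arguments only make the markup easier to read and to style with CSS.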
