Parse HTML using Perl
Problem description
I have the following HTML:
<div>
<strong>Date: </strong>
19 July 2011
</div>
I have been using HTML::TreeBuilder to parse out particular parts of HTML, selecting them by tag or by class. However, the HTML above is giving me difficulty when I try to extract only the date.
For instance, I tried:
for ( $tree->look_down( '_tag' => 'div' ) )
{
    my $date = $_->look_down( '_tag' => 'strong' )->as_trimmed_text;
}
But that seems to conflict with an earlier use of <strong>.
I am looking to parse out just the '19 July 2011'. I have read the documentation on TreeBuilder but cannot find a way of doing this.
How can I do this using TreeBuilder?
The "dump" method is invaluable in finding your way around an HTML::TreeBuilder object.
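As a minimal sketch of that tip, assuming the same markup as in the question, the following builds the tree and calls dump on it. The dump shows each element on its own line and text nodes as quoted strings, so you can see that the date is a text node sitting directly under the <div> rather than inside the <strong>.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

# parse_content is a shorthand for parse() followed by eof().
$tree->parse_content(<<'END_OF_HTML');
<div>
<strong>Date: </strong>
19 July 2011
</div>
END_OF_HTML

# Print an indented outline of the parsed tree to STDOUT.
$tree->dump;
```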
The solution here is to get the parent element of the element you're interested in (which is, in this case, the <div>) and iterate across its content list. The text you're interested in will be plain text nodes, i.e. elements in the list that are not references to HTML::Element objects.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<END_OF_HTML);
<div>
<strong>Date: </strong>
19 July 2011
</div>
END_OF_HTML
$tree->eof;    # tell the parser that the document is complete

my $date;
for my $div ($tree->look_down( _tag => 'div')) {
    for ($div->content_list) {
        # Child elements are HTML::Element references; plain text is not.
        $date = $_ unless ref;
    }
}
print "$date\n";
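The loop above keeps only the last text node it sees, and that text may still carry surrounding whitespace. One possible variation is sketched below, assuming you want every direct text child of each <div>, trimmed. The helper name direct_text_of_divs is hypothetical and not part of HTML::TreeBuilder; it only uses the look_down, content_list, parse_content and delete methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::TreeBuilder): return the trimmed,
# non-empty text nodes that are direct children of each <div>.
sub direct_text_of_divs {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);    # parse() followed by eof()

    my @texts;
    for my $div ($tree->look_down(_tag => 'div')) {
        for my $node ($div->content_list) {
            next if ref $node;      # skip child elements such as <strong>
            (my $text = $node) =~ s/^\s+|\s+$//g;    # trim a copy of the text
            push @texts, $text if length $text;
        }
    }

    $tree->delete;                  # release the tree when finished
    return @texts;
}

my $html = <<'END_OF_HTML';
<div>
<strong>Date: </strong>
19 July 2011
</div>
END_OF_HTML

print "$_\n" for direct_text_of_divs($html);
```

Because the label text lives inside the <strong> element, it is never a direct child of the <div> and is skipped by the ref check.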