Parse HTML using Perl


Problem description

I have the following HTML:

<div>
   <strong>Date: </strong>
       19 July 2011
</div>

I have been using HTML::TreeBuilder to parse out particular parts of HTML by tag or by class. However, the HTML above is giving me difficulty when I try to extract only the date.

For instance, I tried:

for ( $tree->look_down( '_tag' => 'div' ) ) {
    my $date = $_->look_down( '_tag' => 'strong' )->as_trimmed_text;
}

But that seems to conflict with an earlier use of <strong>. I want to parse out just the '19 July 2011'. I have read the TreeBuilder documentation but cannot find a way of doing this.

How can I do this using TreeBuilder?

Solution

The "dump" method is invaluable in finding your way around an HTML::TreeBuilder object.
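As a minimal sketch of that advice, the script below builds a tree from the HTML in the question and dumps it. It uses new_from_content as a shortcut for constructing and parsing in one step, which is a choice made for this illustration rather than something the question requires. In the dump, the date should appear as a quoted string directly under the <div>, as a sibling of the <strong> element rather than inside it.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

# Build the tree in one step from the HTML shown in the question.
my $tree = HTML::TreeBuilder->new_from_content(<<'END_OF_HTML');
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML

# Print an indented outline of the tree to STDOUT.
# Elements are shown as tags, text nodes as quoted strings.
$tree->dump;

# Release the tree once it is no longer needed.
$tree->delete;
```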

The solution here is to get the parent element of the text you're interested in (which is, in this case, the <div>) and iterate across its content list. The text you're interested in will be a plain text node, i.e. an item in the list that is not a reference to an HTML::Element object.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse(<<END_OF_HTML);
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
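END_OF_HTML
$tree->eof;    # tell the parser that the input is complete

my $date;

for my $div ($tree->look_down( _tag => 'div')) {
  for ($div->content_list) {
    # Child elements are HTML::Element references; plain text nodes are not.
    $date = $_ unless ref;
  }
}

# Strip the whitespace that surrounds the date in the source HTML.
$date =~ s/^\s+|\s+$//g;

print "$date\n";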
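If the same extraction is needed in more than one place, the loop can be wrapped in a small helper. The sketch below is one possible way to do that. The name direct_text is hypothetical and is not part of HTML::TreeBuilder or HTML::Element. It assumes that only text sitting directly inside the element is wanted, so text inside child elements such as <strong> is skipped.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not a library function): return the trimmed,
# non-empty text nodes that are direct children of $element.
sub direct_text {
    my ($element) = @_;
    my @text;
    for my $node ($element->content_list) {
        next if ref $node;    # skip child elements such as <strong>
        (my $trimmed = $node) =~ s/^\s+|\s+$//g;
        push @text, $trimmed if length $trimmed;
    }
    return @text;
}

my $tree = HTML::TreeBuilder->new_from_content(<<'END_OF_HTML');
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML

# Print every direct text node of every <div> in the document.
for my $div ($tree->look_down(_tag => 'div')) {
    print "$_\n" for direct_text($div);
}

$tree->delete;
```

For the HTML in the question this should print 19 July 2011 on a line of its own. Returning a list rather than a single scalar keeps every text node when a <div> contains more than one, instead of silently keeping only the last.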
