How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?

Problem description

Ever since I asked how to parse html with regex and got bashed a bit (rightfully so), I've been studying HTML::TreeBuilder, HTML::Parser, HTML::TokeParser, and HTML::Elements Perl modules.

I have HTML like this:

<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>

I want to extract the href, /45/subtitles-67624.aspx, but more importantly I want to know how to parse out the contents of the div.

I was given this example on a previous question:

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        # e.g. http://subscene.com/english/Sit-Down-Shut-Up-First-Season/subtitles-272112.aspx
        push @dnldLinks, $1 if $href =~ m!/subtitles-(\d{2,8})\.aspx!;
    }
}

This worked perfectly for that, but when I tried to edit it a bit and use it on a div, it didn't work. Here is the code I tried:

while (my $anchor = $p->get_tag("dt")) {
  if($stuff = $anchor->get_attr('a1')) {
    print $stuff."\n";
  }
}

Solution

To address your specific question, given the HTML:

<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>

I am assuming you are interested in the anchor text, i.e. ".45 (2006)", in this case, but only if the anchor occurs in a div with id listSubtitlesFilm.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

my @dnldLinks;

while ( my $div = $parser->get_tag('div') ) {
    my $id = $div->get_attr('id');
    next unless defined($id) and $id eq 'listSubtitlesFilm';

    # Stop if no anchor follows; otherwise get_attr would be called on undef.
    my $anchor = $parser->get_tag('a') or last;
    my $href = $anchor->get_attr('href');
    next unless defined($href)
        and $href =~ m!/subtitles-(\d{2,8})\.aspx\z!;
    push @dnldLinks, [$parser->get_trimmed_text('/a'), $1];
}

use Data::Dumper;
print Dumper \@dnldLinks;


__DATA__
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>

Output:

$VAR1 = [
          [
            '.45 (2006)',
            '67624'
          ]
        ];
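
The question also asks how to get at the contents of the div itself, not just the link. What follows is a minimal sketch of one way to do that with the same module, not part of the original answer. It walks the token stream and keeps everything between the opening tag of the wanted div and its matching closing tag. The helper name div_contents and the extra "other" div in the sample HTML are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical helper (not from the original answer): return the raw markup
# found between <div id="$wanted_id"> and its matching </div>.
sub div_contents {
    my ($html, $wanted_id) = @_;

    my $parser = HTML::TokeParser::Simple->new(string => $html);

    my $depth = 0;    # 0 means we are outside the target div
    my $inner = '';

    while ( my $token = $parser->get_token ) {
        if ( $depth == 0 ) {
            # Skip everything until the div with the wanted id opens.
            next unless $token->is_start_tag('div');
            my $id = $token->get_attr('id');
            $depth = 1 if defined($id) and $id eq $wanted_id;
            next;
        }

        # Count nested divs so an inner </div> does not end the capture early.
        $depth++ if $token->is_start_tag('div');
        $depth-- if $token->is_end_tag('div');
        last if $depth == 0;

        $inner .= $token->as_is;
    }

    return $inner;
}

# Sample input: the HTML from the question plus an unrelated div (assumed).
my $html = <<'HTML';
<div id="other"><a href="/ignored.aspx">ignored</a></div>
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>
HTML

print div_contents($html, 'listSubtitlesFilm');
```

Tracking the nesting depth is what lets this cope with a div inside the target div. As for the attempt in the question, get_attr expects an attribute name, so get_attr('a1') looks for an attribute called a1 and finds nothing; get_attr('id') on the dt tag would return a1.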
