How to parse between <div class ="foo"> and </div> easily in Perl

Problem description

I want to parse a website into a Perl data structure. First I load the page with:

use LWP::Simple;
my $html = get("http://f.oo");

Now I know two ways to deal with it: the first is regular expressions and the second is modules.

I started by reading about HTML::Parser and found some examples, but I'm not that sure about my Perl knowledge.

My code example continues:

my @links;

my $p = HTML::Parser->new();
$p->handler(start => \&start_handler,"tagname,attr,self");
$p->parse($html);

foreach my $link(@links){
  print "Linktext: ",$link->[1],"\tURL: ",$link->[0],"\n";
}

sub start_handler{
  return if(shift ne 'a');
  my ($class) = shift->{href};
  my $self = shift;
  my $text;
  $self->handler(text => sub{$text = shift;},"dtext");
  $self->handler(end => sub{push(@links,[$class,$text]) if(shift eq 'a')},"tagname");
}

I don't understand why shift is called twice. The second one should be the self pointer. But the first makes me think that the self reference has already been shifted, used as a hash, and the value for href is stored in $class. Could someone explain this line (my ($class) = shift->{href};)?
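For reference, the handler is registered with the argspec "tagname,attr,self", so HTML::Parser passes three arguments in that order: the tag name, a hash reference of the tag's attributes, and the parser object. The shift in `return if(shift ne 'a')` already consumes the tag name, so `shift->{href}` operates on the attribute hash reference, and the last shift yields the parser. A minimal sketch of that argument order, calling the handler by hand rather than through HTML::Parser (the argument values are made up for illustration):

```perl
use strict;
use warnings;

# Stand-in for the handler: with the argspec "tagname,attr,self",
# HTML::Parser calls it as start_handler($tagname, \%attr, $parser).
sub start_handler {
    my $tagname = shift;    # 1st shift: the tag name, e.g. 'a'
    my $attr    = shift;    # 2nd shift: hash reference of the tag's attributes
    my $self    = shift;    # 3rd shift: the parser object
    return ($tagname, $attr->{href}, $self);
}

# Hypothetical arguments, passed by hand for illustration only.
my @result = start_handler('a', { href => 'http://f.oo/page' }, 'parser-object');
print "tag=$result[0] href=$result[1] self=$result[2]\n";
```

So `my ($class) = shift->{href};` just stores the link's href attribute in a variable that happens to be named $class.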

Besides this gap in my understanding, I do not want to parse all the URLs. I want to put all the code between <div class ="foo"> and </div> into a string. There is a lot of code in between, especially other <div></div> tags, so I or a module has to find the right closing tag. After that I plan to scan the string again to find particular elements, like <h1>, <h2>, <p class ="foo2"></p>, etc.

I hope this information helps you give me some useful advice. Please keep in mind that first of all I want an easy-to-understand approach; it does not need great performance at this stage!

Solution

Use HTML::TokeParser::Simple.

Untested code based on your description:

#!/usr/bin/env perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(url => 'http://example.com/example.html');

my $level;

while (my $tag = $p->get_tag('div')) {
    my $class = $tag->get_attr('class');
    next unless defined($class) and $class eq 'foo';

    $level += 1;

    while (my $token = $p->get_token) {
        $level += 1 if $token->is_start_tag('div');
        $level -= 1 if $token->is_end_tag('div');
        print $token->as_is;
        unless ($level) {
            last;
        }
    }
}
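
Since the question asks for the content as a string rather than printed output, the same depth-counting loop can append to a variable instead. The following is a sketch under assumptions, not tested against the real page: the sample HTML is made up, it uses the `string` source of HTML::TokeParser::Simple instead of `url`, and unlike the code above it stops before the closing </div> of the matching block.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input, used here instead of fetching a URL.
my $html = '<div class="foo"><h1>Title</h1><div>inner</div>'
         . '<p class="foo2">text</p></div><div>other</div>';

my $p = HTML::TokeParser::Simple->new(string => $html);

my @blocks;    # one string per <div class="foo"> found
while (my $tag = $p->get_tag('div')) {
    my $class = $tag->get_attr('class');
    next unless defined($class) and $class eq 'foo';

    my $level   = 1;     # we are inside the matching div
    my $content = '';
    while (my $token = $p->get_token) {
        $level++ if $token->is_start_tag('div');
        $level-- if $token->is_end_tag('div');
        last unless $level;          # matching </div> reached
        $content .= $token->as_is;   # keep the original markup
    }
    push @blocks, $content;
}

print "$_\n" for @blocks;
```

Each collected string can then be handed to a second HTML::TokeParser::Simple object to look for <h1>, <h2> or <p class="foo2"> inside it.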
