How to parse between <div class="foo"> and </div> easily in Perl
Problem description
I want to parse a website into a Perl data structure. First I load the page with
    use LWP::Simple;
    my $html = get("http://f.oo");
Now I know of two ways to deal with it: regular expressions, or parser modules.
I started by reading about HTML::Parser and found some examples, but I'm not that sure of my Perl knowledge.
My code example continues:
    my @links;
    my $p = HTML::Parser->new();
    $p->handler(start => \&start_handler, "tagname,attr,self");
    $p->parse($html);
    foreach my $link (@links) {
        print "Linktext: ", $link->[1], "\tURL: ", $link->[0], "\n";
    }
    sub start_handler {
        return if (shift ne 'a');
        my ($class) = shift->{href};
        my $self = shift;
        my $text;
        $self->handler(text => sub { $text = shift; }, "dtext");
        $self->handler(end => sub { push(@links, [$class, $text]) if (shift eq 'a') }, "tagname");
    }
I don't understand why shift appears twice. The second one should be the self pointer. But the first makes me think that the self reference has already been shifted off, used as a hash, and the value for href stored in $class. Could someone explain this line (my ($class) = shift->{href};)?
Besides this gap in my understanding, I do not want to parse all the URLs. I want to put all the code between <div class="foo"> and its matching </div> into a string. There is a lot of code in between, in particular other nested <div></div> tags, so either I or a module has to find the right closing tag.
After that I plan to scan the string again to find particular elements, such as <h1>, <h2>, <p class="foo2"></p>, etc.
I hope this information helps you give me some useful advice. Please keep in mind that, first of all, I want an approach that is easy to understand; it does not need to perform well at this stage!
Recommended answer
Use HTML::TokeParser::Simple. Untested code based on your description:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new(url => 'http://example.com/example.html');
    my $level = 0;    # nesting depth of <div> tags inside the current foo block
    # Skip ahead to each <div> start tag in the document.
    while (my $tag = $p->get_tag('div')) {
        my $class = $tag->get_attr('class');
        next unless defined($class) and $class eq 'foo';
        $level += 1;    # we are now inside <div class="foo">
        while (my $token = $p->get_token) {
            $level += 1 if $token->is_start_tag('div');
            $level -= 1 if $token->is_end_tag('div');
            print $token->as_is;    # note: this also prints the matching </div>
            last unless $level;     # depth 0: the foo block has been closed
        }
    }
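
On the my ($class) = shift->{href}; line from the question: inside a sub, shift with no argument removes and returns the first element of @_. The argspec "tagname,attr,self" tells HTML::Parser to pass the handler the tag name, a hash reference of the tag's attributes, and the parser object, in that order. So the first shift consumes the tag name, the second returns the attribute hash (from which ->{href} is read), and the third returns the parser. The sketch below imitates that call without loading HTML::Parser; demo_handler and the string 'parser-object' are hypothetical stand-ins, not part of any module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for start_handler. HTML::Parser would call the real
# handler with the values named in the argspec "tagname,attr,self".
sub demo_handler {
    my $tagname = shift;             # 1st shift: the tag name, e.g. 'a'
    my $attr    = shift;             # 2nd shift: hash reference of attributes
    my $self    = shift;             # 3rd shift: the parser object
    my $href    = $attr->{href};     # same value that shift->{href} yields
    return ($tagname, $href, $self);
}

# Simulate the call; a plain string stands in for the parser object.
my @result = demo_handler('a', { href => 'http://example.com/' }, 'parser-object');
print join(' | ', @result), "\n";
```

Seen this way, the variable named $class in the question actually holds the link's href value, not a class attribute.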