How to parse between <div class="foo"> and </div> easily in Perl

Views: 266  Posted: 2018/6/20 15:23:00  Tags: html perl parsing


Problem description


I want to parse a website into a Perl data structure. First I load the page with

    use LWP::Simple;
    my $html = get("http://f.oo");

Now I know two ways to deal with it: regular expressions and modules.

I started by reading about HTML::Parser and found some examples, but I'm not that sure about my Perl knowledge.

My code example goes on:

    my @links;
    my $p = HTML::Parser->new();
    $p->handler(start => \&start_handler, "tagname,attr,self");
    $p->parse($html);
    foreach my $link (@links) {
        print "Linktext: ", $link->[1], "\tURL: ", $link->[0], "\n";
    }

    sub start_handler {
        return if (shift ne 'a');
        my ($class) = shift->{href};
        my $self = shift;
        my $text;
        $self->handler(text => sub { $text = shift; }, "dtext");
        $self->handler(end => sub { push(@links, [$class, $text]) if (shift eq 'a') }, "tagname");
    }
I don't understand why there are two shifts. The second should be the self pointer. But the first makes me think that the self reference has already been shifted, used as a hash, and the value for href stored in $class. Could someone explain this line (my ($class) = shift->{href};)?
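
The argument specification "tagname,attr,self" passed to handler() sets the order in which the callback receives its arguments, which is probably the key to the line in question. The stand-alone sketch below does not use HTML::Parser at all: it calls a simplified handler by hand with arguments in that order, and the placeholder string 'PARSER' is an assumption standing in for the real parser object.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Simplified stand-in for the handler in the question. Each shift removes
# and returns the first remaining element of @_.
sub start_handler {
    my $tagname = shift;            # 1st argument: the tag name, e.g. 'a'
    return if $tagname ne 'a';
    my $attr    = shift;            # 2nd argument: hash ref of the tag's attributes
    my $href    = $attr->{href};    # what shift->{href} does in a single step
    my $self    = shift;            # 3rd argument: the parser object
    return ($href, $self);
}

# 'PARSER' is a hypothetical placeholder for the parser object.
my ($href, $self) = start_handler('a', { href => 'http://f.oo/page' }, 'PARSER');
print "href=$href self=$self\n";
```

Read this way, the shift inside "return if (shift ne 'a')" has already consumed the tag name, so shift->{href} operates on the attribute hash reference rather than on the self reference, and the final shift yields the parser object.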

Apart from that, I do not want to parse all the URLs. I want to put all the code between <div class="foo"> and </div> into a string. There is a lot of code in between, in particular other <div></div> tags, so I or a module has to find the right closing tag. After that I plan to scan the string again to find particular elements and classes, like <h1>, <h2>, <p class="foo2"></p>, etc.

I hope this information helps you give me some useful advice. Please keep in mind that first of all I want an approach that is easy to understand; it does not need to perform well at this stage!

Solution
Use HTML::TokeParser::Simple.

Untested code based on your description:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new(url => 'http://example.com/example.html');
    my $level = 0;    # nesting depth inside the current <div class="foo">

    # Skip ahead to each opening <div> and keep only those with class "foo".
    while (my $tag = $p->get_tag('div')) {
        my $class = $tag->get_attr('class');
        next unless defined($class) and $class eq 'foo';

        $level += 1;
        while (my $token = $p->get_token) {
            $level += 1 if $token->is_start_tag('div');
            $level -= 1 if $token->is_end_tag('div');
            print $token->as_is;    # the matching </div> is printed too
            last unless $level;     # depth 0: the matching </div> was reached
        }
    }
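
The question asks for the markup in a string rather than on standard output. A minimal sketch of that variant, using the same depth-counting idea, might look like the following. The helper name extract_foo_divs and the sample markup are assumptions made for illustration, and the parser is fed an in-memory string instead of a URL so that the example needs no network access.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: returns the inner HTML of every <div class="foo">
# found in $html, one string per match.
sub extract_foo_divs {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new(string => $html);
    my @blocks;

    while (my $tag = $p->get_tag('div')) {
        my $class = $tag->get_attr('class');
        next unless defined $class and $class eq 'foo';

        my $level = 1;     # we are inside the opening <div class="foo">
        my $inner = '';
        while (my $token = $p->get_token) {
            $level++ if $token->is_start_tag('div');
            $level-- if $token->is_end_tag('div');
            last if $level == 0;    # matching </div>: stop before appending it
            $inner .= $token->as_is;
        }
        push @blocks, $inner;
    }
    return @blocks;
}

# Sample markup invented for this sketch.
my $sample = '<div class="foo"><h1>Title</h1><div>nested</div>'
           . '<p class="foo2">text</p></div><div>other</div>';

print "$_\n" for extract_foo_divs($sample);
```

Each returned string can then be passed to a second HTML::TokeParser::Simple instance to look for <h1>, <h2>, or <p class="foo2"> elements, which matches the two-pass plan described in the question.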
