Parsing HTML with Mojolicious User Agent


Problem description


I have HTML that looks something like this:

 <h1>My heading</h1>

 <p class="class1">
 <strong>SOMETHING</strong> INTERESTING (maybe not).
 </p>

 <div class="mydiv">
 <p class="class2">
 <a href="http://www.link.com">interesting link</a> </p>

 <h2>Some other heading</h2>

The content between h1 and h2 varies. I know I can use CSS selectors in Mojo::DOM to select, say, the content of the h1, h2, or p tags, but how do I select everything between h1 and h2? Or, more generally, everything between any two given sets of tags?

Solution

It's pretty straightforward. You can select all candidate elements into a Mojo::Collection object (this is what Mojo::DOM's children method returns, for example) and then do a state-machine-like match while iterating over that collection.
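A minimal sketch of such a state-machine match with an explicit flag. It assumes Mojolicious 6.0 or later, where the element name is returned by tag; the sample markup and the variable names $inside and @between are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# hypothetical sample markup, not taken from the question
my $dom = Mojo::DOM->new(<<'HTML');
<div id="yay">
  <span>before</span>
  <h1>start</h1>
  <p>inside</p>
  <h2>end</h2>
  <span>after</span>
</div>
HTML

my $inside = 0;    # state: are we currently between <h1> and <h2>?
my @between;       # collected elements, both headings included

for my $e ($dom->at('#yay')->children->each) {
    $inside = 1 if $e->tag eq 'h1';    # opening marker switches the state on
    push @between, $e if $inside;
    $inside = 0 if $e->tag eq 'h2';    # closing marker switches it off again
}

say for @between;
```

The flag is an ordinary lexical variable, so it is easy to reset or inspect between runs, at the cost of a few more lines than the flip-flop version.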

Probably the most magic way to do this is to use Perl's range operator .. in scalar context. perlop describes it as follows:

In scalar context, ".." returns a boolean value. The operator is bistable, like a flip-flop, and emulates the line-range (comma) operator of sed, awk, and various editors. Each ".." operator maintains its own boolean state, even across calls to a subroutine that contains it. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, AFTER which the range operator becomes false again. It doesn't become false till the next time the range operator is evaluated.
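To see the flip-flop in isolation, here is a small core-Perl sketch that needs no Mojolicious. The word list and the START/END markers are invented for the illustration:

```perl
use strict;
use warnings;

# made-up input: two START..END ranges with other items around them
my @lines = qw(skip START keep1 keep2 END skip2 START again END tail);

# grep evaluates its block in scalar context, so '..' acts as a flip-flop:
# true from the item matching the left test through the item matching the right test
my @picked = grep { /^START$/ .. /^END$/ } @lines;

print "@picked\n";    # prints: START keep1 keep2 END START again END
```

Note that the operator switches on again at the second START, so every matching range is picked up, not only the first one.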

Here's a simple example:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# slurp all DATA lines
my $dom = Mojo::DOM->new(do { local $/; <DATA> });

# select all children of <div id="yay"> into a Mojo::Collection
my $yay = $dom->at('#yay')->children;

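# select interesting ('..' operator in scalar context: flip-flop)
# note: since Mojolicious 6.0 the element name is returned by 'tag';
# 'type' now returns the node type (e.g. 'tag' or 'text')
my $interesting = $yay->grep(sub {
    my $e = shift;
    $e->tag eq 'h1' .. $e->tag eq 'h2';
});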

say $interesting->join("\n");

__DATA__
<div id="yay">
    <span>This isn't interesting</span>
    <h1>INTERESTING STARTS HERE</h1>
    <strong>SOMETHING INTERESTING</strong>
    <span>INTERESTING TOO</span>
    <h2>END OF INTERESTING</h2>
    <span>This isn't interesting</span>
</div>

Output

<h1>INTERESTING STARTS HERE</h1>
<strong>SOMETHING INTERESTING</strong>
<span>INTERESTING TOO</span>
<h2>END OF INTERESTING</h2>

Explanation

I'm using Mojo::Collection's grep to filter the collection object $yay. Because grep only checks the callback's return value for truth, the callback's last expression is evaluated in scalar context, so the .. operator acts as a flip-flop. It becomes true when it first sees an h1 element and turns false again right after it sees an h2 element, so you get all elements between those two headings, including the headings themselves.

Since I assume you know some Perl: you can put arbitrary tests on either side of .., so I hope this helps to solve your problem!
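As one possible illustration of arbitrary tests, the sketch below uses CSS selectors as the start and end conditions through Mojo::DOM's matches method. The markup, the #content id and the selectors are assumptions made up for this example:

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# hypothetical markup, loosely modelled on the question
my $dom = Mojo::DOM->new(<<'HTML');
<div id="content">
  <h1>My heading</h1>
  <p class="class1">first</p>
  <p class="class2">second</p>
  <h2>Some other heading</h2>
  <p>after the range</p>
</div>
HTML

my $range = $dom->at('#content')->children->grep(sub {
    my $e = shift;
    # any boolean tests can serve as the start and end conditions
    $e->matches('p.class1') .. $e->matches('p.class2');
});

say $range->join("\n");
```

Keep in mind that the flip-flop switches on again whenever the left test matches later in the collection, so pick a start test that is specific enough for your markup.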
