Perl: add <a></a> around words within an HTML/XML tag


Question

I have a file formatted like this one:

Eye color
<p class="ul">Eye color, color</p> <p class="ul1">blue, cornflower blue, steely blue</p> <p class="ul1">velvet brown</p> <link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <p class="ul1">musteline</p> <link rel="stylesheet" href="a.css">
</>

Each comma-separated word within a <p class="ul1"> should be wrapped in an <a> tag, like this:

Eye color
<p class="ul">Eye color, color</p> <p class="ul1"><a href="entry://blue">blue</a>, <a href="entry://cornflower blue">cornflower blue</a>, <a href="entry://steely blue">steely blue</a></p> <p class="ul1"><a href="entry://velvet brown">velvet brown</a></p> <link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <p class="ul1"><a href="entry://musteline">musteline</a></p> <link rel="stylesheet" href="a.css">
</>

There could be one or several words within the <p class="ul1"> tag.

Is this possible with a Perl one-liner?

Thanks in advance. Any help is appreciated.

Recommended answer

Parse the file using a module and iterate over the elements you need (<p> of class ul1). Extract the comma-separated phrases from each and wrap links around them, then replace the element's content with that new content. Write the changed tree out at the end.

Using HTML::TreeBuilder (with its workhorse HTML::Element):

use warnings;
use strict;
use feature 'say';

use HTML::Entities;
use HTML::TreeBuilder;

my $file = shift // die "Usage: $0 file\n";

my $tree = HTML::TreeBuilder->new_from_file($file);

foreach my $elem ($tree->look_down(_tag => "p", class => "ul1")) {
    my @new_content;
    for ($elem->content_list) {
        # each content item here is a plain text string
        my @w = split /\s*,\s*/;
        my $wrapped = join ", ",
            map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
        push @new_content, $wrapped;
    }
    $elem->delete_content;
    $elem->push_content( @new_content );
}

# The links were pushed as text, so as_HTML entity-encodes them;
# decode_entities turns them back into markup
say decode_entities $tree->as_HTML;

In your case an element ($elem) will have only one item in its content_list, so you don't have to collect the modified content into an array (@new_content) and can process just that one piece, which simplifies the code. Working with a list as above doesn't hurt, of course.

I redirect the output of this program to an .html file. The generated file is quite frugal with newlines. If pretty HTML matters, make a pass with a tool like HTML::Tidy or HTML::PrettyPrinter.

In a one-liner? Nah, it's too much. And please don't use a regex, as there's trouble down the road: it needs close work to get right, easily ends up buggy, is sensitive to the smallest details, and is brittle under even the slightest changes in input. And that's when it can do the job at all. There are reasons for libraries.

Another good tool for this job is Mojo::DOM. For example:

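use warnings;
use strict;
use feature 'say';

use Mojo::DOM;
use Path::Tiny;  # only to read the file into a string easily

my $file = shift // die "Usage: $0 file\n";
my $html = path($file)->slurp;

my $dom = Mojo::DOM->new($html);

foreach my $elem ($dom->find('p.ul1')->each) {
    # split on commas, dropping the whitespace around them
    my @w = split /\s*,\s*/, $elem->text;
    my $new = join ', ',
        map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
    # content() swaps what is inside <p>; replace() would drop the <p> tag itself
    $elem->content( $new );
}

say $dom;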

This produces the same HTML as above (just nicer, and note that there is no need to deal with entities).

Newer module versions provide the new_tag method, with which the additional link above can be made as

my $new = join ', ',
   map { $elem->new_tag('a', 'href' => "entry://$_", $_) } @w;

which takes care of some subtle needs (HTML escaping, for one). The main docs don't say when this method was added; see the changelog (May 2018, so supposedly in v5.28; it works with my 5.29.2).
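To make the escaping point concrete, here is a minimal core-Perl sketch of what has to happen when the links are built by hand rather than with new_tag. The helper names escape_html and wrap_phrases are hypothetical (they are not part of any module mentioned here), and the escaping covers only the four characters that matter in text and in double-quoted attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML
# text and in double-quoted attribute values.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: turn a comma-separated string of phrases into
# a comma-separated string of <a> links, escaping each phrase first.
sub wrap_phrases {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return join ', ',
        map  { my $e = escape_html($_); qq(<a href="entry://$e">$e</a>) }
        grep { length }         # skip empty pieces, e.g. from a trailing comma
        split /\s*,\s*/, $text;
}

print wrap_phrases('blue, cornflower blue, R&D'), "\n";
```

Without the escaping step, a phrase such as R&D or one containing a double quote would produce invalid markup or a broken href attribute.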

I padded the shown sample into this file for testing:

<!DOCTYPE html>  <title>Eye color</title> <body>
<p class="ul">Eye color, color</p> 
<p class="ul1">blue, cornflower blue, steely blue</p> 
<p class="ul1">velvet brown</p> <link rel="stylesheet" href="a.css"></>
weasel
<p class="ul">weasel</p> 
<p class="ul1">musteline</p> <link rel="stylesheet" href="a.css"></>
</body> </html>


Update: It has been clarified that the given markup snippet isn't merely a fragment of a presumably full HTML document; it is a file (as stated) that stands as shown, a custom format using HTML. Apart from the required changes, the rest of it needs to be preserved.

A particularly unpleasant detail proves to be the </> part: each of HTML::TreeBuilder, Mojo::DOM, and XML::LibXML discards it while parsing. I couldn't find a way to make them keep that piece.

It was Marpa::HTML that processed the whole fragment as required, changing what was asked while leaving the rest of it alone.

use warnings;
use strict;
use feature 'say';
use Path::Tiny;

use Marpa::HTML qw(html);

my $file = shift // die "Usage: $0 file\n";
my $html = path($file)->slurp;

my $marpa = Marpa::HTML::html( 
    \$html,
    {
        'p.ul1' => sub {
            return join ', ', 
                map { qq(<a href="entry://$_">).$_.q(</a>) } 
                split /\s*,\s*/, Marpa::HTML::contents();
        },
    }
);  

say $$marpa; 

The processing of the <p> tags of class ul1 is the same as before: split the content on commas, wrap each piece in an <a> tag, then join the pieces back together with ", ".

This prints (with added line breaks and indentation for readability):

Eye color
<p class="ul">Eye color, color</p> 
<a href="entry://blue">blue</a>, 
    <a href="entry://cornflower blue">cornflower blue</a>, 
    <a href="entry://steely blue">steely blue</a> 
    <a href="entry://velvet brown">velvet brown</a> 
<link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <a href="entry://musteline">musteline</a> 
<link rel="stylesheet" href="a.css">
</>

It is the overall approach of this module that makes it suited for a task like this. From its documentation:

Marpa::HTML is an extremely liberal HTML parser. Marpa::HTML does not reject any documents, no matter how poorly they fit the HTML standards.

Here it processed a custom piece of HTML-like markup, leaving things like </> in place.

  See this post for an example of very permissive processing of HTML with XML::LibXML.
