How do I most reliably preserve HTML entities when processing HTML documents with Mojo::DOM?

Question

I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.

I'm writing those phrases out to a file so that they can be translated into other languages, as follows:

    $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
        print_phrase($phrase); # utility function to write out the phrase to a file
    }

When Mojo::DOM encountered embedded HTML entities (such as &trade; and &nbsp;), it converted those entities into the corresponding characters rather than passing them along as written. I wanted the entities to be passed through as written.

I recognized that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is: "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."

If this is the case, I either have to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I must use some other technique to preserve the encoded HTML entities.
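A guard along the following lines is one way to avoid the undef result. This is a minimal sketch, not part of the original script: it uses the core Encode module in place of Mojo::Util::decode so that it runs without Mojolicious, and the helper name decode_if_utf8 is hypothetical. It attempts a strict UTF-8 decode and returns the input unchanged when that fails.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-8 bytes when possible, otherwise return the
# input untouched (for example when it already holds Perl characters).
sub decode_if_utf8 {
    my ($text) = @_;
    return '' unless defined $text;

    # FB_CROAK makes invalid input die instead of being silently repaired;
    # LEAVE_SRC keeps Encode from modifying $text in place.
    my $chars = eval {
        Encode::decode('UTF-8', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $chars ? $chars : $text;
}

my $from_bytes = decode_if_utf8("caf\xC3\xA9");   # UTF-8 bytes for "cafe" with an acute accent
my $from_chars = decode_if_utf8("\x{2122}");      # already a character (trade mark sign)
```

The guard is a heuristic: a character string whose code points happen to form valid UTF-8 byte sequences would still be decoded a second time. Since Mojo::Util::decode returns undef on failure, the same fallback can presumably be written there as decode('UTF-8', $text) // $text.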

How do I most reliably preserve encoded HTML entities when processing HTML documents with Mojo::DOM?

Recommended answer

Through testing, my colleagues and I determined that Mojo::DOM->new() was decoding ampersand characters (&) automatically, making it impossible to preserve HTML entities as written. To get around this, we added the following subroutine to double-encode ampersands:

sub encode_amp {
    my ($text) = @_;

    ##########
    #
    # We discovered that we need to encode ampersand
    # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
    # automatically by Mojo::Util::html_unescape().
    #
    # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)) which double encodes
    # any incoming ampersand (&) characters as &amp;.
    #
    ##########

    $text .= '';           # Suppress uninitialized value warnings
    $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
    return $text;
}
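As a quick check of what this kind of substitution does on its own, the sketch below replaces every & with &amp; in a few strings, without involving Mojo::DOM. The helper name escape_amp and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper that double-encodes ampersands as the answer describes.
sub escape_amp {
    my ($text) = @_;
    $text .= '';              # turns undef into '' without a warning
    $text =~ s/&/&amp;/g;     # every "&" gains one extra level of escaping
    return $text;
}

my $entity  = escape_amp('Acme&trade;');   # entity written in the source
my $already = escape_amp('R&amp;D');       # ampersand that was already escaped
my $empty   = escape_amp(undef);           # missing page text

print "$entity\n$already\n";
```

Because the parser removes exactly one level of escaping, an ampersand that was written as &amp; in the source should come back out as &amp; rather than as a bare &.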

Later in the script, we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object.

    $dom = Mojo::DOM->new(encode_amp($page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    # Note that "h2 b" is an important tag combination for capturing major headings on pages
    # in this theme. The tags "span" and "a" are also.
    #
    # We added caption and th to support tables.
    #
    # We added li and li a to support ol (ordered lists) and ul (unordered lists).
    #
    # We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
    # https://stackoverflow.com/questions/55130871/how-do-i-most-reliably-preserve-html-entities-when-processing-html-documents-wit#comment97006305_55131737
    #
    # Original set of selectors in $dom->find() below is as follows:
    #   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
        map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {

        print_phrase($phrase);

    }

The code block above incorporates earlier suggestions from @Grinnz, as seen in the comments on this question. Thanks also to @Robert for his answer, which made a good observation about how Mojo::DOM works.

This code definitely works for my application.
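The combined behavior can be tried on a small fragment. The following is a sketch that assumes a reasonably recent Mojolicious is installed; the sample markup and variable names are invented for illustration and are not taken from the original script. It escapes the ampersands, parses the result, and collects the text nodes with the same chain of collection methods.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Invented sample markup containing an entity and a nested element.
my $html = '<p>Acme&trade; <strong>bold</strong> tail</p>';

# Escape ampersands first so the parser only undoes the added &amp; layer.
(my $escaped = $html) =~ s/&/&amp;/g;
my $dom = Mojo::DOM->new($escaped);

# Collect every descendant text node of the selected elements, dropping
# duplicates that arise because "strong" is matched both directly and as a
# descendant of "p".
my @phrases = $dom->find('p, p strong')
    ->map('descendant_nodes')->map('each')
    ->grep(sub { $_->type eq 'text' })
    ->map('content')->uniq->each;

print "[$_]\n" for @phrases;
```

Each text node is returned separately, so the text before, inside, and after the nested element arrives as three phrases, and the entity is still present in its written form in the first one.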
