Extract text from HTML/XML tags in Perl


Problem description

I have an HTTPS response like this:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Some tittle &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;

</title>
    </head>
    <body>
        <h2>Some h2</h2>
        <p>some text:

            <pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
  &lt;key name="some variable"&gt;
    &lt;value&gt;1024&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre>
        </p>
        <hr>
        <i>
            <small>Some text</small>
        </i>
        <hr/>
    </body>
</html>

  • The key names are static, and I need to use a variable to grab specific values.
  • I'm using decode_entities to decode the text into HTML.
  • Sometimes the key appears twice in the response, but with the same value.
  • XML::LibXML doesn't help much here, since the response is not a well-formed XML file/string.

I tried to use a regex to get it, like this:

      sub get_key {
          my $start = '<key name="'.$_[0].'">\n<value>';
          print $_[1];
          my $end = "</value>";
          print " [*] Trying to get $_[0]\n";
          print "Start: $start  --- End $end";
          if($_[1] =~ /\b$start\b(.*?)\b$end\b/s){
              my $result = $1;
              print $result, "\n\n";
              return $result;
          }
      }
      
      get_key("string_to_search", $string_from_response);
      

I need to extract the text inside <value> for a given key:

      <key name="variable">
       <value>Grab me</value>
      </key>
      

Recommended answer

Once you've extracted the embedded XML document, you should use a proper XML parser.

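      use feature     qw( say );
      use XML::LibXML qw( );

      # $xml holds the embedded XML document, extracted as shown in the options below.
      my $xml_doc = XML::LibXML->new->parse_string($xml);

      # Print every key name together with the text of its <value> child.
      for my $key_node ($xml_doc->findnodes("/localconfig/key")) {
         my $key = $key_node->getAttribute("name");
         my $val = $key_node->findvalue("value/text()");
         say "$key: $val";
      }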
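
The question asks for the value of one specific key, chosen through a variable. A minimal sketch of that lookup, assuming the embedded XML has already been extracted into a string; the helper name get_key_value and the inline sample document are illustrative and not part of the original answer:

```perl
use strict;
use warnings;
use feature     qw( say );
use XML::LibXML qw( );

# Sample document modelled on the XML embedded in the question's response.
my $xml = <<'XML';
<localconfig>
  <key name="ssl_default">
    <value>sha256</value>
  </key>
  <key name="some variable">
    <value>1024</value>
  </key>
</localconfig>
XML

# Hypothetical helper: returns the <value> text of the first <key> whose
# name attribute equals $name, or undef when no such key exists.
sub get_key_value {
    my ($doc, $name) = @_;
    for my $key_node ($doc->findnodes('/localconfig/key')) {
        my $attr = $key_node->getAttribute('name') // '';
        next if $attr ne $name;
        return $key_node->findvalue('value');
    }
    return undef;
}

my $doc = XML::LibXML->new->parse_string($xml);
say get_key_value($doc, 'ssl_default');
say get_key_value($doc, 'some variable');
```

Comparing the attribute in Perl rather than interpolating the name into the XPath expression sidesteps quoting problems if a key name ever contains a quote character. Returning the first match is enough here because duplicated keys are said to carry the same value.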
      


So that leaves us with the question of how to extract the XML document.

Option 1: XML::LibXML

You could use XML::LibXML and simply tell it to ignore the error (the spurious </p> tag). Note that parse_html_fh expects a file handle; if the response body is already in a string, parse_html_string takes it directly.

      use Encode      qw( encode_utf8 );
      use XML::LibXML qw( );

      my $html_doc = XML::LibXML->new( recover => 2 )->parse_html_fh($html);
      my $xml = encode_utf8( $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r );
      

Option 2: Regex pattern match

You could probably get away with using a regex pattern match.

      use HTML::Entities qw( decode_entities );
      
      my $xml = decode_entities( ( $html =~ m{<pre>[^&]*(.*?)</pre>}s )[0] );
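
The one-liner above assumes the pattern always matches. A slightly more defensive sketch of the same technique follows, with a hypothetical helper name (extract_embedded_xml) and a shortened sample body modelled on the response in the question:

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Hypothetical, shortened response body modelled on the one in the question.
my $html = <<'HTML';
<html><body><p>some text:
<pre>    text &lt;localconfig&gt;
  &lt;key name="ssl_default"&gt;
    &lt;value&gt;sha256&lt;/value&gt;
  &lt;/key&gt;
&lt;/localconfig&gt;
</pre>
</p></body></html>
HTML

# Returns the decoded XML text, or undef when no <pre> block is found.
sub extract_embedded_xml {
    my ($body) = @_;
    # Skip everything in <pre> up to the first entity, then capture lazily
    # up to the closing </pre>.
    my ($encoded) = $body =~ m{<pre>[^&]*(.*?)</pre>}s
        or return;
    return decode_entities($encoded);
}

my $xml = extract_embedded_xml($html);
print defined $xml ? $xml : "no embedded XML found\n";
```

The result can then be handed to XML::LibXML's parse_string as in the parser example earlier in the answer.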
      

Option 3: Mojo::DOM

You could use Mojo::DOM to extract the embedded XML document.

      use Encode    qw( decode encode_utf8 );
      use Mojo::DOM qw( );
      
      my $decoded_html = decode($encoding, $html);
      my $html_doc = Mojo::DOM->new($decoded_html);    
      my $xml = encode_utf8( $html_doc->at('html > body > pre')->text =~ s/^[^<]*//r );
      

The problem with Mojo::DOM is that you need to know the document's encoding before you pass the document to the parser (because you must pass it decoded), but you need to parse the document in order to extract its encoding from the document itself.
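
One way around this circularity is to sniff the declared charset from the raw bytes before decoding. A rough sketch, assuming an ASCII-compatible encoding and a charset declaration near the top of the document; the helper guess_html_encoding and the UTF-8 fallback are assumptions and not part of Mojo::DOM or the original answer. In practice, a charset given in the HTTP Content-Type header should take precedence.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: looks for a charset declaration in the first bytes
# of the document and falls back to UTF-8 (an assumption) if none is found.
sub guess_html_encoding {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 1024);
    return $1 if $head =~ /charset\s*=\s*["']?([A-Za-z0-9_.:-]+)/i;
    return 'UTF-8';
}

# Sample raw bytes; \xC3\xA9 is the UTF-8 encoding of U+00E9.
my $html = qq{<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/></head><body><pre>caf\xC3\xA9</pre></body></html>};

my $encoding     = guess_html_encoding($html);
my $decoded_html = decode($encoding, $html);
```

The decoded string can then be passed to Mojo::DOM->new as in the snippet above.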

(Of course, you could use Mojo::DOM to parse the XML too.)

Note that the HTML fragment <p><pre></pre></p> means <p></p><pre></pre>, and both XML::LibXML and Mojo::DOM handle this correctly.
