Does PHP not have a function for XML-safe entity decoding? Is there no xml_entity_decode?

Question

THE PROBLEM: I need an XML file that is "fully encoded" in UTF-8; that is, with no entities representing symbols, and all symbols encoded as UTF-8, except the only three that are XML-reserved: "&" (amp), "<" (lt) and ">" (gt). And I need a built-in function that does it fast: transforming entities into real UTF-8 characters (without corrupting my XML).
  PS: this is a "real-world problem" (!); PMC/journals, for example, has 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to convert numeric entities into UTF-8 characters.

THE ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!) by also transforming the three XML-reserved symbols.

Illustrating the problem

Suppose

  $xmlFrag ='<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

Here the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF-8, but the XML-reserved lt must not. After the transformation, the XML text will be

  // the two characters after "Hello world! " stand for U+00A0 (no-break space)
  $xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';

The text "A<B" needs an XML-reserved character, so MUST stay as A&lt;B.


Frustrated solutions

I tried to use html_entity_decode to solve the problem directly... So I updated my PHP to v5.5 to try the ENT_XML1 option,

  $s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
                                                        // as I expected

Perhaps another question is, "WHY is there no other option that does what I expected?" This matters for many other XML applications (!), not only for me.
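
To make "not working as I expected" concrete, here is a small sketch of what ENT_XML1 does with a fragment like the one above. It assumes PHP 5.4 or later (where ENT_XML1 exists), and the fragment is extended with a `&nbsp;` purely for illustration. Numeric references are decoded, but so is the XML-reserved `&lt;`, while named entities that exist only in HTML are left alone:

```php
<?php
// Illustrative fragment: one numeric nbsp, one named nbsp, one reserved entity.
$xmlFrag = '<p>Hello world! &#160;&nbsp; Let A&lt;B and A=&#x222C;dxdy</p>';

$s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8');

// The XML-reserved "&lt;" has become a raw "<", so the XML is now broken.
var_dump(strpos($s, 'A<B') !== false);
// Numeric references were converted to UTF-8 (U+222C is the bytes E2 88 AC).
var_dump(strpos($s, "\xE2\x88\xAC") !== false);
// "&nbsp;" is not one of the five predefined XML entities, so it is untouched.
var_dump(strpos($s, '&nbsp;') !== false);
```

So ENT_XML1 selects which entity table is used for decoding; it does not mean "decode everything except what XML reserves".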


I do not need a workaround as an answer... OK, I will show my ugly function; perhaps it helps you understand the problem,

  function xml_entity_decode($s) {
    // An illustration (as a user-defined function) of how the
    // hypothetical built-in PHP function MUST work.
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    // Hide the 3 XML-reserved entities behind placeholders.
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);

    //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any PHP version
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+

    // Restore the reserved entities.
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }  // You see? No benchmark needed:
     //  it is not as fast as a direct call to html_entity_decode;
     //  an XML-safe option would be ideal.

PS: corrected after this answer. The ENT_HTML5 flag is required in order to convert really all named entities.

Solution

From time to time this question attracts a "false answer" (see answers). This is perhaps because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.

... So, let's repeat my workaround (which is NOT an answer!) to avoid creating more confusion:

The best workaround

Pay attention:

  1. The function xml_entity_decode() below is the best workaround (better than any other).
  2. The function below is not an answer to the present question; it is only a workaround.

  function xml_entity_decode($s) {
    // Illustrates how a (hypothetical) built-in PHP function MUST work.
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);  // protect reserved entities
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);  // restore them
    return $s;
  }
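
One case the placeholder approach does not seem to cover is a numeric reference to a reserved character, such as `&#60;` or `&#38;`: it is not in the placeholder list, so html_entity_decode turns it into a raw `<` or `&`. The following is only a sketch of a stricter variant, assuming PHP 5.4+; the name `xml_entity_decode_strict` is hypothetical. It decodes each reference on its own and keeps the original text whenever the result would be one of the three reserved characters. It is likely to be slower than the str_replace version, so it trades speed for safety:

```php
<?php
// Hypothetical stricter variant of the workaround (not a PHP built-in).
function xml_entity_decode_strict($s) {
    return preg_replace_callback(
        // Decimal, hexadecimal, or named character reference.
        '/&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/',
        function ($m) {
            $char = html_entity_decode($m[0], ENT_HTML5 | ENT_NOQUOTES, 'UTF-8');
            // Keep the reference as written if decoding it would yield
            // an XML-reserved character.
            if ($char === '&' || $char === '<' || $char === '>') {
                return $m[0];
            }
            // Unknown references come back unchanged from html_entity_decode.
            return $char;
        },
        $s
    );
}

$in  = '<p>A&#60;B &#38; C&lt;D and &#x222C;</p>';
$out = xml_entity_decode_strict($in);
echo $out, "\n";
```

Here `&#60;`, `&#38;` and `&lt;` stay as they are, and only `&#x222C;` becomes a UTF-8 character.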


To test and to demonstrate that you have a better solution, please test it first with this simple benchmark:

  $countBchMk_MAX=1000;
  $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
  $start_time = microtime(TRUE);
  for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){

    $A = xml_entity_decode($xml); // 0.0002 (seconds per iteration)

    /* 0.0014 (seconds per iteration)
     $doc = new DOMDocument;
     $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
     $doc->encoding = 'UTF-8';
     $A = $doc->saveXML();
    */

  }
  $end_time = microtime(TRUE);
  echo "\n<h1>END $countBchMk_MAX BENCHMARKs WITH ",
     ($end_time  - $start_time)/$countBchMk_MAX,
     " seconds</h1>";
