loadHTML LIBXML_HTML_NOIMPLIED on an HTML fragment generates incorrect tags


Problem description

Using the LIBXML_HTML_NOIMPLIED flag with an HTML fragment generates incorrectly nested tags:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';
$doc = new DOMDocument();
$doc->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $doc->saveHTML();

Outputs:

<p>Lorem ipsum dolor sit amet.<p>Nunc vel vehicula ante.</p></p>

I have found hacks that work around this with regexes, but that defeats the purpose of using DOM. I have tested this with several versions of libxml and PHP, most recently libxml 2.9.2 with PHP 5.6.7 (Debian Jessie). Any suggestions are appreciated.

Solution

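The re-arrangement is caused by the LIBXML_HTML_NOIMPLIED option you're using. It looks like it's not stable enough for your case: as your output shows, the second <p> ends up nested inside the first instead of following it as a sibling.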

Also, you might want to avoid it for portability reasons. For example, I have a PHP 5.4.36 build with Libxml 2.7.8 at hand that does not support LIBXML_HTML_NOIMPLIED (Libxml >= 2.7.7), even though it does support the later LIBXML_HTML_NODEFDTD option (Libxml >= 2.7.8).
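Since support for these flags depends on the PHP and libxml build, it can help to check for them before relying on them. The following is a minimal sketch that uses only PHP's built-in constants and functions; the `$options` variable and the printed messages are illustrative and not part of the original answer.

```php
<?php
// Report the libxml version this PHP build is linked against.
echo 'libxml ', LIBXML_DOTTED_VERSION, PHP_EOL;

// Build the option mask only from the constants this build actually defines.
$options = 0;
foreach (array('LIBXML_HTML_NOIMPLIED', 'LIBXML_HTML_NODEFDTD') as $name) {
    if (defined($name)) {
        $options |= constant($name);
    } else {
        echo $name, ' is not available in this build', PHP_EOL;
    }
}
```

If either constant is missing, `$options` simply leaves that bit unset, so the same script can run on older builds and fall back to the wrapper approach described next.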

Here is one way I know of dealing with it. When you load the fragment, wrap it in a <div> element:

$doc->loadHTML("<div>$str</div>");

This helps to guide DOMDocument on the structure you want.

You can then extract this container from the document itself:

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);

And then remove all children from the document:

while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

Now the document is completely empty and you can append children again. Luckily we still have the <div> container element we removed earlier, so we can move its children into the document:

while ($container->firstChild) {
    $doc->appendChild($container->firstChild);
}

The fragment can then be retrieved with the familiar saveHTML method:

echo $doc->saveHTML();

In your scenario this gives:

<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>

This approach is a little different from the existing material on this site (see the references I give below), so here is the complete example in one piece:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

while ($container->firstChild) {
    $doc->appendChild($container->firstChild);
}

echo $doc->saveHTML();
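For repeated use, the same steps can be packaged into a function. This is only a sketch: the name `normalizeFragment` is hypothetical, and the calls to `libxml_use_internal_errors()` and `libxml_clear_errors()` are an addition not found in the original answer, included on the assumption that you want parser warnings kept out of the output.

```php
<?php
// Hypothetical helper: parse an HTML fragment and return it re-serialized,
// using the <div> wrapper technique instead of LIBXML_HTML_NOIMPLIED.
function normalizeFragment($html)
{
    $doc = new DOMDocument();

    // Collect parser warnings internally rather than emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML("<div>$html</div>");
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Detach the wrapper; it is the first <div> in document order.
    $container = $doc->getElementsByTagName('div')->item(0);
    $container = $container->parentNode->removeChild($container);

    // Empty the document (doctype, <html>, ...).
    while ($doc->firstChild) {
        $doc->removeChild($doc->firstChild);
    }

    // Move the fragment's nodes from the wrapper into the document.
    while ($container->firstChild) {
        $doc->appendChild($container->firstChild);
    }

    return $doc->saveHTML();
}

echo normalizeFragment('<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>');
```

Restoring the previous `libxml_use_internal_errors()` setting keeps the helper from changing error handling for the rest of the script.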

For further reading, I also really recommend the reference question "How to saveHTML of DOMDocument without HTML wrapper?" as well as the one about inner-html.
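As a rough illustration of the inner-html idea, the children of the wrapper can also be serialized one by one, without emptying and refilling the document. This is a sketch rather than the method from the referenced questions; the helper name `innerHtml` is hypothetical, and it relies on `saveHTML()` accepting a node argument (available since PHP 5.3.6).

```php
<?php
// Hypothetical helper: serialize only the children of a node ("inner HTML").
function innerHtml(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Passing a node makes saveHTML() serialize just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");

// The wrapper <div> is the first <div> in document order.
$container = $doc->getElementsByTagName('div')->item(0);
echo innerHtml($container), PHP_EOL;
```

This variant only reads from the container, so the document itself stays unchanged. Whitespace between the serialized children may differ from the input depending on the libxml version.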

References
