How do you format DOM structures in PHP?


Question


My first guess was the PHP DOM classes (with the formatOutput parameter). However, I cannot get this block of HTML to be formatted and output correctly. As you can see, the indentation and alignment are not correct.

$html = '
<html>
<body>
<div>

<div>

        <div>

                <p>My Last paragraph</p>
            <div>
                            This is another text block and some other stuff.<br><br>
                Again we will start a new paragraph
                            and some other stuff
                            <br>
        </div>
</div>
        <div>
                        <div>
                            <h1>Another Title</h1>
                                                    </div>
                        <p>Some text again <b>for sure</b></p>
                </div>
</div>
<div>
    <pre><code>
    <span>&lt;html&gt;</span>
        <span>&lt;head&gt;</span>
            <span>&lt;title&gt;</span>
                Page Title
            <span>&lt;/title&gt;</span>
            <span>&lt;/head&gt;</span>
    <span>&lt;/html&gt;</span>
    </code></pre>
</div>
</div>
</body>
</html>';

header('Content-Type: text/plain');
libxml_use_internal_errors(TRUE);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
print $dom->saveHTML();

Update: I added a pre-formatted code block to the example.

Solution

Here are some improvements over @hijarian's answer:

LibXML Errors

If you don't call libxml_use_internal_errors(true), PHP will output all HTML errors found. If you do call that function, the errors are not discarded; they are collected on an internal stack that you can inspect by calling libxml_get_errors(). The problem is that this stack eats memory, and DOMDocument is known to be very picky. If you're processing lots of files in batch, you will eventually run out of memory. There are two solutions for this:

if (libxml_use_internal_errors(true) === true)
{
    libxml_clear_errors();
}

Since libxml_use_internal_errors(true) returns the previous value of this setting (default false), this has the effect of only clearing errors if you run it more than once (as in batch processing).

The other option is to pass the LIBXML_NOERROR | LIBXML_NOWARNING flags to the loadHTML() method. Unfortunately, for reasons that are unknown to me, this still leaves a couple of errors behind.

Bear in mind that DOMDocument will always output an error (even when using internal libxml errors and setting the suppressing flags) if you pass an empty (or blankish) string to the load*() methods.
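The error handling described above can be sketched roughly as follows. This is only an illustration: the helper name load_html_quietly() is hypothetical and not part of the original answer, and it assumes you want the collected errors returned rather than thrown away.

```php
<?php
// Hypothetical helper (an assumption for illustration, not from the original answer).
// Parses one HTML string, returns the document plus the libxml errors it produced,
// and leaves the libxml error stack empty so batch runs do not accumulate memory.
function load_html_quietly(string $html): array
{
    // Blank input makes the load*() methods complain regardless of flags, so skip it.
    if (trim($html) === '') {
        return [null, []];
    }

    // Collect errors internally and remember the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Copy the errors out, then empty the stack and restore the old setting.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$dom, $errors];
}

[$dom, $errors] = load_html_quietly('<p>Hello <b>world</b></p>');
```

Clearing the stack after every document, rather than only on repeated calls, trades a little extra work for predictable memory use in long batch jobs.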

Regex

The regex />\s*</im doesn't make a whole lot of sense. It's better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), only replaces when whitespace actually exists (+ instead of *), does so without giving back (++, a possessive quantifier, which is faster), and drops the case-insensitive overhead (since whitespace has no case).

You may also want to normalize newlines to \n, and deal with other control characters as well (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for instance.

DOMDocument::$preserveWhiteSpace is useless and unnecessary after running the above regex.
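As a small sketch of the two steps just described (newline normalization, then removal of whitespace between tags), assuming a hypothetical helper named collapse_tag_whitespace() that is not part of the original answer:

```php
<?php
// Hypothetical helper (an assumption for illustration).
// Step 1 turns every line-break sequence into "\n"; step 2 removes whitespace
// that sits directly between a closing ">" and the next "<".
function collapse_tag_whitespace(string $html): string
{
    // \R matches \r\n, \r, \n and other line-break sequences. The /u modifier keeps
    // multi-byte UTF-8 characters intact; fall back to the input if it is invalid UTF-8.
    $html = preg_replace('~\R~u', "\n", $html) ?? $html;

    // Possessive ++ requires at least one whitespace character and never backtracks.
    return preg_replace('~>[[:space:]]++<~', '><', $html) ?? $html;
}

$compact = collapse_tag_whitespace("<div>\r\n  <p>a b</p>\r\n</div>");
```

Whitespace inside text nodes (such as the space in "a b") is left alone, because the pattern only matches runs that are bounded by ">" and "<".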

Oh, and I don't see the need to protect blank pre-like tags here. Whitespace-only snippets are useless.

Additional Flags for loadHTML()

  • LIBXML_COMPACT - "this may speed up your application without needing to change the code"
  • LIBXML_NOBLANKS - need to run more tests on this one
  • LIBXML_NOCDATA - need to run more tests on this one
  • LIBXML_NOXMLDECL - documented, but not implemented =(

UPDATE: Setting any of these options will have the effect of not formatting the output.

On saveXML()

The DOMDocument::saveXML() method will output the XML declaration. We need to manually purge it (since the LIBXML_NOXMLDECL isn't implemented). To do that, we could use a combination of substr() + strpos() to look for the first line break or even use a regex to clean it up.
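A regex-based cleanup along those lines might look like this. The function name strip_xml_declaration() is made up for the example, and the pattern assumes the declaration, if present, is the very first thing in the string:

```php
<?php
// Hypothetical helper (an assumption for illustration).
// Removes a leading XML declaration and the whitespace that follows it.
function strip_xml_declaration(string $xml): string
{
    return preg_replace('~^<\?xml[^>]*\?>\s*~', '', $xml) ?? $xml;
}

$clean = strip_xml_declaration("<?xml version=\"1.0\" standalone=\"yes\"?>\n<html/>");
```

Input without a declaration passes through unchanged, so the helper is safe to call unconditionally.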

Another option, which seems to have an added benefit, is simply doing:

$dom->saveXML($dom->documentElement);

Another thing: if you have inline tags that are empty, such as the b, i or li in:

<b class="carret"></b>
<i class="icon-dashboard"></i> Dashboard
<li class="divider"></li>

The saveXML() method will seriously mangle them (placing the following element inside the empty one), messing up your whole HTML. Tidy also has a similar problem, except that it just drops the node.

To fix that, you can use the LIBXML_NOEMPTYTAG flag along with saveXML():

$dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

This option will convert empty (aka self-closing) tags to inline tags and allow empty inline tags as well.
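To see the difference for yourself, a short experiment such as the following can help. The sample markup and variable names are assumptions chosen for the illustration:

```php
<?php
// Collect parser messages internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<html><body><p><i class="icon"></i> Dashboard</p></body></html>');

// Passing a node skips the XML declaration. Without flags, the empty <i> element
// is serialized in its collapsed, self-closing form.
$collapsed = $dom->saveXML($dom->documentElement);

// With LIBXML_NOEMPTYTAG, empty elements are written as explicit open/close pairs.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

libxml_clear_errors();
```

Printing $collapsed and $expanded side by side shows why the flag matters for empty inline elements: a browser parsing the collapsed form as HTML treats the self-closing i as an unclosed opening tag.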

Fixing HTML[5]

With all the stuff we did so far, our HTML output has two major problems now:

  1. no DOCTYPE (it was stripped when we used $dom->documentElement)
  2. empty tags are now inline tags, meaning one <br /> turned into two (<br></br>) and so on

Fixing the first one is fairly easy, since HTML5 is pretty permissive:

"<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

The second one takes a bit more work. The empty tags we need to get back are the following:

  • area
  • base
  • basefont (deprecated in HTML5)
  • br
  • col
  • command
  • embed
  • frame (deprecated in HTML5)
  • hr
  • img
  • input
  • keygen
  • link
  • meta
  • param
  • source
  • track
  • wbr

We can either use str_[i]replace in a loop:

foreach (explode('|', 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr') as $tag)
{
    // Turn "<br></br>" back into "<br />" by replacing the "></br>" part.
    $html = str_ireplace('></' . $tag . '>', ' />', $html);
}

Or a regular expression:

$html = preg_replace('~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~i', ' />', $html);

This is a costly operation. I haven't benchmarked the two approaches, so I can't tell you which one performs better, but I would guess preg_replace(). Additionally, I wasn't sure whether the case-insensitive version is needed, since I was under the impression that XML tags are always lowercased. UPDATE: Tags are always lowercased.

On <script> and <style> Tags

These tags will always have their content (if existent) encapsulated into (uncommented) CDATA blocks, which will probably break their meaning. You'll have to replace those tokens with a regular expression.
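A minimal sketch of that cleanup is shown below. It swaps the regular expression for a plain str_replace(), since the markers are fixed strings; the function name strip_cdata_markers() is hypothetical. It also assumes the script or style content never legitimately contains the "]]>" sequence, which would be removed too.

```php
<?php
// Hypothetical helper (an assumption for illustration).
// Drops the CDATA opening and closing markers that saveXML() wraps around
// <script> and <style> content, leaving the content itself untouched.
function strip_cdata_markers(string $xml): string
{
    return str_replace(['<![CDATA[', ']]>'], '', $xml);
}

$script = strip_cdata_markers('<script><![CDATA[var a = 1;]]></script>');
```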

Implementation

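function DOM_Tidy($html)
{
    $dom = new \DOMDocument();

    // Collect libxml errors internally; clear the stack if an earlier call already enabled this.
    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

    // Encode non-ASCII characters as entities so loadHTML() does not misread UTF-8.
    // Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    // Normalize line breaks to \n and drop whitespace between tags.
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

    // Skip empty input, which would make loadHTML() raise an error.
    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        // Serialize from the root element (no XML declaration) and keep empty inline tags expanded.
        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                // Remove the CDATA markers wrapped around script and style content.
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                // Collapse void elements such as <br></br> back to <br />.
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}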
