How to prevent the doctype from being added to the HTML?


Question

I have been using DOM to tidy up messy HTML tags, but I have now run into a bigger problem:

$content = '<p><a href="#">this is a link</a></p>';

function tidy_html($content, $allowable_tags = null, $span_regex = null)
{
    $dom = new DOMDocument();
    $dom->loadHTML($content);

    // other code

    return $dom->saveHTML();
}

echo tidy_html($content);

It outputs the whole document,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 
<html><body><p><a href="#">this is a link</a></p></body></html> 

but I only want something like this returned,

<p><a href="#">this is a link</a></p>

I don't want this part,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 
    <html><body>...</body></html>

Is this possible?

Edit:

The innerHTML simulation generates some strange codes in my database, like &#13;, Â, and ’

<p>Monday July 5th 10am - 3.30pm £20</p>&#13;
<p>Be one of the first visitors to the ...at this special event.Â</p>&#13;
<p>All participants will receive a free copy of the ‘Contemporary Art Kit’ produced exclusively for Art on....</p>&#13;

The innerHTML simulation,

$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach($nodeBody->childNodes as $child) {
  $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}

I found out that the strange codes appear wherever there is a line break, and that they are produced by saveXML($child).

So when I have something like this,

$content = '<p><br/><a href="#">xx</a></p>
<p><br/><a href="#">xx</a></p>';

it returns something like this,

<p><a href="#">xx</a></p>&#13;
<p><a href="#">xx</a></p>

But what I actually want is this,

<p><a href="#">xx</a></p>
<p><a href="#">xx</a></p>


Recommended answer

If you're working on a fragment, you normally need only the body contents.

DOMDocument in PHP does not offer anything like innerHTML. You can simulate it, however:

$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach($nodeBody->childNodes as $child) {
  $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}
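
For the side effects described in the question's edit, the following is a minimal sketch rather than a definitive fix. It rests on assumptions not stated in the original post: that the input is UTF-8 (characters such as Â and ’ typically point to UTF-8 text being read as a single-byte encoding), that PHP 5.3.6 or later is available so saveHTML() accepts a node argument, and that the &#13; references come from carriage returns (\r) in the input. The function name body_inner_html is hypothetical.

```php
<?php
// Hypothetical helper (not from the original post): returns only the markup
// found inside <body>, without the doctype or the <html>/<body> wrappers.
function body_inner_html(string $content): string
{
    // &#13; is the character reference for a carriage return, so normalise
    // CRLF and lone CR line endings to LF before parsing.
    $content = str_replace(["\r\n", "\r"], "\n", $content);

    // Collect parser warnings internally instead of printing them;
    // messy markup tends to trigger many.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: the input is UTF-8. Without a charset hint, loadHTML()
    // treats the bytes as ISO-8859-1. The <meta> ends up in <head>, so it
    // never appears among the children of <body>.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $content
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $innerHTML = '';
    foreach ($body->childNodes as $child) {
        // Serialise each child with HTML rules rather than XML rules.
        $innerHTML .= $dom->saveHTML($child);
    }

    return $innerHTML;
}

echo body_inner_html('<p><a href="#">this is a link</a></p>');
```

Whether this removes every stray character depends on how the text was stored in the first place; if the database content is not UTF-8, the charset hint would need to change accordingly.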

If you just want to repair a fragment, you can also make use of the tidy library (http://php.net/manual/en/book.tidy.php):

$html = tidy_repair_string($html, array('output-xhtml'=>1,'show-body-only'=>1));
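
The tidy call can be wrapped so that it degrades gracefully when the extension is not installed. This is only a sketch: the wrapper name repair_fragment, the fallback behaviour, and the 'utf8' encoding argument are assumptions added for illustration, not part of the original answer.

```php
<?php
// Hypothetical wrapper (not from the original answer) around tidy_repair_string().
function repair_fragment(string $html): string
{
    // The tidy extension is optional; return the input unchanged if it is missing.
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }

    $config = [
        'output-xhtml'   => true, // emit XHTML-style markup
        'show-body-only' => true, // omit the doctype, <html>, <head> and <body>
    ];

    // Assumption: the fragment is UTF-8 encoded.
    $repaired = tidy_repair_string($html, $config, 'utf8');

    // tidy_repair_string() returns false on failure; fall back to the input.
    return $repaired === false ? $html : $repaired;
}

echo repair_fragment('<p><a href="#">this is a link</a>');
```

Returning the unrepaired input on failure keeps the caller simple, but it also hides the failure; throwing an exception instead may be preferable where invalid markup must not reach storage.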
