DOM parser that allows HTML5-style </ in <script> tag
Update: html5lib (bottom of question) seems to get close; I just need to improve my understanding of how it's used.
I am attempting to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:
<script type="text/x-jquery-tmpl" id="foo">
<table><tr><td>${name}</td></tr></table>
</script>
Most parsers end parsing prematurely because HTML 4.01 ends script content at the first ETAGO (</) inside a <script> tag. HTML5, however, allows </ to appear before </script>. All of the parsers I have tried so far have either failed or are so poorly documented that I haven't figured out whether they work.
My requirements:
- Real parser, not regex hacks.
- Ability to load full pages or HTML fragments.
- Ability to pull script contents back out, selecting by the tag's id attribute.
Input:
<script id="foo"><td>bar</td></script>
Example of failing output (no closing </td>):
<script id="foo"><td>bar</script>
Some parsers and their results:
DOMDocument (fails)
Source:
<?php
header('Content-type: text/plain');
$d = new DOMDocument;
$d->loadHTML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();
Output:
Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script id="foo"><td>bar</script></head></html>
FluentDOM (fails)
Source:
<?php
header('Content-type: text/plain');
require_once 'FluentDOM/src/FluentDOM.php';
$html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
echo FluentDOM($html, 'text/html');
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head></head><body><script id="foo"><td></script></body></html>
phpQuery (fails)
Source:
<?php
header('Content-type: text/plain');
require_once 'phpQuery.php';
phpQuery::newDocumentHTML(<<<EOF
<script type="text/x-jquery-tmpl" id="foo">
<td>test</td>
</script>
EOF
);
echo (string)pq('#foo');
Output:
<script type="text/x-jquery-tmpl" id="foo">
<td>test
</script>
html5lib (passes)
Possibly promising. Can I get at the contents of the script#foo tag?
Source:
<?php
header('Content-type: text/plain');
include 'HTML5/Parser.php';
$html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
$d = HTML5_Parser::parse($html);
echo $d->saveHTML();
Output:
<html><head></head><body><script id="foo"><td></td></script></body></html>
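To make the question concrete, here is a minimal sketch of pulling a script's contents out of a DOMDocument by id. It assumes that HTML5_Parser::parse() returns a DOMDocument (the saveHTML() call above suggests so) and that the script body is stored as a single text node. The helper name scriptTextById is hypothetical, and the stand-in document is built by hand so the example runs without html5lib.

```php
<?php
// Hypothetical helper: return the raw text inside <script id="..."> or null.
// local-name() is used so the query matches whether or not the parser
// places elements in a namespace. Assumes $id contains no double quotes.
function scriptTextById(DOMDocument $doc, $id)
{
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[local-name()="script" and @id="' . $id . '"]');
    if ($nodes->length === 0) {
        return null;
    }
    // For a script element whose body is a text node, textContent is the
    // unescaped template markup.
    return $nodes->item(0)->textContent;
}

// Stand-in for a parsed document: a script element holding markup as text.
$doc = new DOMDocument;
$script = $doc->createElement('script');
$script->setAttribute('id', 'foo');
$script->appendChild($doc->createTextNode('<td>bar</td>'));
$doc->appendChild($script);

echo scriptTextById($doc, 'foo'), "\n";
```

If the assumption holds, passing the result of HTML5_Parser::parse($html) in place of the hand-built $doc should give the template text for script#foo.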
I had the same problem, and apparently you can hack your way through this by loading the document as XML and saving it as HTML :)
$d = new DOMDocument;
$d->loadXML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();
But of course the markup must be well-formed XML for loadXML to work.
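With the XML route, the template is parsed into child elements of the script node rather than kept as text, so reading it back means serializing those children. A minimal sketch using only DOMDocument and DOMXPath follows; the helper name innerMarkupById is hypothetical.

```php
<?php
// Hypothetical helper: serialize the children of <script id="..."> back
// into a markup string, or return null if no such element exists.
// Assumes $id contains no double quotes.
function innerMarkupById(DOMDocument $doc, $id)
{
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//script[@id="' . $id . '"]');
    if ($nodes->length === 0) {
        return null;
    }
    $markup = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        // saveXML() with a node argument serializes only that subtree.
        $markup .= $doc->saveXML($child);
    }
    return $markup;
}

$d = new DOMDocument;
$d->loadXML('<script id="foo"><td>bar</td></script>');

echo innerMarkupById($d, 'foo'), "\n";
```

XPath is used instead of getElementById() because, without a DTD, loadXML does not treat the id attribute as an ID. Note that saveXML() writes empty elements in self-closing form, so the result may differ slightly from the original template text.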