Strict HTML parsing in JavaScript
Question
On Google Chrome (Canary), it seems no string can make the DOM parser fail. I'm trying to parse some HTML, but if the HTML isn't completely, 100%, valid, I want it to display an error. I've tried the obvious:
var newElement = document.createElement('div');
newElement.innerHTML = someMarkup; // Might fail on IE, never on Chrome.
I've also tried the method in this question. It doesn't fail for invalid markup either, not even for the most invalid markup I can produce.
So, is there some way to parse HTML "strictly" in Google Chrome at least? I don't want to resort to tokenizing it myself or using an external validation utility. If there's no other alternative, a strict XML parser is fine, but certain elements don't require closing tags in HTML, and preferably those shouldn't fail.
Recommended answer
Use the DOMParser to check a document in two steps:
- Validate whether the document is XML-conforming, by parsing it as XML.
- Parse the string as HTML. In browsers whose DOMParser does not support text/html, this requires a patch to the DOMParser. Loop through each element, and check whether the DOM element is an instance of HTMLUnknownElement. For this purpose, getElementsByTagName('*') fits well.
(If you want to strictly parse the document, you have to recursively loop through each element, and remember whether the element is allowed to be placed at that location, e.g. <area> in <map>.)
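The recursive placement check mentioned in the parenthetical could be sketched roughly as follows. This is not part of the original answer: the function name findMisplaced and the rule table ALLOWED_PARENTS are hypothetical, the table holds only a few illustrative entries, and it checks the direct parent only, which is a simplification of the real HTML content model.

```javascript
// Hypothetical, deliberately tiny rule table: child tag -> allowed direct parents.
// Assumption: direct-parent checks only; the real HTML content model is richer.
const ALLOWED_PARENTS = new Map([
  ['area', ['map']],
  ['li', ['ul', 'ol', 'menu']],
  ['td', ['tr']],
]);

// Walks a tree of nodes exposing `localName` and `children` (DOM elements do,
// and so do plain objects of the same shape) and collects misplaced elements.
function findMisplaced(node, parentName = null, found = []) {
  const name = node.localName;
  const allowed = ALLOWED_PARENTS.get(name);
  if (allowed && !allowed.includes(parentName)) {
    found.push(name + ' inside ' + (parentName || '(no parent)'));
  }
  for (const child of Array.from(node.children || [])) {
    findMisplaced(child, name, found);
  }
  return found;
}

// Usage with a plain-object tree; in a browser, a parsed document's
// documentElement could be passed instead.
const sampleTree = {
  localName: 'body',
  children: [
    { localName: 'map', children: [{ localName: 'area', children: [] }] },
    { localName: 'div', children: [{ localName: 'area', children: [] }] },
  ],
};
console.log(findMisplaced(sampleTree));
```

Because an HTML parser may move or drop some misplaced elements while building the tree, the result on an HTML-parsed document is best treated as a lower bound.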
Demo: http://jsfiddle.net/q66Ep/1/
/* DOM parser for text/html, see https://stackoverflow.com/a/9251106/938089 */
;(function(DOMParser) {"use strict";var DOMParser_proto=DOMParser.prototype,real_parseFromString=DOMParser_proto.parseFromString;try{if((new DOMParser).parseFromString("", "text/html"))return;}catch(e){}DOMParser_proto.parseFromString=function(markup,type){if(/^\s*text\/html\s*(;|$)/i.test(type)){var doc=document.implementation.createHTMLDocument(""),doc_elt=doc.documentElement,first_elt;doc_elt.innerHTML=markup;first_elt=doc_elt.firstElementChild;if (doc_elt.childElementCount===1&&first_elt.localName.toLowerCase()==="html")doc.replaceChild(first_elt,doc_elt);return doc;}else{return real_parseFromString.apply(this, arguments);}};}(DOMParser));
/*
 * @description Validate an HTML string
 * @param String html The HTML string to be validated
 * @returns null  If the string is not well-formed XML
 *          false If the string contains an unknown element
 *          true  If the string satisfies both conditions
 */
function validateHTML(html) {
    var parser = new DOMParser()
      , d = parser.parseFromString('<?xml version="1.0"?>' + html, 'text/xml')
      , allnodes;
    if (d.querySelector('parsererror')) {
        console.log('Not well-formed HTML (XML)!');
        return null;
    } else {
        /* To use text/html, see https://stackoverflow.com/a/9251106/938089 */
        d = parser.parseFromString(html, 'text/html');
        allnodes = d.getElementsByTagName('*');
        for (var i = allnodes.length - 1; i >= 0; i--) {
            if (allnodes[i] instanceof HTMLUnknownElement) return false;
        }
    }
    return true; /* The document is syntactically correct, all tags are closed */
}
console.log(validateHTML('<div>')); // null, because of the missing close tag
console.log(validateHTML('<x></x>'));// false, because it's not a HTML element
console.log(validateHTML('<a></a>'));// true, because the tag is closed,
// and the element is a HTML element
See revision 1 of this answer for an alternative to XML validation without the DOMParser.
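- The current method completely ignores the doctype for validation.
- This method returns null for markup that is valid HTML5 but leaves a tag unclosed, because such markup is not well-formed XML.
- Conformance is not checked.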
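One possible workaround for the unclosed-tag limitation, which is not part of the original answer, is to rewrite void elements as self-closing tags before the XML step. The sketch below assumes the void element list from the HTML Standard; the helper name selfCloseVoidElements is hypothetical. It is regex-based, so it does not handle a ">" inside an attribute value, nor tags that appear inside comments or scripts.

```javascript
// Void elements per the HTML Standard: they never take an end tag.
const VOID_ELEMENTS = [
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
];

// Matches a void element start tag, with or without a trailing slash.
// The lookahead keeps names such as "colgroup" from matching "col".
const VOID_TAG = new RegExp(
  '<(' + VOID_ELEMENTS.join('|') + ')(?=[\\s/>])([^<>]*?)\\s*/?>',
  'gi'
);

// Hypothetical helper: returns the markup with every void element start tag
// rewritten in self-closing form, so an XML parser can accept it.
function selfCloseVoidElements(html) {
  return html.replace(VOID_TAG, '<$1$2/>');
}

console.log(selfCloseVoidElements('<p>line one<br>line two</p>'));
```

In a browser, the result could then be passed on, for example validateHTML(selfCloseVoidElements(someMarkup)). Elements whose end tag is merely optional in HTML, such as li or p, would still fail the XML step.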