Strict HTML parsing in JavaScript

Question

On Google Chrome (Canary), it seems no string can make the DOM parser fail. I'm trying to parse some HTML, but if the HTML isn't completely, 100%, valid, I want it to display an error. I've tried the obvious:

var newElement = document.createElement('div');
newElement.innerHTML = someMarkup; // Might fail on IE, never on Chrome.

I've also tried the method in this question. It doesn't fail for invalid markup either, not even for the most invalid markup I can produce.

So, is there some way to parse HTML "strictly" in Google Chrome at least? I don't want to resort to tokenizing it myself or using an external validation utility. If there's no other alternative, a strict XML parser is fine, but certain elements don't require closing tags in HTML, and preferably those shouldn't fail.

Recommended answer

Use the DOMParser to check a document in two steps:

  1. Validate whether the document is XML-conforming, by parsing it as XML.
  2. Parse the string as HTML. This requires a modification on the DOMParser.
    Loop through each element, and check whether the DOM element is an instance of HTMLUnknownElement. For this purpose, getElementsByTagName('*') fits well.
    (If you want to strictly parse the document, you have to recursively loop through each element, and remember whether the element is allowed to be placed at that location, e.g. <area> in <map>.)
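
The parenthetical in step 2 can be sketched as a recursive walk. This is only an illustration and not part of the original answer: `REQUIRED_ANCESTORS` and `findMisplaced` are hypothetical names, the rule table holds three sample entries rather than the full HTML content model, and a required-ancestor test is weaker than the placement rules in the HTML specification.

```javascript
// Hypothetical, deliberately tiny rule table: element name -> ancestor names,
// at least one of which must be present. Not the real HTML content model.
var REQUIRED_ANCESTORS = {
    area: ['map'],
    td: ['tr'],
    li: ['ul', 'ol', 'menu']
};

// Recursively walks `node` and its descendants and returns the names of
// elements that lack a required ancestor. Only `localName` and `children`
// are read, so DOM elements and plain objects of the same shape both work.
function findMisplaced(node, ancestors, found) {
    ancestors = ancestors || [];
    found = found || [];
    var name = String(node.localName).toLowerCase();
    if (Object.prototype.hasOwnProperty.call(REQUIRED_ANCESTORS, name)) {
        var allowed = REQUIRED_ANCESTORS[name];
        var ok = allowed.some(function (a) {
            return ancestors.indexOf(a) !== -1;
        });
        if (!ok) found.push(name);
    }
    var kids = node.children || [];
    var path = ancestors.concat(name); // ancestors as seen by the children
    for (var i = 0; i < kids.length; i++) {
        findMisplaced(kids[i], path, found);
    }
    return found;
}

// Plain objects stand in for DOM elements here.
var tree = { localName: 'div', children: [
    { localName: 'area', children: [] },
    { localName: 'map', children: [{ localName: 'area', children: [] }] }
] };
console.log(findMisplaced(tree)); // only the first <area> is reported
```

In a browser, the same function could be called on the `documentElement` of the document returned by `parseFromString(html, 'text/html')`.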

Demo: http://jsfiddle.net/q66Ep/1/

/* DOM parser for text/html, see https://stackoverflow.com/a/9251106/938089 */
;(function(DOMParser) {
    "use strict";
    var DOMParser_proto = DOMParser.prototype
      , real_parseFromString = DOMParser_proto.parseFromString;

    /* Leave the native implementation alone if it already handles text/html */
    try {
        if ((new DOMParser).parseFromString("", "text/html")) return;
    } catch (e) {}

    DOMParser_proto.parseFromString = function(markup, type) {
        if (/^\s*text\/html\s*(;|$)/i.test(type)) {
            var doc = document.implementation.createHTMLDocument("")
              , doc_elt = doc.documentElement
              , first_elt;
            doc_elt.innerHTML = markup;
            first_elt = doc_elt.firstElementChild;
            /* If the markup carried its own <html> root, promote it */
            if (doc_elt.childElementCount === 1
                && first_elt.localName.toLowerCase() === "html")
                doc.replaceChild(first_elt, doc_elt);
            return doc;
        } else {
            return real_parseFromString.apply(this, arguments);
        }
    };
}(DOMParser));

/*
 * @description              Validate an HTML string
 * @param       String html  The HTML string to be validated
 * @returns            null  If the string is not well-formed XML
 *                    false  If the string contains an unknown element
 *                     true  If the string satisfies both conditions
 */
function validateHTML(html) {
    var parser = new DOMParser()
      , d = parser.parseFromString('<?xml version="1.0"?>'+html,'text/xml')
      , allnodes;
    if (d.querySelector('parsererror')) {
        console.log('Not well-formed HTML (XML)!');
        return null;
    } else {
        /* To use text/html, see https://stackoverflow.com/a/9251106/938089 */
        d = parser.parseFromString(html, 'text/html');
        allnodes = d.getElementsByTagName('*');
        for (var i=allnodes.length-1; i>=0; i--) {
            if (allnodes[i] instanceof HTMLUnknownElement) return false;
        }
    }
    return true; /* The document is syntactically correct, all tags are closed */
}

console.log(validateHTML('<div>'));  //  null, because of the missing close tag
console.log(validateHTML('<x></x>'));// false, because it's not an HTML element
console.log(validateHTML('<a></a>'));//  true, because the tag is closed,
                                     //       and the element is an HTML element

See revision 1 of this answer for an alternative to XML validation without the DOMParser.

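  • The current method completely ignores the doctype during validation.
  • This method returns null for markup that is valid HTML5 but leaves a tag unclosed.
  • Conformance is not checked.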
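
One possible workaround for the unclosed-tag limitation noted above is to self-close void elements before the XML pass. The sketch below is an assumption layered on top of the answer, not something it proposes: `closeVoidElements` is a hypothetical helper, the regular expression is a heuristic that breaks on attribute values containing `<` or `>`, and unquoted or boolean attributes will still fail XML parsing.

```javascript
// Void elements as listed in the HTML Standard (plus the obsolete `param`);
// these never take a closing tag in HTML.
var VOID_ELEMENTS = 'area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr';

// Matches an opening tag of a void element, with or without a trailing slash.
// Heuristic only: attribute values containing '<' or '>' are not handled.
var VOID_TAG = new RegExp(
    '<(' + VOID_ELEMENTS + ')(?=[\\s/>])([^<>]*?)\\s*/?>', 'gi');

// Hypothetical helper: rewrites <br> or <br /> to <br/> so that an XML
// parser accepts void elements that HTML allows to stay unclosed.
function closeVoidElements(html) {
    return html.replace(VOID_TAG, '<$1$2/>');
}

console.log(closeVoidElements('<p><br><input type="text"></p>'));
```

The result could then be passed to the `text/xml` parse in `validateHTML`, while the `text/html` parse keeps using the original string.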