How to escape HTML


Problem description

I have a string which contains HTML text. I need to escape just the text and not the tags. For example, I have a string which contains:

<ul class="main_nav">
<li>
<a class="className1" id="idValue1" tabindex="2">Test & Sample</a>
</li>
<li>
<a class="className2" id="idValue2" tabindex="2">Test & Sample2</a>
</li>
</ul>

How can I escape just the text, to get:

<ul class="main_nav">
<li>
<a class="className1" id="idValue1" tabindex="2">Test &amp; Sample</a>
</li>
<li>
<a class="className2" id="idValue2" tabindex="2">Test &amp; Sample2</a>
</li>
</ul>

without modifying the tags?

Can this be handled with HTML DOM and javascript?

Thanks

Solution

(See further down for an answer to the question as updated by comments from the OP below)

Can this be handled with HTML DOM and javascript?

No, once the text is in the DOM, the concept of "escaping" it doesn't apply. The HTML source text needs to be escaped so that it's parsed into the DOM correctly; once it's in the DOM, it isn't escaped.

This can be a bit tricky to understand, so let's use an example. Here's some HTML source text (such as in an HTML file that you would view with your browser):

<div>This &amp; That</div>

Once that's parsed into the DOM by the browser, the text within the div is This & That, because the &amp; has been interpreted at that point.

So you'll need to catch this earlier, before the text is parsed into the DOM by the browser. You can't handle it after the fact, it's too late.

Separately, the string you're starting with is invalid if it has things like <div>This & That</div> in it. Pre-processing that invalid string will be tricky. You can't just use built-in features of your environment (PHP or whatever you're using server-side) because they'll escape the tags as well. You'll need to do text processing, extracting only the parts that you want to process and then running those through an escaping process. That process will be tricky. An & followed by whitespace is easy enough, but if there are unescaped entities in the source text, how do you know whether to escape them or not? Do you assume that if the string contains &amp;, you leave it alone? Or turn it into &amp;amp;? (Which is perfectly valid; it's how you show the actual string &amp; in an HTML page.)
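
To make the text-processing idea concrete, here is a minimal sketch, not a robust solution. The function name escapeTextBetweenTags is hypothetical, it assumes tags never contain a literal > inside attribute values, and it takes the "leave anything that already looks like an entity alone" guess discussed above, which may well be the wrong guess for your data.

```javascript
// Heuristic sketch: escape bare ampersands in the text between tags and
// leave the tags themselves untouched.
// Assumption: no tag contains a literal ">" inside an attribute value.
function escapeTextBetweenTags(html) {
    // Split the input into chunks that are either a whole tag or a run of text
    return html.replace(/<[^>]*>|[^<]+/g, function (chunk) {
        if (chunk.charAt(0) === "<") {
            return chunk; // a tag: pass through unchanged
        }
        // Text: escape "&" unless it already looks like a named, decimal,
        // or hexadecimal entity. This is exactly the guess described above.
        return chunk.replace(
            /&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/g,
            "&amp;"
        );
    });
}

console.log(escapeTextBetweenTags('<a class="className1">Test & Sample</a>'));
// prints: <a class="className1">Test &amp; Sample</a>
```

Note that a string such as This &amp; That passes through unchanged, so if the author really meant the literal text "&amp;", this sketch gets it wrong. That ambiguity is why fixing the source of the strings is the better route.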

What you really need to do is correct the underlying problem: The thing creating these invalid, half-encoded strings.


Edit: From our comment stream below, the question is totally different than it seemed from your example (that's not meant critically). To recap the comments for those coming to this fresh, you said that you were getting these strings from WebKit's innerHTML, and I said that was odd, innerHTML should encode & correctly (and pointed you at a couple of test pages that suggested it did). Your reply was:

This works for &. But the same test page do not work for entities like ©, ®, « and many more.

That changes the nature of the question. You want to make entities out of characters that, while perfectly valid when used literally (provided you have your text encoding right), could be expressed as entities instead and therefore made more resilient to text encoding changes.

We can do that. According to the spec, the character values in a JavaScript string are UTF-16 (using Unicode Normalized Form C) and any conversion from the source character encoding (ISO 8859-1, Windows-1252, UTF-8, whatever) is performed before the JavaScript runtime sees it. (If you're not 100% sure you know what I mean by character encoding, it's well worth stopping now, going off and reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, then coming back.) So that's the input side. On the output side, HTML entities identify Unicode code points. So we can convert from JavaScript strings to HTML entities reliably.

The devil is in the detail, though, as always. JavaScript explicitly assumes that each 16-bit value is a character (see section 8.4 in the spec), even though that's not actually true of UTF-16 — one 16-bit value might be a "surrogate" (such as 0xD800) that only makes sense when combined with the next value, meaning that two "characters" in the JavaScript string are actually one character. This isn't uncommon for far Eastern languages.
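
A quick way to see this for yourself, as a small sketch: the variable names are mine, and Array.from on a string assumes an ES2015 or later runtime.

```javascript
// U+1F600 is outside the 16-bit range, so UTF-16 stores it as two code units.
var s = "\uD83D\uDE00";

var unitCount = s.length;              // counts 16-bit code units, not characters
var firstUnit = s.charCodeAt(0);       // the high (first) surrogate
var secondUnit = s.charCodeAt(1);      // the low (second) surrogate
var charCount = Array.from(s).length;  // string iteration goes by code point

console.log(unitCount, firstUnit.toString(16), secondUnit.toString(16), charCount);
// prints: 2 d83d de00 1
```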

So a robust conversion that starts with a JavaScript string and results in an HTML entity can't assume that a JavaScript "character" actually equals a character in the text, it has to handle surrogates. Fortunately, doing so is dead easy because the smart people defining Unicode made it dead easy: The first surrogate value is always in the range 0xD800-0xDBFF (inclusive), and the second surrogate is always in the range 0xDC00-0xDFFF (inclusive). So any time you see a pair of "characters" in a JavaScript string that match those ranges, you're dealing with a single character defined by a surrogate pair. The formulae for converting from the pair of surrogate values to a code point value are given in the above links, although fairly obtusely; I find this page much more approachable.
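
The conversion itself is short. Here is a sketch using the ranges just described; the helper name codePointFromSurrogates is hypothetical, not something built into JavaScript.

```javascript
// Combine a high and a low surrogate (two 16-bit values) into one code point.
function codePointFromSurrogates(highUnit, lowUnit) {
    if (highUnit < 0xD800 || highUnit > 0xDBFF ||
        lowUnit < 0xDC00 || lowUnit > 0xDFFF) {
        throw new RangeError("not a surrogate pair");
    }
    // Each surrogate carries 10 bits; the result is offset by 0x10000
    return (highUnit - 0xD800) * 0x400 + (lowUnit - 0xDC00) + 0x10000;
}

var pair = "\uD83D\uDE00";
var cp = codePointFromSurrogates(pair.charCodeAt(0), pair.charCodeAt(1));
console.log(cp.toString(16)); // prints: 1f600
```

On runtimes that have String.prototype.codePointAt (ES2015 and later), pair.codePointAt(0) should give the same number, which makes a handy cross-check.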

Armed with all of this information, we can write a function that will take a JavaScript string and search for characters (real characters, which may be one or two "characters" long) you might want to turn into entities, replacing them with named entities from a map or numeric entities if we don't have them in our named map:

// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
    "160": "&nbsp;",
    "161": "&iexcl;",
    "162": "&cent;",
    "163": "&pound;",
    "164": "&curren;",
    "165": "&yen;",
    "166": "&brvbar;",
    "167": "&sect;",
    "168": "&uml;",
    "169": "&copy;",
    // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
    "8364": "&euro;"    // Last one must not have a comma after it, IE doesn't like trailing commas
};

// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
    // The regular expression below uses an alternation to look for a surrogate pair _or_
    // a single character that we might want to make an entity out of. The first part of the
    // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
    // alone, it searches for the surrogates. The second part of the alternation you can
    // adjust as you see fit, depending on how conservative you want to be. The example
    // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
    // character with a value from 0 to 31 ("control characters") or above 127 -- e.g., if
    // it's not "printable ASCII" (in the old parlance), convert it. That's probably
    // overkill, but you said you wanted to make entities out of things, so... :-)
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
        var high, low, charValue, rep;

        // Get the character value, handling surrogate pairs
        if (match.length == 2) {
            // It's a surrogate pair, calculate the Unicode code point
            high = match.charCodeAt(0) - 0xD800;
            low  = match.charCodeAt(1) - 0xDC00;
            charValue = (high * 0x400) + low + 0x10000;
        }
        else {
            // Not a surrogate pair, the value *is* the Unicode code point
            charValue = match.charCodeAt(0);
        }

        // See if we have a mapping for it
        rep = entityMap[charValue];
        if (!rep) {
            // No, fall back to a decimal numeric entity built from the code point
            rep = "&#" + charValue + ";";
        }

        // Return replacement
        return rep;
    });
}

You should be fine passing all of the HTML through it, since if these characters appear in attribute values, you almost certainly want to encode them there as well.
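
For a sense of what comes out the other end, here is a condensed variant with a two-entry map. It is a sketch only: it uses the same regular expression as above, but swaps the manual surrogate arithmetic for codePointAt, which assumes an ES2015 or later runtime.

```javascript
// Deliberately tiny map, for illustration only
var entityMap = { "169": "&copy;", "8364": "&euro;" };

function prepEntities(str) {
    return str.replace(
        /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g,
        function (match) {
            // codePointAt(0) combines a surrogate pair into a single code point
            var charValue = match.codePointAt(0);
            // Named entity if we have one, otherwise a decimal numeric entity
            return entityMap[charValue] || "&#" + charValue + ";";
        }
    );
}

// Copyright sign, euro sign, cedilla (not in the map), and U+1F600 (a surrogate pair)
console.log(prepEntities("\u00A9 5\u20AC \u00B8 \uD83D\uDE00"));
// prints: &copy; 5&euro; &#184; &#128512;
```

Plain ASCII, including <, > and &, passes through untouched, so tags survive the trip.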

I have not used the above in production (I actually wrote it for this answer, because the problem intrigued me) and it is supplied without warranty of any kind. I have tried to ensure that it handles surrogate pairs because that's necessary for far Eastern languages, and supporting them is something we should all be doing now that the world has gotten smaller.

Complete example page:

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Test Page</title>
<style type='text/css'>
body {
    font-family: sans-serif;
}
#log p {
    margin:     0;
    padding:    0;
}
</style>
<script type='text/javascript'>

// Make the function available as a global, but define it within a scoping
// function so we can have data (the `entityMap`) that only it has access to
var prepEntities = (function() {

    // A map of the entities we want to handle.
    // The numbers on the left are the Unicode code point values; their
    // matching named entity strings are on the right.
    var entityMap = {
        "160": "&nbsp;",
        "161": "&iexcl;",
        "162": "&cent;",
        "163": "&pound;",
        "164": "&curren;",
        "165": "&yen;",
        "166": "&brvbar;",
        "167": "&sect;",
        "168": "&uml;",
        "169": "&copy;",
        // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
        "8364": "&euro;"    // Last one must not have a comma after it, IE doesn't like trailing commas
    };

    // The function to do the work.
    // Accepts a string, returns a string with replacements made.
    function prepEntities(str) {
        // The regular expression below uses an alternation to look for a surrogate pair _or_
        // a single character that we might want to make an entity out of. The first part of the
        // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
        // alone, it searches for the surrogates. The second part of the alternation you can
        // adjust as you see fit, depending on how conservative you want to be. The example
        // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
        // character with a value from 0 to 31 ("control characters") or above 127 -- e.g., if
        // it's not "printable ASCII" (in the old parlance), convert it. That's probably
        // overkill, but you said you wanted to make entities out of things, so... :-)
        return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
            var high, low, charValue, rep;

            // Get the character value, handling surrogate pairs
            if (match.length == 2) {
                // It's a surrogate pair, calculate the Unicode code point
                high = match.charCodeAt(0) - 0xD800;
                low  = match.charCodeAt(1) - 0xDC00;
                charValue = (high * 0x400) + low + 0x10000;
            }
            else {
                // Not a surrogate pair, the value *is* the Unicode code point
                charValue = match.charCodeAt(0);
            }

            // See if we have a mapping for it
            rep = entityMap[charValue];
            if (!rep) {
                // No, fall back to a decimal numeric entity built from the code point
                rep = "&#" + charValue + ";";
            }

            // Return replacement
            return rep;
        });
    }

    // Return the function reference out of the scoping function to publish it
    return prepEntities;
})();

function go() {
    var d = document.getElementById('d1');
    var s = d.innerHTML;
    alert("Before: " + s);
    s = prepEntities(s);
    alert("After: " + s);
}

</script>
</head>
<body>
<div id='d1'>Copyright: &copy; Yen: &yen; Cedilla: &cedil; Surrogate pair: &#65536;</div>
<input type='button' id='btnGo' value='Go' onclick="return go();">
</body>
</html>

There I've included the cedilla as an example of converting to a numeric entity rather than a named one (since I left cedil out of my very small example map). And note that the surrogate pair at the end shows up in the first alert as two "characters" because of the way JavaScript handles UTF-16.
