What are all the HTML escaping contexts?


Question

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):

<div>This is regular text</div>

As well as within the values of attributes:

<input value="this is value text">

And, I believe, within HTML comments:

<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->

Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.

The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)

Tangentially: The following text will throw errors as HTML 4.01 Strict:

<a href="http://example.com/file.php?x=1&y=2">foo</a>

Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.

Answer

<div>This is regular text</div>

Text content: & must be escaped. < must be escaped.

If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
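As a rough illustration of the encoding point, Python's built-in `xmlcharrefreplace` error handler turns any character the target encoding cannot represent into a numeric character reference. The function name `encode_for_charset` is made up for this sketch, and it assumes markup escaping of & and < has already been applied to the text.

```python
def encode_for_charset(text: str, encoding: str) -> bytes:
    """Encode text, replacing characters the encoding cannot hold
    with numeric character references such as &#8364;."""
    # Run this AFTER markup escaping; otherwise the '&' of the generated
    # references would itself be escaped by a later pass.
    return text.encode(encoding, errors="xmlcharrefreplace")


ascii_bytes = encode_for_charset("café €", "ascii")
latin1_bytes = encode_for_charset("café €", "iso-8859-1")
```

With ASCII both non-ASCII characters become references; with ISO-8859-1 only the euro sign does, because é fits in that encoding.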

In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
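A minimal sketch of a text-content escaper following the rules above. The name `escape_text` is hypothetical, and escaping every > is the Canonical XML style choice mentioned above rather than an HTML requirement.

```python
def escape_text(text: str) -> str:
    """Escape a string for use as HTML or XHTML text content."""
    # '&' must be handled first, otherwise the '&' in '&lt;' would be
    # escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    # Optional in HTML; guarantees that ']]>' never appears literally,
    # which matters for XHTML and XML.
    text = text.replace(">", "&gt;")
    return text


snippet = "<div>" + escape_text("if a < b && c[d[0]]> 1") + "</div>"
```

Doing the ampersand replacement first is the one ordering constraint that matters here.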

<input value="this is value text">

Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.

Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.

In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.

To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
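The attribute rules above could be sketched as follows, assuming the value is always written with " as the delimiter (as Canonical XML does). The function names are invented for this example.

```python
def escape_attribute(value: str) -> str:
    """Escape a string for use inside a double-quoted attribute value."""
    replacements = [
        ("&", "&amp;"),    # must come first
        ('"', "&quot;"),   # the delimiter this sketch always uses
        ("<", "&lt;"),     # required in XHTML, advisable in HTML4
        ("\t", "&#9;"),    # character references survive attribute
        ("\n", "&#10;"),   # value normalisation, raw whitespace
        ("\r", "&#13;"),   # characters do not
    ]
    for raw, escaped in replacements:
        value = value.replace(raw, escaped)
    return value


def render_input(value: str) -> str:
    """Build an <input> tag with a safely quoted value attribute."""
    return '<input value="' + escape_attribute(value) + '">'
```

A single quote needs no escaping here because the sketch never uses ' as the delimiter; a function that might emit either delimiter would have to escape both.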

For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and the C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all in XML 1.1, even when escaped. Outside XML 1.1 the C0 controls in that set cannot be used at all, not even as character references (XML 1.0 permits but discourages Delete and the C1 controls, while HTML4 disallows them), and there is no good reason you'd ever want to use any of them.
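For XML 1.1 output, the character class quoted above can be used more or less directly. This is a sketch under that assumption; `escape_restricted` is a hypothetical helper that would run after the usual markup escaping.

```python
import re

# The XML 1.1 Restricted Characters: Delete plus the C0 and C1 controls,
# minus tab, LF, CR and NEL.
RESTRICTED = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]")


def escape_restricted(text: str) -> str:
    """Replace restricted characters with hexadecimal character references."""
    if "\x00" in text:
        # U+0000 is not allowed in XML 1.1 even as a character reference.
        raise ValueError("U+0000 cannot be represented in XML 1.1")
    return RESTRICTED.sub(
        lambda match: "&#x{:X};".format(ord(match.group())), text
    )
```

Tab, LF, CR and NEL pass through untouched because they fall outside the class.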

<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->

Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- &lt; -->, it literally means a comment containing "ampersand-letter l-letter t-semicolon" and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
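Since nothing can be escaped, about all a serialiser can do is refuse. A minimal sketch, with `serialise_comment` as an invented name; the extra check for a trailing hyphen reflects the XML comment grammar, under which the data may not end in - either.

```python
def serialise_comment(data: str) -> str:
    """Wrap data in comment delimiters, refusing data that cannot be
    represented because no escaping mechanism exists inside comments."""
    if "--" in data:
        raise ValueError("comment data may not contain '--'")
    if data.endswith("-"):
        # Would produce '--->', which the XML comment grammar forbids.
        raise ValueError("comment data may not end with '-'")
    return "<!--" + data + "-->"


comment = serialise_comment(" &lt; ")
```

The resulting comment holds the five characters of the reference itself, not a less-than sign; the serialiser must not attempt to escape or unescape anything.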

<![CDATA[ sections and <?pi ?> processing instructions in XML also cannot use escaping. The traditional way to serialise a CDATA section that includes a ]]> sequence is to split that sequence over two CDATA sections so that it never occurs together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in its data at all.
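The splitting trick is short enough to show in full. A sketch, with function names made up for illustration:

```python
def serialise_cdata(data: str) -> str:
    """Serialise data as CDATA, splitting any ']]>' over two sections."""
    # Close the current section after ']]' and reopen a new one before
    # '>', so the three characters never appear together in one section.
    return "<![CDATA[" + data.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def serialise_pi(target: str, data: str) -> str:
    """Serialise a processing instruction; '?>' in the data is fatal."""
    if "?>" in data:
        raise ValueError("PI data containing '?>' cannot be serialised")
    return "<?" + target + " " + data + "?>"


split_example = serialise_cdata("a]]>b")
```

A parser reading the split output sees two adjacent CDATA sections whose contents concatenate back to the original string.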

CDATA elements such as <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence, as this would end the element early, and it is then an error if the sequence is not followed by the element's end-tag name. Since no escaping is possible within CDATA elements, this sequence must be avoided and worked around, e.g. by turning document.write('</p>') into document.write('<\/p>'). (You see a lot of more complicated, silly strategies for getting around this one, like calling unescape on a JS %-encoded string, or very often '</scr'+'ipt>', which is still quite invalid.)
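When script source is generated by a program, the backslash workaround can be applied mechanically. This sketch assumes every </ in the source sits inside a JavaScript string or regex literal, where \/ means the same as /; it is not safe for arbitrary code. `make_script_safe` is a hypothetical name.

```python
def make_script_safe(js_source: str) -> str:
    """Break up every '</' so the HTML parser never sees ETAGO."""
    # ASSUMPTION: each '</' occurs inside a JS string or regex literal,
    # where the backslash escape does not change the meaning.
    return js_source.replace("</", "<\\/")


safe = make_script_safe("document.write('</p>');")
script_element = "<script>" + safe + "</script>"
```

The only </ left in the element is the genuine end tag.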

There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.

"The following text will throw errors as HTML 4.01 Strict:"

<a href="http://example.com/file.php?x=1&y=2">foo</a>

Yes, and it's just as much an error in Transitional.

"If you put a space after the &, however, it validates just fine."

Yes, under SGML rules an & followed by anything other than [A-Za-z] or # is not parsed as the start of a reference. It is not a good idea to rely on this, though. (And of course, it's not well-formed in XHTML.)
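For generated links, the simplest approach is to escape every & in the attribute value unconditionally. One way to do that in Python is the standard library's `html.escape`; the `build_link` wrapper is invented for this sketch, and note that `html.escape` also escapes <, > and both quote characters, which is more than the minimum discussed above.

```python
from html import escape


def build_link(url: str, label: str) -> str:
    """Build an <a> element with the URL escaped for a quoted attribute."""
    # quote=True also escapes " and ', so either delimiter would be safe.
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(label))


link = build_link("http://example.com/file.php?x=1&y=2", "foo")
```

The browser decodes &amp; back to & before following the link, so the URL that is requested is unchanged.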
