Do ampersands still need to be encoded in URLs in HTML5?


Question

I learned recently (from these questions) that at some point it was advisable to encode ampersands in href parameters. That is to say, instead of writing:

<a href="somepage.html?x=1&y=2">...</a>

One should write:

<a href="somepage.html?x=1&amp;y=2">...</a>

Apparently, the former example shouldn't work, but browser error recovery means it does.

Is this still the case in HTML5?

We're now past the era of draconian XHTML requirements. Was this a requirement of XHTML's strict handling, or is it really still something that I should be aware of as a web developer?

Solution

It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:

The ampersand (&) may be left unescaped in more cases compared to HTML4.

In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.

In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:

  • a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted or a > if not) ===> then the ampersand is just an ampersand, no worries;
  • a number sign (#) ===> then the HTML5 tokenizer goes through several steps to determine whether it has a numeric character reference, and malformed input here can produce parse errors (do read the spec);
  • any other character ===> the parser will try to find a named character reference, e.g., something like &notin;.
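The three branches above can be sketched in a few lines of Python. This is only an illustration of the dispatch described in the list, not the spec's tokenizer; the function name classify_after_ampersand and the use of None to stand for EOF are assumptions made for this example.

```python
def classify_after_ampersand(next_char, additional_allowed=None):
    """Say which branch applies to the character that follows '&'.

    next_char: the character after the ampersand, or None for EOF
               (None-as-EOF is a convention of this sketch).
    additional_allowed: the closing quote of a quoted attribute value,
                        or '>' for an unquoted one.
    """
    # Characters after which the ampersand is just a literal ampersand.
    plain = {"\t", "\n", "\f", " ", "<", "&", None}
    if next_char in plain:
        return "literal ampersand"
    if additional_allowed is not None and next_char == additional_allowed:
        return "literal ampersand"
    if next_char == "#":
        return "numeric character reference attempt"
    return "named character reference attempt"


# The question's href is double-quoted and has 'y' after the ampersand.
print(classify_after_ampersand("y", '"'))
```

For the href in the question, the character after the ampersand is y, so the third branch is the one that matters.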

The last case is the one of interest to you since your example has:

<a href="somepage.html?x=1&y=2">...</a>

You have the character sequence

  • AMPERSAND
  • LATIN SMALL LETTER Y
  • EQUAL SIGN

Now here is the part from the HTML5 spec that is relevant in your case, because y is not a named entity reference:

If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.

You don't have a semicolon there, so you don't have a parse error.
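One way to convince yourself that nothing matches is to look the text up in a copy of the named-reference table. The sketch below uses the html5 dictionary from Python's standard html.entities module as a stand-in for the spec's table; the helper name longest_reference_prefix is made up for this example.

```python
from html import unescape
from html.entities import html5  # Python's copy of the HTML5 named-reference table


def longest_reference_prefix(after_amp):
    """Return the longest prefix of `after_amp` found in the table, or None."""
    for end in range(len(after_amp), 0, -1):
        if after_amp[:end] in html5:
            return after_amp[:end]
    return None


# Nothing starting at "y=2" is a named reference, so no characters are consumed.
print(longest_reference_prefix("y=2"))
# A reference that does exist, for contrast.
print(longest_reference_prefix("notin;"))
# Python's decoder leaves the question's URL untouched as well.
print(unescape("somepage.html?x=1&y=2"))
```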

Now suppose you had, instead,

<a href="somepage.html?x=1&eacute=2">...</a>

which is different because &eacute; is a named entity reference in HTML. In this case, the following rule kicks in:

If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.

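So here the = makes it a parse error, because legacy browsers might get confused. Either way, the matched characters are unconsumed, so the attribute value still contains the literal text &eacute=2.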
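The quoted rule can be written down as a small decision function. This is a simplified sketch of that one rule only: it ignores numeric references and the other parse-error cases, it uses Python's html.entities.html5 dictionary as the reference table, and the name attribute_outcome is hypothetical.

```python
from html.entities import html5  # Python's copy of the HTML5 named-reference table


def attribute_outcome(after_amp):
    """Apply the quoted attribute rule to the text that follows '&'."""
    # Find the longest prefix that is a named reference.
    for end in range(len(after_amp), 0, -1):
        name = after_amp[:end]
        if name in html5:
            break
    else:
        return "no match"

    if not name.endswith(";"):
        next_char = after_amp[end:end + 1]
        if next_char == "=":
            return "unconsumed, parse error"
        if next_char.isascii() and next_char.isalnum():
            return "unconsumed"
    return "decoded"


print(attribute_outcome("eacute=2"))   # the example above
print(attribute_outcome("eacute;=2"))  # with the semicolon it is a real reference
print(attribute_outcome("y=2"))        # the question's URL
```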

The HTML5 spec goes to great lengths to say, in effect, "this ampersand is not beginning a character reference, so there's no reference here." Even so, you might run into URLs whose parameter names match named references (e.g., isin, part, sum, sub), which would result in parse errors, so IMHO you're better off still writing &amp;. But of course, you only asked whether restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.
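If you generate markup from code, escaping is cheap. Here is a minimal sketch using Python's standard html.escape and html.unescape; the helper name href_attribute is an assumption for this example. Note that html.unescape does not apply the attribute-specific exception quoted above, so it also shows how a decoder without that exception can misread the unescaped URL.

```python
from html import escape, unescape


def href_attribute(url):
    """Escape a URL for use inside a quoted href attribute."""
    return escape(url, quote=True)


url = "somepage.html?x=1&eacute=2"
escaped = href_attribute(url)

print(escaped)            # the ampersand is written as &amp;
print(unescape(escaped))  # decodes back to the original URL
print(unescape(url))      # without escaping, this decoder turns &eacute into a letter
```

Escaping every ampersand means you never have to check whether a parameter name happens to collide with a named reference.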

It would be interesting to see what validators can do.
