Why is "&reg" being rendered as "®" without the bounding semicolon


Problem description


I've been running into a problem that was revealed through our Google AdWords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "region" parameter is coming through incorrectly. What should be:

http://ravercats.com/meow?foo=bar&region=catnip

is instead coming through as:

http://ravercats.com/meow?foo=bar®ion=catnip

I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:

&VALUE;

where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the &reg entity, and it's wreaking all kinds of havoc throughout our system.

Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it.) I'm trying to figure out what is common across browsers that would make this happen, hence my search for the DTD.

Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:

<html>
<a href="http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct">http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct</a>
</html>

EDIT: To everyone who's suggesting that I need to escape the entire URL: the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening?"

Solution

Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognised by modern browsers' HTML parsers.
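The behaviour is easy to reproduce outside a browser. The sketch below assumes Python's standard html.unescape, which is documented as following the HTML5 rules for character references, and feeds it the query string from the question's proof. It decodes the string as ordinary text, not as an attribute value, so every semicolon-less &reg is replaced.

```python
import html

# Query string from the proof in the question, with unescaped ampersands.
query = "foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct"

# html.unescape recognises the legacy semicolon-less names, so each "&reg"
# prefix is decoded, while "&trademark" is left alone because "trade" is
# only a valid reference when written with its semicolon ("&trade;").
decoded = html.unescape(query)
print(decoded)
```

Only the three parameters that start with a legacy name are damaged; &trademark survives intact.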

Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space), or otherwise you always escape & as &amp; whenever in doubt.
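If you generate the markup yourself, the "always escape" option is a one-liner. This is a minimal sketch using Python's standard html.escape; the URL is the example from the question, and the variable names are only illustrative.

```python
import html

url = "http://foo.com/bar?foo=bar&region=US"

# Escape the ampersands before placing the URL in an attribute or in text.
escaped = html.escape(url, quote=True)
anchor = f'<a href="{escaped}">{escaped}</a>'
print(anchor)

# An HTML parser turns "&amp;" back into "&", so the original URL survives.
assert html.unescape(escaped) == url
```

This does not help when the markup comes from a third party, as in the question, but it prevents the problem in pages you control.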

For reference, the full list of named character references that are recognised without a semicolon is:

AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil, ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT, Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN, Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig, agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy, curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14, frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt, macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf, ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg, sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc, ugrave, uml, uuml, yacute, yen, yuml
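Rather than copying the list by hand, it can be derived from a reference table. The sketch below assumes Python's html.entities.html5 dictionary, whose documentation states that names accepted without a semicolon appear in the table both with and without the trailing ";".

```python
from html.entities import html5

# Keys that lack the trailing semicolon are the legacy names that
# parsers accept even when the semicolon is missing.
legacy = sorted(name for name in html5 if not name.endswith(";"))

print(", ".join(legacy))
```

Checking whether a parameter name is at risk then reduces to testing whether it starts with one of these names.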

Note, however, that inside an attribute value a conforming HTML5 parser does not treat a semicolon-less named character reference from the list above as a character reference if the next character is = or an ASCII alphanumeric character.
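The attribute-value exception can be sketched in a few lines. The decode function below is a hypothetical, simplified illustration, not a full HTML tokenizer: it handles named references only, takes the longest matching name from Python's html.entities.html5 table, and applies the rule described above when in_attribute is true.

```python
from html.entities import html5

MAX_NAME = max(len(name) for name in html5)

def decode(text: str, in_attribute: bool) -> str:
    """Decode named character references (simplified sketch)."""
    out, i = [], 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        # Find the longest known name that starts right after the "&".
        match = None
        for end in range(min(len(text), i + 1 + MAX_NAME), i + 1, -1):
            if text[i + 1:end] in html5:
                match = text[i + 1:end]
                break
        if match is None:
            out.append("&")
            i += 1
            continue
        nxt = text[i + 1 + len(match):i + 2 + len(match)]
        followed = nxt == "=" or (nxt.isascii() and nxt.isalnum())
        if in_attribute and not match.endswith(";") and followed:
            # Attribute exception: leave the ampersand as literal text.
            out.append("&")
            i += 1
            continue
        out.append(html5[match])
        i += 1 + len(match)
    return "".join(out)

print(decode("foo=bar&region=US", in_attribute=True))
print(decode("foo=bar&region=US", in_attribute=False))
print(decode("foo=bar&reg_test=fail", in_attribute=True))
```

Under this rule &region and &register survive inside an href, because the character after "reg" is a letter, while &reg_test is still decoded there, because an underscore is neither = nor alphanumeric.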

For the full list of named character references, with or without ending semicolons, see the named character references table in the HTML specification.
