Why is the dollar sign no longer "intended for use only in mechanically generated code?"


Question

In ECMA-262, 3rd edition [PDF], under section 7.6 ("Identifiers," page 26), we see the following note:

The dollar sign is intended for use only in mechanically generated code.

That seems reasonable. Many languages commonly used for generating or embedding JavaScript hold a special meaning for $, and using it in JavaScript identifiers within those languages leads to unexpected behavior.
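A minimal sketch of this kind of pitfall, using JavaScript's own `String.prototype.replace` as a stand-in for the generating language (an assumption; the question most likely has other host languages in mind). The template text and the `NAME` placeholder are hypothetical. In a replacement string, `$$` means "a literal `$`", so an intended identifier `$$tmp` silently comes out as `$tmp`:

```javascript
// Hypothetical template for a tiny code generator; NAME is a placeholder we invented.
const template = "var NAME = 42; return NAME;";

// Naive generation: in a replacement *string*, "$$" is an escape for a single "$",
// so the intended identifier "$$tmp" is emitted as "$tmp".
const naive = template.replace(/NAME/g, "$$tmp");

// Safer generation: the return value of a replacer *function* is inserted verbatim.
const safe = template.replace(/NAME/g, () => "$$tmp");

console.log(naive); // var $tmp = 42; return $tmp;
console.log(safe);  // var $$tmp = 42; return $$tmp;

// Both outputs are still valid programs, because $ is legal anywhere in an identifier.
const naiveResult = new Function(naive)();
const safeResult = new Function(safe)();
console.log(naiveResult, safeResult); // 42 42
```

The generated code runs either way, which is what makes the renamed identifier easy to miss.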

The "mechanically generated clause" appeared in edition 2. In edition 1, it was not present. As of edition 5, it disappears again without explanation, and it remains absent from the working draft of the 6th edition.

If I had to guess, I'd assume it was originally omitted because the potential pitfalls hadn't been considered, and was then added in the next edition when it became clear that it was causing problems. I can't think of a good reason for removing it again in edition 5, though.

Is there any explanation for the inclusion and subsequent removal of the "mechanically generated clause" from the specification (a "paper trail" from mailing lists, newsgroups, or elsewhere)? I can't find this documented anywhere.


As a side question, can anyone explain the rationale behind including zero-width characters in the edition 6 draft? This seems like it will cause even more trouble, given that you can't see those characters at all, and I can't think of any reason you'd want those characters in an identifier.


Update: The initial inclusion of the "mechanically generated code" note and the inclusion of zero-width characters are explained in codewaggle's answer below. The only thing remaining to be answered is the primary focus of this question, the removal of the "mechanically generated code" note.

Solution

Here's a start: Subject: SC22 N2745 - Disposition of Comments Report on DIS 16262 - ECMAScript

It appears that "should only be used for mechanically-generated code" was added because that was the spec for Java.

D6) 7.5: DOLLAR SIGN should not be in the identifier list, according to recommendations in TR 10176. 7.5 should refer to the "i18n" specification of ISO/IEC 14652 for definitions of letters and digits.

>>>>>> Action: Partial acceptance --- ECMAScript follows Java precedent. A comment will add that $ should only be used for mechanically-generated code. <<<<<

If you want to slog through the minutes of past meetings, you can look here:
ecmascript wiki: Notes and Minutes from past meetings


About later changes:
All of this is from the mailing list "es5-discuss -- Discussion of ECMAScript 3.x".

ZWNJ and ZWJ in identifiers (was: Comments on April ES5 final draft standard tc39-2009-025)

John Cowan wrote:

It turns out that Unicode 5.1 has done the heavy lifting: the bad news is that the lifting is indeed heavy. You want to allow Cf characters if and only if they actually make a semantic distinction in contemporary use. That turns out, says Unicode 5.1, to allow only U+200C and U+200D and then only in certain contexts: the rules involve knowing the Script and Joining_Type properties of nearby identifier characters. Details at http://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters

David-Sarah Hopwood replied:

What is the down-side of simply adding U+200C and U+200D to IdentifierPart without any additional context-sensitive rules?

I think that it is the combined responsibility of input methods and of programmers to ensure that <ZWNJ> and <ZWJ> characters are used as intended in identifiers; all that a programming language syntax needs to do is to allow them.

Note that the goal of "excluding as many cases as possible where no visible distinction results" (supposedly for security reasons) is not really applicable, since ECMAScript does not enforce even NFC normalization. To not enforce NFC but to add considerable complexity to the grammar, as UTR #31 suggests, in order to prevent some potential (but relatively harmless, AFAICS) misuses of <ZWNJ> and <ZWJ>, seems like an inconsistent set of design choices to me.
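To make the two points in this reply concrete, here is a small sketch, assuming an engine that implements ES5 or later (for example a current Node.js). All variable names are illustrative. It shows that an identifier containing U+200C is a different name from the one without it, and that identifiers are compared without NFC normalization:

```javascript
// U+200C ZERO WIDTH NON-JOINER, written as an escape so it stays visible here.
const ZWNJ = "\u200C";

// Two declarations that render identically in most fonts but are distinct names.
const zwnjSource = `var ab = 1; var a${ZWNJ}b = 2; return [ab, a${ZWNJ}b];`;
const zwnjPair = new Function(zwnjSource)();
console.log(zwnjPair); // [ 1, 2 ]

// Identifiers are not NFC-normalized either: precomposed U+00E9 and
// "e" followed by U+0301 (combining acute accent) name two different variables.
const nfcSource =
  "var \u00E9 = 'precomposed'; var e\u0301 = 'decomposed'; return [\u00E9, e\u0301];";
const nfcPair = new Function(nfcSource)();
console.log(nfcPair); // [ 'precomposed', 'decomposed' ]

// The two spellings only compare equal after explicit normalization.
const sameAfterNfc = "e\u0301".normalize("NFC") === "\u00E9";
console.log(sameAfterNfc); // true
```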


This one pulls a bunch of discussion together: Last call for consensus on format-control char. issues

There are 15 replies to this; you'll probably want to read through them:
https://mail.mozilla.org/pipermail/es5-discuss/2009-June/thread.html#2832

Allen Wirfs-Brock wrote:

Waldemar's notes from the May F2F don't record any decision on the issue of <ZWNJ> and <ZWJ> in identifiers. However, my personal notes say that I need to "keep in identifiers and fix grammar" which is also my recollection of what we decided at the meeting.

The simplest implementation of that decision is to simply add <ZWNJ> and <ZWJ> as alternatives for IdentifierPart. In addition, the text in section 7.1 that says that format control characters can occur in identifiers presumably needs to be narrowed to say only <ZWNJ> and <ZWJ>.

At about the same time as the F2F David-Sarah made a more comprehensive proposal (duplicated below) that in addition to addressing <ZWNJ> and <ZWJ> also significantly refines the rules for <BOM> including excluding them from string literals and regular expressions and making it a syntax error for a <BOM> to appear within an identifier.

I'm not a Unicode expert, but my sense is that David-Sarah's proposal is sound and probably consistent with the original goals of cleaning up class Cf in the specification. However, his rules for <BOM> also seem like they could significantly complicate the lexical analysis phase of implementations.

My sense from the F2F is that the consensus was more in the direction of my simple solution above (<ZWNJ> and <ZWJ> in identifiers, <BOM> is whitespace) rather than David-Sarah's more comprehensive treatment of <BOM>.

I need to have a final decision on this so I can update the draft accordingly. Based upon my recollection of the F2F I'm going to go with the "simple solution" unless there is apparent consensus otherwise.

Final thoughts?
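The "simple solution" in this message matches what the published ES5 grammar specifies: <ZWNJ> and <ZWJ> are alternatives for IdentifierPart, and <BOM> is listed under WhiteSpace. A short sketch of the observable behavior, assuming an ES5-or-later engine; the `parses` helper and the variable names are hypothetical:

```javascript
const BOM = "\uFEFF"; // U+FEFF, written as an escape so it is visible

// Between tokens a BOM behaves like any other whitespace character.
const bomSum = new Function(`return 1 +${BOM} 2;`)();
console.log(bomSum); // 3

// Because it is WhiteSpace, the \s character class matches it too.
const bomIsSpace = /\s/.test(BOM);
console.log(bomIsSpace); // true

// Hypothetical helper: does this source text parse as a function body?
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

// <ZWNJ> is accepted after the first identifier character, but not as the first one.
const zwnjTrailing = parses("var x\u200C = 1;");
const zwnjLeading = parses("var \u200Cx = 1;");
console.log(zwnjTrailing, zwnjLeading); // true false

// A BOM inside a name splits it into two tokens ("fo" and "o"), which does not parse here.
const bomInsideName = parses(`var fo${BOM}o = 1;`);
console.log(bomInsideName); // false
```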

The message he replied to, broken into chunks based on the message quoting:

-----Original Message-----
From: es5-discuss-bounces at mozilla.org [mailto:es5-discuss-bounces at mozilla.org] On Behalf Of David-Sarah Hopwood
Sent: Thursday, May 28, 2009 5:44 PM
To: es5-discuss at mozilla.org
Subject: Grammar for IdentifierName does not allow <ZWNJ> and <ZWJ>

John Cowan wrote:

David-Sarah Hopwood scripsit:

The omission of format-control characters from <IdentifierName> appears to be just an oversight.

-1

Break

Indeed, I had forgotten that we had already discussed this and come to a different conclusion:

https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002432.html
https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html

Break

Allowing all of them causes the same kinds of problems as allowing BOM. Most of them have little visible effect on the surrounding text (especially Latin-script text) even in fully conformant Unicode renderers, never mind renderers that muffle them. The result is that "foobar" and "foo<Cf>bar" look the same but aren't.

Per Unicode 5.1, the only ones that actually affect the natural-language meaning of identifiers are U+200C ZWNJ and U+200D ZWJ. These are the only ones which should even be considered in ES5 identifiers. UAX #31 (which is included by reference in Unicode 5.1) specifies narrower conditions in which ZWNJ and ZWJ are essential; sticking to the conditions is non-trivial, but minimizes the chance of spoofing.

Given the risks, I'm uncertain whether ZWNJ and ZWJ should be allowed or not.

Break

Forget trying to minimize identifier spoofing as a security risk. That's not possible, if Unicode identifiers are to be allowed at all. It is an inherent characteristic of Unicode that many distinct (even when normalized) strings will look the same. It is not at all clear that this is a genuine security risk for general programming -- as opposed to situations that require adversarial code review, which full ECMAScript is a long way from being able to support.

What is useful to attempt to minimize is the chance of accidentally typing identifiers that are distinct but look the same, or of seeing an identifier and being unable to reliably reproduce it. This is a usability issue, not a security issue.
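As an aside to the quoted message, the usability problem it describes (look-alike names that cannot be reliably retyped) can at least be made visible with tooling. A minimal sketch, assuming an ES2015-or-later runtime; `describeCodePoints` is a hypothetical helper, not something proposed in the thread:

```javascript
// Hypothetical helper: spell out a name as a list of U+XXXX code points.
function describeCodePoints(name) {
  return Array.from(name, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

const plainName = "ab";          // Latin a, Latin b
const zwnjName = "a\u200Cb";     // Latin a, ZWNJ, Latin b
const cyrillicName = "\u0430b";  // Cyrillic a (U+0430), Latin b

console.log(describeCodePoints(plainName));    // U+0061 U+0062
console.log(describeCodePoints(zwnjName));     // U+0061 U+200C U+0062
console.log(describeCodePoints(cyrillicName)); // U+0430 U+0062

// NFC normalization does not merge these look-alikes; they stay distinct strings.
const stillDistinct =
  cyrillicName.normalize("NFC") !== plainName.normalize("NFC") &&
  zwnjName.normalize("NFC") !== plainName.normalize("NFC");
console.log(stillDistinct); // true
```

A check like this in a linter or code-review tool addresses accidental confusion, which is the case the message argues is worth minimizing.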

For usability, it may indeed be a good approach to allow <ZWNJ> and <ZWJ> but disallow other format-control characters. I am not sufficiently familiar with the scripts that require these characters to be sure of that, but it seems reasonable based on their descriptions in the Unicode standard.

However, the complicated script-dependent rules described in UAX #31 for restricting the contexts in which <ZWNJ> and <ZWJ> can occur, seem quite over-the-top given the impossibility of preventing spoofing. Again, see https://mail.mozilla.org/pipermail/es5-discuss/2009-April/002435.html.

Combining the proposal from that post with the changes for <NEL>, <ZWSP> and <BOM> (since both affect section 7.1), we end up with this.

====
Changes to section 7.2:
- revert the addition of <NEL>, <ZWSP>, and <BOM> to WhiteSpace and to the table.

Changes to section 7.8.4:

DoubleStringCharacter ::
    SourceCharacter but not double-quote " or backslash \ or LineTerminator or <BOM>
    \ EscapeSequence
    LineContinuation

SingleStringCharacter ::
    SourceCharacter but not single-quote ' or backslash \ or LineTerminator or <BOM>
    \ EscapeSequence
    LineContinuation

NonEscapeCharacter ::
    SourceCharacter but not EscapeCharacter or LineTerminator or <BOM>

  • The CV of DoubleStringCharacter :: SourceCharacter but not double-quote " or backslash \ or LineTerminator or <BOM> is the SourceCharacter character itself

  • The CV of SingleStringCharacter :: SourceCharacter but not single-quote ' or backslash \ or LineTerminator or <BOM> is the SourceCharacter character itself.

  • The CV of NonEscapeCharacter :: SourceCharacter but not EscapeCharacter or LineTerminator or <BOM> is the SourceCharacter character itself.

Replace section 7.1:

7.1 Unicode Format-Control Characters

The Unicode format-control characters (i.e., the characters in General Category "Cf" in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this, such as mark-up languages.

<BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files.

In ECMAScript source, <BOM> characters are ignored if they appear immediately before or after a token, or within a span of consecutive WhiteSpace characters (7.2). The lexical grammar does not explicitly include such ignored <BOM> characters. It is a syntax error for a <BOM> character to appear within a token (that is, if removing the <BOM> would result in the preceding and following characters being part of the same token).

Note that comments are not tokens, and so the above rule allows <BOM> characters to appear within comments. It does not allow them to appear within string literals or regular expression literals (the escape sequence \uFEFF should be used instead).

It is useful to allow other format-control characters in source text to facilitate editing and display. Format-control characters other than <BOM> may be used within comments, string literals, and regular expression literals. Two specific format-control characters, <ZWNJ> and <ZWJ>, may also be used in an identifier after the first character.

  Code Unit Value    Name                                                           Formal name
  \u200C             Zero width non-joiner                                          <ZWNJ>
  \u200D             Zero width joiner                                              <ZWJ>
  \uFEFF             Byte order mark (also called zero-width non-breaking space)    <BOM>

Changes to section 7.6:

[...] This standard specifies specific character additions: The dollar sign ($) and the underscore (_) are permitted anywhere in an identifier. <ZWNJ> and <ZWJ> are permitted after the first character.

Changes to section 7.8.5:

RegularExpressionNonTerminator ::
    SourceCharacter but not LineTerminator or <BOM>

Changes to Annex A:
- update all productions changed above.

Changes to Annex E:
- add to the entry for section 7.1: characters are ignored between tokens and in comments, but are not allowed within tokens (including string and regular expression literals). <ZWNJ> and <ZWJ> are significant within identifiers rather than being stripped.
- delete the entries for sections 7.2 and 15.10.2.12.
  (Reverting the additions of <NEL>, <ZWSP>, and <BOM> to the WhiteSpace production also reverts this for the \s character class, without any explicit change to section 15.10.2.12.)

-- David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com


es5-discuss mailing list
es5-discuss at mozilla.org
https://mail.mozilla.org/listinfo/es5-discuss


I'm not going to try to pull all this together and give you a succinct answer. Maybe someone else will, and you can accept that as the answer; look at this as a starting point.

One last link:
The August 2009 archive has the initial draft and release candidate 1 discussions for ES5.
