为什么Java允许控制字符在其标识符中? [英] Why does Java allow control characters in its identifiers?

查看:109
本文介绍了为什么Java允许控制字符在其标识符中?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在精确探索Java标识符中允许哪些字符时,我偶然发现了一些非常好奇的东西,似乎几乎肯定是一个bug。

In exploring precisely which characters were permitted in Java identifiers, I have stumbled upon something so extremely curious that it seems nearly certain to be a bug.

我希望发现Java标识符符合要求它们以具有Unicode属性 ID_Start ,然后是具有 ID_Continue 属性的那些,并且为前导下划线和美元符号授予例外。事实证明并非如此,而且我发现与我听说过的普通标识符或其他任何其他想法极为不同。

I’d expected to find that Java identifiers conformed to the requirement that they start with characters that have the Unicode property ID_Start and are followed by those with the property ID_Continue, with an exception granted for leading underscores and for dollar signs. That did not prove to be the case, and what I found is at extreme variance with that or any other idea of a normal identifier that I have heard of.

考虑以下演示,证明Java标识符中允许使用ASCII ESC字符(八进制033):

Consider the following demonstration proving that an ASCII ESC character (octal 033) is permitted in Java identifiers:

$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[

尽管如此,情况甚至更糟。事实上,几乎无限糟糕。甚至允许NULL!还有数千个甚至不是标识符字符的其他代码点。我在Solaris,Linux和运行Darwin的Mac上测试了这个,并且都给出了相同的结果。

It’s even worse than that, though. Almost infinitely worse, in fact. Even NULLs are permitted! And thousands of other code points that are not even identifier characters. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.

这是一个测试程序,它将显示Java非常不允许作为合法标识符名称的一部分的所有这些意外代码点。

Here is a test program that will show all these unexpected code points that Java quite outrageuosly allows as part of a legal identifier name.

#!/usr/bin/env perl
# 
# test-java-idchars - find which bogus code points Java allows in its identifiers
# 
#   usage: test-java-idchars [low high]
#   e.g.:  test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000.  You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# tchrist@perl.com
# Sat Jan 29 10:41:09 MST 2011

use strict;
use warnings;

use encoding "Latin1";
use open IO => ":utf8";

use charnames ();

$| = 1;

my @legal;

my ($start, $stop) = (0, 0x1000);

if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) { 
            $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) { 
            $_ = oct if /^0/;
        }
    } 
    else {
        die "usage: $0 [ [start] stop ]\n";
    } 
} 

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);

    next if $char =~ /[\s\w]/;

    my $type = "?";
    for ($char) {
        $type = "Letter"      if /\pL/;
        $type = "Mark"        if /\pM/;
        $type = "Number"      if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol"      if /\pS/;
        $type = "Separator"   if /\pZ/;
        $type = "Control"     if /\pC/;
    } 
    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;
    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing \\p{$type} $msg...";
    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

print TESTPROGRAM <<"End_of_Java_Program";

public class testchar { 
    public static void main(String argv[]) { 
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char); 
    }
}

End_of_Java_Program

    close(TESTPROGRAM) || die $!;

    system q{
        ( javac -encoding UTF-8 testchar.java \
            && \
          java -Dfile.encoding=UTF-8 testchar | grep variable \
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit;  # from a ^C
    } 

    printf "is %s in Java identifiers.\n",  
        ($? == 0) ? uc "legal" : "forbidden";

} 

END { 
    print "Legal but evil code points: @legal\n";
}

以下是仅在前33个代码点上运行该程序的示例既不是空格也不是标识符字符:

Here is a sample of running that program on just the first 33 code points that are neither whitespace nor identifier characters:

$ perl test-java-idchars 0 0x20
testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing \p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B

这是另一个演示:

$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD



问题



任何人都可以解释这个看似疯狂的行为吗?在整个地方有很多很多其他令人费解的许可代码点,从U + 0000开始,这可能是最奇怪的。如果在第一个0x1000代码点上运行它,则会看到某些模式,例如允许使用属性 Current_Symbol 的任何和所有代码点。但是太多其他事情是完全无法解释的,至少是我。

The Question

Can anyone please explain this seemingly insane behavior? There are many, many, many other inexplicably permitted code points all over the place, starting right off with U+0000, which is perhaps the strangest of all. If you run it on the first 0x1000 code points, you do see certain patterns appear, such as permitting any and all code points with the property Current_Symbol. But too much else is wholly inexplicable, at least by me.

推荐答案

Java语言规范第3.8节推迟到 Character.isJavaIdentifierStart() Character.isJavaIdentifierPart()。除其他条件外,后者有 Character.isIdentifierIgnorable( ),允许非空白控制字符(包括整个C1范围,请参阅列表的链接)。

The Java Language Specification section 3.8 defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter, among other conditions, has Character.isIdentifierIgnorable(), which allows non-whitespace control characters (including whole C1 range, see the link for the list).

这篇关于为什么Java允许控制字符在其标识符中?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆