Can a valid Unicode string contain FFFF? Is Java/CharacterIterator broken?

Question

Here's an excerpt from java.text.CharacterIterator documentation:



  • This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are used for iteration. They return DONE if [...], signaling that the iterator has reached the end of the sequence.

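static final char DONE: Constant that is returned when the iterator has reached either the end or the beginning of the text. The value is '\uFFFF', the "not a character" value which should not occur in any valid Unicode string.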

The italicized part is what I'm having trouble understanding, because from my tests, it looks like a Java String can most certainly contain \uFFFF, and there doesn't seem to be any problem with it, except obviously with the prescribed CharacterIterator traversal idiom that breaks because of a false positive (e.g. next() returns '\uFFFF' == DONE when it's not really "done").

Here's a snippet to illustrate the "problem" (see also on ideone.com):

import java.text.*;
public class CharacterIteratorTest {

    // this is the prescribed traversal idiom from the documentation
    public static void traverseForward(CharacterIterator iter) {
       for(char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
          System.out.print(c);
       }
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";

        System.out.println(s);
        // abc?def

        System.out.println(s.indexOf('\uFFFF'));
        // 3

        traverseForward(new StringCharacterIterator(s));
        // abc
    }
}

So what's going on here?


  • Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?
  • Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?
  • Is it actually true that valid Unicode strings should not contain \uFFFF?
  • If that's true, then is Java "broken" for violating the Unicode specification by (for the most parts) allowing String to contain \uFFFF anyway?

Recommended answer

EDIT (2013-12-17): Peter O. brings up an excellent point below, which renders this answer wrong. Old answer below, for historical accuracy.

To answer your questions:

No. U+FFFF is a so-called non-character. From Section 16.7 of the Unicode Standard:


Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data.

...

The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF.
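The two rules in that definition are mechanical enough to check in code. The following is only a sketch: the class and method names (NoncharacterCheck, isNoncharacter, countNoncharacters) are made up for illustration and are not part of the JDK. It tests a code point against both rules and counts how many code points match.

```java
public class NoncharacterCheck {

    // Hypothetical helper, not a JDK method: true if cp is one of the
    // noncharacter code points described in the quoted section.
    public static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > Character.MAX_CODE_POINT) {
            return false; // outside the Unicode code space entirely
        }
        // Rule 1: the last two code points of every plane (xxFFFE, xxFFFF).
        boolean lastTwoOfPlane = (cp & 0xFFFE) == 0xFFFE;
        // Rule 2: the contiguous BMP range U+FDD0..U+FDEF.
        boolean inBmpRange = cp >= 0xFDD0 && cp <= 0xFDEF;
        return lastTwoOfPlane || inBmpRange;
    }

    // Walks the whole code space (U+0000..U+10FFFF) and counts the matches.
    public static int countNoncharacters() {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isNoncharacter(0xFFFF));  // true
        System.out.println(isNoncharacter('a'));     // false
        System.out.println(countNoncharacters());    // 66
    }
}
```

The count works out to 17 planes × 2 = 34 plus the 32 code points in U+FDD0..U+FDEF, which matches the 66 stated in the standard.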



Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?



Not quite. Applications are allowed to use those code points internally in any way they want. Quoting the standard again:


Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.
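The "replace, don't delete" advice in that passage could look roughly like the sketch below. The names NoncharacterSanitizer, isNoncharacter and replaceNoncharacters are hypothetical, and the choice to substitute one U+FFFD per noncharacter is an assumption about what "appropriate action" means for a given application.

```java
public class NoncharacterSanitizer {

    // Hypothetical helper: last two code points of each plane,
    // plus the BMP range U+FDD0..U+FDEF.
    public static boolean isNoncharacter(int cp) {
        return (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF);
    }

    // Replaces each noncharacter with U+FFFD REPLACEMENT CHARACTER
    // rather than deleting it, so the position of the problem stays visible.
    public static String replaceNoncharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp ->
                out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("abc\uFFFFdef");
        System.out.println(cleaned.equals("abc\uFFFDdef")); // true
        System.out.println(cleaned.length());               // 7
    }
}
```

Iterating with codePoints() rather than char by char means noncharacters outside the BMP, such as U+1FFFE, are seen as single code points and replaced as a unit.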

So while you should never encounter such a string from the user, another application or a file, you may well put it into a Java String if you know what you're doing (though this basically means that you cannot use the CharacterIterator on that string).
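Strictly speaking, it is the DONE-comparison idiom that produces the false positive. If such a string does have to be traversed with a CharacterIterator, one possible workaround is to bound the loop by index instead. The sketch below assumes the documented behaviour that next() leaves the index at getEndIndex() once the end is passed; the class name IndexBoundedTraversal is made up for illustration.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class IndexBoundedTraversal {

    // Stops when the index reaches the end index, so a U+FFFF inside
    // the text is treated as an ordinary char rather than as DONE.
    public static String traverseForward(CharacterIterator iter) {
        StringBuilder seen = new StringBuilder();
        for (char c = iter.first();
             iter.getIndex() < iter.getEndIndex();
             c = iter.next()) {
            seen.append(c);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";
        String visited = traverseForward(new StringCharacterIterator(s));
        System.out.println(visited.length());  // 7
        System.out.println(visited.equals(s)); // true
    }
}
```

Unlike the documented idiom, which stops after "abc" for this input, the index-bounded loop visits all seven chars. It also handles an empty string, because the begin and end indices are then equal and the loop body never runs.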

As quoted above, any string used for interchange must not contain them. Within your application you're free to use them in whatever way you want.

Of course, a Java char, being just a 16-bit unsigned integer, doesn't really care about the value it holds either.
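A small sketch of that point, using only standard JDK constants (the class name CharHoldsAnyValue and the asInt helper are made up for illustration):

```java
import java.text.CharacterIterator;

public class CharHoldsAnyValue {

    // Widening a char to int exposes its unsigned 16-bit value.
    public static int asInt(char c) {
        return c;
    }

    public static void main(String[] args) {
        char c = '\uFFFF';
        System.out.println(asInt(c));                      // 65535
        System.out.println(c == Character.MAX_VALUE);      // true
        System.out.println(c == CharacterIterator.DONE);   // true
        System.out.println("abc\uFFFFdef".charAt(3) == c); // true
    }
}
```

Nothing in char or String validates the value, which is why the sentinel and a real U+FFFF in the text are indistinguishable to the documented traversal idiom.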

No. In fact, the section on noncharacters even suggests the use of U+FFFF as a sentinel value:


In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.

U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.

CharacterIterator follows this in that it returns U+FFFF when no more characters are available. Of course, this means that if you have another use for that code point in your application you may consider using a different non-character for that purpose since U+FFFF is already taken – at least if you're using CharacterIterator.
