Can a valid Unicode string contain FFFF? Is Java/CharacterIterator broken?

Question

Here's an excerpt from java.text.CharacterIterator documentation:

  • This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are used for iteration. They return DONE if [...], signaling that the iterator has reached the end of the sequence.

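static final char DONE: Constant that is returned when the iterator has reached either the end or the beginning of the text. The value is '\uFFFF', the "not a character" value which should not occur in any valid Unicode string.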

The final clause of that description (italicized in the original) is what I'm having trouble understanding, because from my tests it looks like a Java String can most certainly contain \uFFFF, and there doesn't seem to be any problem with it, except obviously with the prescribed CharacterIterator traversal idiom, which breaks because of a false positive (e.g. next() returns '\uFFFF' == DONE when it's not really "done").

Here's a snippet to illustrate the "problem" (see also on ideone.com):

import java.text.*;

public class CharacterIteratorTest {

    // this is the prescribed traversal idiom from the documentation
    public static void traverseForward(CharacterIterator iter) {
        for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
            System.out.print(c);
        }
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";

        System.out.println(s);
        // abc?def

        System.out.println(s.indexOf('\uFFFF'));
        // 3

        traverseForward(new StringCharacterIterator(s));
        // abc
    }
}

So what's going on here?

  • Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?
  • Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?
  • Is it actually true that valid Unicode strings should not contain \uFFFF?
  • If that's true, then is Java "broken" for violating the Unicode specification by (for the most part) allowing String to contain \uFFFF anyway?

Recommended answer

EDIT (2013-12-17): Peter O. brings up an excellent point below, which renders this answer wrong. Old answer below, for historical accuracy.

Answering your questions:

Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?

No. U+FFFF is a so-called noncharacter. From Section 16.7 of the Unicode Standard:

Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data.

...

The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF.
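The arithmetic in that passage (2 code points in each of 17 planes, plus the 32 in U+FDD0..U+FDEF) can be checked with a small sketch. The class and method names here are hypothetical; nothing like `isNoncharacter` is assumed to exist in the JDK.

```java
public class NoncharacterCount {

    // True for the code points the quoted passage lists as noncharacters.
    // This is a hand-written helper, not a JDK method.
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > Character.MAX_CODE_POINT) {
            return false;
        }
        boolean inBmpRange = cp >= 0xFDD0 && cp <= 0xFDEF;  // 32 code points
        boolean lastTwoOfPlane = (cp & 0xFFFE) == 0xFFFE;   // xxFFFE and xxFFFF
        return inBmpRange || lastTwoOfPlane;
    }

    // Counts noncharacters across the whole Unicode code space.
    static int countNoncharacters() {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNoncharacters()); // 66
    }
}
```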

Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?

Not quite. Applications are allowed to use those code points internally in any way they want. Quoting the standard again:

Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.
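As a rough illustration of the "good practice" the standard describes, the following sketch replaces every noncharacter in received text with U+FFFD rather than deleting it. `NoncharacterFilter` and both of its methods are hypothetical names introduced for this example; how (or whether) an application sanitizes input is its own design decision.

```java
public class NoncharacterFilter {

    // Hand-written check for the 66 noncharacter code points (not a JDK method).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)
                || (cp >= 0 && cp <= Character.MAX_CODE_POINT && (cp & 0xFFFE) == 0xFFFE);
    }

    // Returns a copy of the input with each noncharacter replaced by
    // U+FFFD REPLACEMENT CHARACTER; all other code points are kept as-is.
    static String replaceNoncharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .forEach(cp -> out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("abc\uFFFFdef");
        System.out.println(cleaned.indexOf('\uFFFF')); // -1
        System.out.println(cleaned.indexOf('\uFFFD')); // 3
    }
}
```

Iterating by code point rather than by char means supplementary noncharacters such as U+1FFFE, which occupy two chars in a Java String, are also caught.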

So while you should never encounter such a string from the user, another application or a file, you may well put it into a Java String if you know what you're doing (though this basically means that you cannot use the prescribed CharacterIterator idiom on that string).
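If such a string still has to be walked with a CharacterIterator, one possible workaround (a sketch, not something the documentation prescribes) is to end the loop on the iterator's index instead of on the returned character. It relies only on `getIndex()` and `getEndIndex()`, which are part of the CharacterIterator interface; the class name is made up for this example.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class IndexBoundedTraversal {

    // Collects every char the iterator yields. The loop stops when the
    // position reaches the end index, so a U+FFFF inside the text is not
    // mistaken for DONE.
    static String traverseForward(CharacterIterator iter) {
        StringBuilder seen = new StringBuilder();
        for (char c = iter.first(); iter.getIndex() < iter.getEndIndex(); c = iter.next()) {
            seen.append(c);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";
        String visited = traverseForward(new StringCharacterIterator(s));
        System.out.println(visited.length());  // 7
        System.out.println(visited.equals(s)); // true
    }
}
```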

Is it actually true that valid Unicode strings should not contain \uFFFF?

As quoted above, any string used for interchange must not contain them. Within your application you're free to use them in whatever way you want.

Of course, a Java char, being just a 16-bit unsigned integer, doesn't really care about the value it holds either.
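A quick way to see this, using only standard `java.lang.Character` and `java.text.CharacterIterator` members (the wrapper class and its helper methods are hypothetical):

```java
import java.text.CharacterIterator;

public class CharIsJustANumber {

    // The numeric value stored in the char.
    static int numericValue(char c) {
        return c;
    }

    // True when Java's Unicode tables report the char as unassigned (category Cn).
    static boolean isUnassigned(char c) {
        return Character.getType(c) == Character.UNASSIGNED;
    }

    public static void main(String[] args) {
        char c = '\uFFFF';
        System.out.println(numericValue(c));             // 65535
        System.out.println(c == Character.MAX_VALUE);    // true
        System.out.println(c == CharacterIterator.DONE); // true
        System.out.println(isUnassigned(c));             // true
    }
}
```

The language accepts the value like any other 16-bit number; only the Unicode property lookups know that it is not an assigned character.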

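If that's true, then is Java "broken" for violating the Unicode specification by (for the most part) allowing String to contain \uFFFF anyway?

No. In fact, the section on noncharacters even suggests the use of U+FFFF as a sentinel value: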

In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.

U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.

CharacterIterator follows this in that it returns U+FFFF when no more characters are available. Of course, this means that if you have another use for that code point in your application you may consider using a different non-character for that purpose since U+FFFF is already taken – at least if you're using CharacterIterator.
