Numerical value of a Unicode character in Objective-C


Problem description

Is it possible to get a numerical value from a Unicode character in Objective-C?

@"A" is 0041, @"➜" is 279C, @"Ω" is 03A9, @"झ" is 091D... ?

Solution

OK, so it’s perhaps worth pointing a few things out in a separate answer here. First, the term "character" is ambiguous, so we should choose a more appropriate term depending on what we mean. (See Characters and Grapheme Clusters in the Apple developer docs, as well as the Unicode website for more detail.)

If you are asking for the UTF-16 code unit, then you can use

unichar ch = [myString characterAtIndex:ndx];

Note that this is only equivalent to a Unicode code point in the case where the code point is within the Basic Multilingual Plane (i.e. it is U+FFFF or below).
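To get the four-digit hexadecimal form used in the question (0041, 279C and so on), the code unit only needs to be formatted. The following is a minimal sketch in plain C, which also compiles inside an Objective-C file; `format_code_unit` is a hypothetical helper name, not a Foundation API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes the four-digit uppercase hexadecimal form of
   one UTF-16 code unit into out, which must hold at least 5 bytes
   (four digits plus the terminating NUL). */
static void format_code_unit(uint16_t unit, char *out) {
    assert(out != NULL);
    snprintf(out, 5, "%04X", (unsigned)unit);
}
```

In Objective-C the value returned by characterAtIndex: can be passed straight in, since unichar is a 16-bit unsigned type. Formatting it with `[NSString stringWithFormat:@"%04X", ch]` should give the same digits.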

If you are asking for the Unicode code point, then you should be aware that UTF-16 supports characters outside of the BMP (i.e. U+10000 and above) using surrogate pairs. Thus there will be two UTF-16 code units for any code point at or above U+10000. To detect this case, you need to do something like

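uint32_t codepoint = [myString characterAtIndex:ndx];

// 0xD800-0xDBFF is a high surrogate, the first half of a surrogate pair
if ((codepoint & 0xfc00) == 0xd800) {
  unichar ch2 = [myString characterAtIndex:ndx + 1];

  // Combine the low 10 bits of each half, then add the 0x10000 offset
  codepoint = (((codepoint & 0x3ff) << 10) | (ch2 & 0x3ff)) + 0x10000;
}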

Note that in production code, you should also test for and cope with the case where the surrogate pair has been truncated somehow.
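One way the extra checks might look is sketched below in plain C, working on a buffer of code units (for example one filled with getCharacters:range:). The function name `decode_utf16_at` and its convention of returning 0 for an unpaired surrogate are assumptions made for this illustration, not part of any Apple API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the code point that starts at units[idx].
   Returns the number of code units consumed (1 or 2) and stores the result
   in *codepoint. Returns 0 if units[idx] is an unpaired surrogate, which
   includes a high surrogate cut off at the end of the buffer. */
static size_t decode_utf16_at(const uint16_t *units, size_t len, size_t idx,
                              uint32_t *codepoint) {
    assert(units != NULL && codepoint != NULL && idx < len);

    uint32_t first = units[idx];

    if ((first & 0xfc00) == 0xd800) {               /* high surrogate */
        if (idx + 1 >= len) return 0;               /* pair truncated */
        uint32_t second = units[idx + 1];
        if ((second & 0xfc00) != 0xdc00) return 0;  /* no low surrogate */
        *codepoint = (((first & 0x3ff) << 10) | (second & 0x3ff)) + 0x10000;
        return 2;
    }
    if ((first & 0xfc00) == 0xdc00) return 0;       /* stray low surrogate */

    *codepoint = first;                             /* BMP code point */
    return 1;
}
```

Returning the number of code units consumed lets a caller advance its index by that amount. What to do on a 0 return (skip the unit, substitute U+FFFD, or fail) is left to the caller.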

Importantly, neither UTF-16 code units nor Unicode code points necessarily correspond to anything that an end-user would regard as a "character" (the Unicode consortium generally refers to this as a grapheme cluster to distinguish it from other possible meanings of "character"). There are many examples, but the simplest to understand are probably the combining diacritical marks. For instance, the character ‘Ä’ can be represented as the Unicode code point U+00C4, or as a pair of code points, U+0041 U+0308.
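To make the ‘Ä’ example concrete, here is a small sketch in plain C that counts code points in a UTF-16 buffer. `count_code_points` is a hypothetical helper, and it assumes the input is well-formed UTF-16 (every surrogate correctly paired).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in well-formed UTF-16 by skipping
   the low (trailing) half of each surrogate pair. */
static size_t count_code_points(const uint16_t *units, size_t len) {
    assert(units != NULL || len == 0);

    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((units[i] & 0xfc00) != 0xdc00) {
            count++;
        }
    }
    return count;
}
```

Given the buffer { 0x00C4 } this counts one code point, and given { 0x0041, 0x0308 } it counts two, although both are displayed as the single character ‘Ä’.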

Sometimes people (like @DietrichEpp in the comments on his answer) will claim that you can deal with this by converting to precomposed form before dealing with your string. This is something of a red herring, because precomposed form only deals with characters that have a precomposed equivalent in Unicode. For example, it will not help with all combining marks; it will not help with Indic or Arabic scripts; it will not help with Hangul Jamos. There are many other cases as well.

If you are trying to manipulate grapheme clusters (things the user might think of as "characters"), you should probably make use of the NSString methods -rangeOfComposedCharacterSequencesForRange: and -rangeOfComposedCharacterSequenceAtIndex:, or the CFString function CFStringGetRangeOfComposedCharactersAtIndex. Obviously you cannot hold a grapheme cluster in an integer variable and it has no inherent numerical value; rather, it is represented by a string of code points, which are represented by a string of code units. For instance:

NSRange gcRange = [myString rangeOfComposedCharacterSequenceAtIndex:ndx];
NSString *graphemeCluster = [myString substringWithRange:gcRange];

Note that graphemeCluster may be arbitrarily long(!)
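To illustrate why the length is unbounded, here is a deliberately reduced sketch in plain C. It is not how the NSString or CFString functions work: it only extends a cluster over code units in the Combining Diacritical Marks block (U+0300 to U+036F) and ignores every other rule of Unicode text segmentation. `simple_cluster_length` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified helper: returns the length in code units of the
   code unit at units[idx] plus any directly following combining diacritical
   marks (U+0300..U+036F only). Real grapheme cluster rules cover far more. */
static size_t simple_cluster_length(const uint16_t *units, size_t len,
                                    size_t idx) {
    assert(units != NULL && idx < len);

    size_t end = idx + 1;
    while (end < len && units[end] >= 0x0300 && units[end] <= 0x036f) {
        end++;
    }
    return end - idx;
}
```

Nothing limits how many marks may follow one base character, so the loop can run for as long as the input keeps supplying them. In real code, prefer the Foundation methods named above.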

Even then, we have ignored the effects of matters such as Unicode’s support for bidirectional text. That is, the order of the code points represented by the code units in your NSString may in some cases be the reverse of what you might expect. The worst cases involve things like English text embedded in Arabic or Hebrew; this is supported by the Cocoa Text system, and so you really can end up with bidirectional strings in your code.

To summarise: generally speaking one should avoid examining NSString and CFString instances unichar by unichar. If at all possible, use an appropriate NSString method or CFString function instead. If you do find yourself examining the UTF-16 code units, please familiarise yourself with the Unicode standard first (I recommend "Unicode Demystified" if you can’t stomach reading through the Unicode book itself), so that you can avoid the major pitfalls.
