Character encoding independent character swap

Question

I like to use this piece of code when I want to reverse a string (when I am not using std::string or other built-in functions in C). As a beginner, when I first thought of this, I had the ASCII table in mind. I think it can work with Unicode too: I assumed that since the difference between values (ASCII etc.) is fixed, it works.

Is there any character encoding in which this code may not work?

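#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[11];
    int len, i;

    strcpy(a, "Particl");
    printf("%s\n", a);
    len = (int)strlen(a);
    /* Swap a[i] with its mirror element using add/subtract, no temporary. */
    for (i = 0; i < len / 2; i++)
    {
        a[i] += a[len - 1 - i];
        a[len - 1 - i] = a[i] - a[len - 1 - i];
        a[i] -= a[len - 1 - i];
    }
    printf("%s\n", a);
    return 0;
}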

Update:

This link is informative in association with this question.

Recommended answer

This will not work with any encoding in which some (not necessarily all) codepoints require more than one char unit to represent, because you are reversing byte-by-byte instead of codepoint-by-codepoint. For the usual 8-bit char this includes all encodings that can represent all of Unicode.

For example: in UTF-16BE, the string "hello" maps to the byte sequence 00 68 00 65 00 6c 00 6c 00 6f. Your algorithm applied to this byte sequence will produce the sequence 6f 00 6c 00 6c 00 65 00 68 00, which is the UTF-16BE encoding of the string "漀氀氀攀栀".
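
The UTF-16BE example can be reproduced with a small sketch. The function names `reverse_bytes` and `reverse_utf16_units` are made up for illustration, and the length is passed explicitly because `strlen` would stop at the first 00 byte of UTF-16 data. The second function assumes an even byte count and text without surrogate pairs, so it is a contrast to the byte-wise version rather than a general fix.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse a buffer one byte at a time, as the loop in the question does
   (a temporary is used here instead of the add/subtract trick). */
void reverse_bytes(unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n / 2; i++) {
        unsigned char tmp = buf[i];
        buf[i] = buf[n - 1 - i];
        buf[n - 1 - i] = tmp;
    }
}

/* Reverse a buffer in 2-byte units so each UTF-16 code unit stays intact.
   Assumption: n is even and the text contains no surrogate pairs. */
void reverse_utf16_units(unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    assert(n % 2 == 0);
    size_t units = n / 2;
    for (size_t i = 0; i < units / 2; i++) {
        size_t j = units - 1 - i;
        for (size_t k = 0; k < 2; k++) {
            unsigned char tmp = buf[2 * i + k];
            buf[2 * i + k] = buf[2 * j + k];
            buf[2 * j + k] = tmp;
        }
    }
}
```

Applied to the bytes 00 68 00 65 00 6c 00 6c 00 6f, `reverse_bytes` yields 6f 00 6c 00 6c 00 65 00 68 00 (the garbled string above), while `reverse_utf16_units` yields 00 6f 00 6c 00 6c 00 65 00 68, which is "olleh".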

It gets worse -- doing a codepoint-by-codepoint reversal of a Unicode string still won't produce the correct results in all cases, because Unicode has many codepoints that act on their surroundings rather than standing alone as characters. As a trivial example, codepoint-reversing the string "Spın̈al Tap", which contains U+0308 COMBINING DIAERESIS, will produce "paT länıpS" -- see how the diaeresis has migrated from the N to the A? The consequences of codepoint-by-codepoint reversal on a string containing bidirectional overrides or conjoining jamo would be even more dire.
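
A minimal sketch of codepoint-by-codepoint reversal makes the combining-mark problem concrete. It assumes the input is valid, NUL-terminated UTF-8 and that the caller supplies an output buffer of at least `strlen(in) + 1` bytes; the name `reverse_utf8_codepoints` is hypothetical. Each code point's bytes stay in order, but nothing keeps a combining mark attached to its base character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reverse a NUL-terminated UTF-8 string code point by code point into out.
   Assumptions: in is valid UTF-8; out has room for strlen(in) + 1 bytes. */
void reverse_utf8_codepoints(const char *in, char *out)
{
    assert(in != NULL && out != NULL);

    size_t end = strlen(in);  /* one past the last byte of the current code point */
    size_t w = 0;             /* write position in out */

    while (end > 0) {
        size_t start = end - 1;
        /* Continuation bytes have the form 10xxxxxx; step back to the lead byte. */
        while (start > 0 && ((unsigned char)in[start] & 0xC0) == 0x80)
            start--;
        /* Copy the whole code point with its bytes in their original order. */
        memcpy(out + w, in + start, end - start);
        w += end - start;
        end = start;
    }
    out[w] = '\0';
}
```

For the UTF-8 input `"Sp\xC4\xB1n\xCC\x88" "al Tap"`, the output is `"paT la\xCC\x88" "n\xC4\xB1pS"`: the bytes CC 88 (U+0308) now follow the "a" instead of the "n". Avoiding that would require reversing by grapheme cluster rather than by code point, which this sketch does not attempt.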
