How do you cope with signed char -> int issues with the standard library?


Question

This is a really long-standing issue in my work, that I realize I still don't have a good solution to...

C naively defined all of its character test functions for an int:

int isspace(int ch);

But chars are often signed, and a full character often doesn't fit in an int, or in any single storage unit that is used for strings**.

And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.

So if you call isspace(*pchar) you can end up with sign-extension problems. They're hard to see, and hence hard to guard against, in my experience.
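A minimal C++ sketch of the usual workaround for the sign-extension hazard, assuming a platform where plain char is signed. The helper names is_space_safe and count_space_bytes are hypothetical, not standard functions. The <cctype> functions expect a value representable as unsigned char (or EOF), so the value is converted to unsigned char before the call:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the standard library).
// Passing a negative char value straight to std::isspace is undefined
// behavior, so convert to unsigned char first.
inline bool is_space_safe(char ch) {
    return std::isspace(static_cast<unsigned char>(ch)) != 0;
}

// Counts whitespace storage units in a narrow string, one char at a time.
// This addresses sign extension only; it knows nothing about
// multi-byte characters.
inline std::size_t count_space_bytes(const std::string& s) {
    std::size_t n = 0;
    for (char ch : s) {
        if (is_space_safe(ch)) ++n;
    }
    return n;
}
```

This only covers the first issue. Each char is still treated as a whole character, which is wrong for variable-width encodings.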

Similarly, isspace() and its ilk all take ints, while the actual width of a character is often unknown without string analysis. That means any modern character library should essentially never be carting around chars or wchar_ts, only pointers/iterators, since only by analyzing the character stream can you know how much of it composes a single logical character. So I am at a bit of a loss as to how best to approach these issues.

I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...


** These issues don't come up for fixed-sized character-encodings that can wholly contain a full character - UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).


So, my question is:

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:

1) Sign extension, and
2) variable-width character issues?

After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8-bit unit.

Please note:

No matter what size your char_type is, it's wrong for most character encoding schemes.

This problem is in the standard C library as well as in the C++ standard library, which still tries to pass around char and wchar_t, rather than string iterators, in the various isspace, isprint, etc. implementations.

Actually, it's precisely those types of functions that break the genericity of std::string. If it only worked in storage units, and didn't try to pretend to understand the meaning of the storage units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...

Thank You

Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.

Solution

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...

Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that the C++03 standard library is not one. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OSes have a 32-bit wchar_t, which lets them handle UTF-32 correctly, but that's an implementation detail that C++ does not guarantee, and it is not remotely sufficient for many Unicode tasks, such as iterating over graphemes (letters).

IBM ICU
libiconv
microUTF-8
UTF-8 CPP, version 1.0
utf8proc
and many more at http://unicode.org/resources/libraries.html.
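
To illustrate the kind of work such a library does for the whitespace test, here is a rough C++11 sketch that walks a UTF-8 string by iterator and tests each decoded code point. The names next_code_point, is_unicode_white_space and count_white_space are hypothetical, and the code point list is the Unicode White_Space property as I understand it. The decoder is deliberately simplified: it does not reject overlong forms or surrogates, so a real library remains the better choice.

```cpp
#include <cstddef>
#include <string>

// Decodes one code point starting at `it` and advances `it` past it.
// Precondition: it != end. Returns U+FFFD for malformed input.
// Simplified: overlong forms and surrogates are not rejected.
inline char32_t next_code_point(std::string::const_iterator& it,
                                std::string::const_iterator end) {
    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // single-byte (ASCII) case
    int extra = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    for (int i = 0; i < extra; ++i) {
        if (it == end) return 0xFFFD;  // truncated sequence
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) return 0xFFFD;  // leave `it` on the bad byte
        cp = (cp << 6) | (c & 0x3F);
        ++it;
    }
    return cp;
}

// Code points with the Unicode White_Space property (assumed list).
inline bool is_unicode_white_space(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085 ||
           cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}

// Counts whitespace characters (not bytes) in a UTF-8 string.
inline std::size_t count_white_space(const std::string& utf8) {
    std::size_t n = 0;
    auto it = utf8.cbegin();
    const auto end = utf8.cend();
    while (it != end) {
        if (is_unicode_white_space(next_code_point(it, end))) ++n;
    }
    return n;
}
```

The test function takes a code point rather than a char, and the caller only ever holds iterators, which is the shape the question asks for. Grapheme clusters would still need a real library.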

If the question is less about specific character testing and more about code practices in general: do whatever your framework does. If you're coding for Linux/Qt/networking, keep everything internally in UTF-8. If you're coding for Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.
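
As a hedged illustration of that boundary translation, the following C++11 sketch converts an internal UTF-32 string to UTF-16, for example before handing text to a UTF-16 API. The names append_utf16 and to_utf16 are hypothetical, and the code assumes every input value is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <string>

// Appends one code point to a UTF-16 string.
// Assumes cp is a valid Unicode scalar value; no validation is done.
inline void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: a surrogate pair (two 16-bit units).
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Converts a whole UTF-32 string to UTF-16.
inline std::u16string to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) append_utf16(out, cp);
    return out;
}
```

A code point above U+FFFF becomes two 16-bit units, which is why UTF-16 is variable-width just like UTF-8.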
