How do you cope with signed char -> int issues with the standard library?


Problem description

This is a really long-standing issue in my work, that I realize I still don't have a good solution to...

C naively defined all of its character test functions for an int:

int isspace(int ch);

But chars are often signed, and a full character often doesn't fit in an int, or in any single storage unit that is used for strings**.

And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.

So if you call isspace(*pchar), you can end up with sign-extension problems. They're hard to see, and hence hard to guard against, in my experience.
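For the sign-extension half of the problem, the usual guard is to convert to unsigned char before the value is widened to int. A minimal sketch follows; the helper name is_space_byte is made up for illustration and is not part of any standard library. It only classifies a single storage unit in the current C locale, so it does nothing for the variable-width half of the problem.

```cpp
#include <cctype>

// Hypothetical helper name, chosen for illustration only.
// The <cctype> functions are only defined for arguments that are EOF or
// representable as unsigned char. A plain char holding a byte >= 0x80 can
// be negative where char is signed, so convert to unsigned char first.
inline bool is_space_byte(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

With this, is_space_byte(*pchar) stays within the range the C library is specified for, even when *pchar holds a high-bit byte.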

Similarly, isspace() and its ilk all take ints, while the actual width of a character is often unknown without string analysis. That means any modern character library should essentially never be carting around chars or wchar_ts, but only pointers/iterators, since only by analyzing the character stream can you know how much of it composes a single logical character. So I am at a bit of a loss as to how best to approach the issues.

I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...


** These issues don't come up for fixed-sized character-encodings that can wholly contain a full character - UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).


So, my question is:

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:

1) Sign extension, and
2) variable-width character issues

After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8-bit unit.
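To make the variable-width point concrete, here is a small sketch that counts code points in a std::string, assuming the string holds well-formed UTF-8. The function name utf8_code_points is hypothetical, and no validation is done.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is ASSUMED to
// hold well-formed UTF-8. Continuation bytes look like 10xxxxxx, so every
// byte that does not match that pattern starts a new code point.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the UTF-8 bytes "caf\xC3\xA9" the string's size() is 5 while this count is 4, so no per-char test can see the last character as a unit. Code points are still not the same thing as user-perceived characters, so this only illustrates the first layer of the problem.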

Please note:

No matter what size your char_type is, it's wrong for most character encoding schemes.

This problem is in the standard C library as well as in the C++ standard library, which still tries to pass around char and wchar_t, rather than string iterators, in the various isspace, isprint, etc. implementations.

Actually, it's precisely those types of functions that break the genericity of std::string. If it only worked in storage units, and didn't try to pretend to understand the meaning of the storage units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...

Thank You

Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.

Solution

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign extension
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...

Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that the C++03 standard library is not. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OSes have a 32-bit wchar_t, which makes them able to correctly handle UTF-32, but that's an implementation detail, is not guaranteed by C++, and is not remotely sufficient for many Unicode tasks, such as iterating over graphemes (letters).

IBM ICU
Libiconv
microUTF-8
UTF-8 CPP, version 1.0
utfproc
and many more at http://unicode.org/resources/libraries.html.

If the question is less about specific character testing and more about code practices in general: do whatever your framework does. If you're coding for Linux/Qt/networking, keep everything internally in UTF-8. If you're coding for Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.
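As a rough illustration of the "keep code points in UTF-32" option, the sketch below decodes UTF-8 into a std::u32string and tests each code point against the Unicode White_Space property. Both function names are hypothetical, the decoder assumes well-formed input and does no validation, and in real code one of the libraries listed above would be the safer choice.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a string ASSUMED to hold well-formed UTF-8
// into UTF-32. Malformed input is not detected here.
inline std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Hypothetical helper: true for code points that carry the Unicode
// White_Space property.
inline bool is_unicode_space(char32_t cp) {
    return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 ||
           cp == 0xA0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}
```

Because char32_t is unsigned and each element is a whole code point, neither sign extension nor multi-byte sequences get in the way of the test. Grapheme clusters are still out of scope for something this small.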
