Why are there no "unsigned wchar_t" and "signed wchar_t" types?


Question

The signedness of char is not standardized. Hence there are signed char and unsigned char types. Therefore functions which work with a single character must use an argument type which can hold both signed char and unsigned char (this type was chosen to be int), because if the argument type were char, we would get type conversion warnings from the compiler (if -Wconversion is used) in code like this:

char c = 'ÿ';
if (islower((unsigned char) c)) ...

warning: conversion to ‘char’ from ‘unsigned char’ may change the sign of the result

(here we consider what would happen if the argument type of islower() was char)

And the thing which makes it work without explicit typecasting is automatic promotion from char to int.

Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.

Some quotations from glibc reference:

it would be legitimate to define wchar_t as char

if wchar_t is defined as char the type wint_t must be defined as int due to the parameter promotion.

So, wchar_t can perfectly well be defined as char, which means that similar rules for wide character types must apply, i.e., there may be implementations where wchar_t is positive, and there may be implementations where wchar_t is negative. From this it follows that there must exist unsigned wchar_t and signed wchar_t types (for the same reason as there are unsigned char and signed char types).

Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of the signedness of wchar_t). Does anybody know what this means? Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character? In other words, is it true that a sign-extended wchar_t is a valid value? See also this question.

Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?

Consider this example:

#include <locale.h>
#include <ctype.h>
int main (void)
{
  setlocale(LC_CTYPE, "fr_FR.ISO-8859-1");

  /* 11111111 */
  char c = 'ÿ';

  if (islower(c)) return 0;
  return 1;
}

To make it portable, we need the cast to '(unsigned char)'. This is necessary because char may be equivalent to signed char, in which case a byte where the top bit is set would be sign-extended when converting to int, yielding a value that is outside the range of unsigned char.
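The cast described above can be wrapped in a small helper. This is only a sketch: the name is_lower_char is hypothetical and does not come from the question. It assumes nothing about whether plain char is signed.

```c
#include <ctype.h>

/* Hypothetical helper (not from the question): classify a plain char
   safely, whether char is signed or unsigned on this implementation. */
int is_lower_char(char c)
{
    /* Converting to unsigned char first maps a negative char such as
       (char)0xFF to 255, a value islower() is required to accept. */
    return islower((unsigned char)c) != 0;
}
```

Calling is_lower_char(c) instead of islower(c) keeps the argument inside the range the <ctype.h> functions are specified for.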

Now, why is this scenario different from the following example for wide characters?

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "");
  wchar_t wc = L'ÿ';

  if (iswlower(wc)) return 0;
  return 1;
}

We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.

Why are there no unsigned wchar_t and signed wchar_t types?

UPDATE

Do the standards say that casting to unsigned int and to int in the following two programs is guaranteed to be correct? (I just replaced wint_t and wchar_t with their actual meaning in glibc)

#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  unsigned int wc;
  wc = getwchar();
  putwchar((int) wc);
}

--

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  int wc;
  wc = L'ÿ';
  if (iswlower((unsigned int) wc)) return 0;
  return 1;
}

Answer

TL;DR:

"Why are there no unsigned wchar_t and signed wchar_t types?"

Because C's wide-character handling facilities were defined such that they are not needed.


In more detail,

"The signedness of char is not standardized."

To be precise, "The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char." (C2011, 6.2.5/15)
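As a rough illustration of that rule, the choice an implementation made can be observed through CHAR_MIN from <limits.h>. The helper name char_is_signed is hypothetical and used only for this sketch.

```c
#include <limits.h>

/* Hypothetical helper: reports whether plain char behaves like signed char
   on the implementation compiling this code. */
int char_is_signed(void)
{
    /* CHAR_MIN is 0 when char is unsigned and equals SCHAR_MIN otherwise. */
    return CHAR_MIN < 0;
}
```

The result differs between platforms, which is exactly why portable code cannot assume either answer.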

"Hence there are signed char and unsigned char types."

"Hence" implies causation, which would be hard to argue clearly, but certainly signed char and unsigned char are more appropriate when you want to handle numbers, as opposed to characters.

"Therefore functions which work with a single character must use an argument type which can hold both signed char and unsigned char"

No, not at all. Standard library functions that work with individual characters could easily be defined in terms of type char, regardless of whether that type is signed, because the library implementation does know its signedness. If that were a problem then it would apply equally to the string functions, too -- char would be useless.

Your example of getchar() is non-apposite. It returns int rather than a character type because it needs to be able to return an error indicator that does not correspond to any character. Moreover, the code you present does not correspond to the accompanying warning message: it contains a conversion from int to unsigned char, but no conversion from char to unsigned char.

Some other character-handling functions accept int parameters or return values of type int both for compatibility with getchar() and other stdio functions, and for historic reasons. In days of yore, you couldn't actually pass a char at all -- it would always be promoted to int, and that is what the functions would (and must) accept. One cannot later change the argument type, evolution of the language notwithstanding.
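The reason for the int return type can be sketched with fgetc(), which follows the same return convention as getchar(). The helper count_bytes is hypothetical and exists only for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes that can be read from fp. */
long count_bytes(FILE *fp)
{
    long n = 0;
    int c;  /* int, not char: it must hold every byte value and also EOF */

    while ((c = fgetc(fp)) != EOF)
        ++n;
    return n;
}
```

If c were declared as char, a 0xFF byte could compare equal to EOF where char is signed, and the loop could stop early; where char is unsigned, the comparison with EOF could never succeed.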

"Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t."

C90 isn't really relevant any longer, but no doubt it says something very similar to C2011 (7.19/2), which describes wchar_t as

"an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales [...]."

Your quotations from the glibc reference are non-authoritative, except possibly for glibc only. They appear in any case to be commentary, not specification, and it's unclear why you raise them. Certainly, though, at least the first is correct. Referring to the standard, if all the members of the largest extended character set specified among the locales supported by a given implementation could fit in a char then that implementation could define wchar_t as char. Such implementations used to be much more common than they are today.

You ask several questions:

"Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of the signedness of wchar_t). Does anybody know what this means?"

I think it means that whoever communicated that to you doesn't know what they are talking about, or perhaps that what they are talking about is something different than the requirements placed by the C standard. You will find that in practice, character sets are defined with only non-negative character codes, but that is not a constraint placed by the C standard.

"Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character?"

The C standard does not say or imply that. You can store the value of any supported character in a wchar_t. In particular, if an implementation supports a character set containing character codes exceeding 32767, then you can store those in a wchar_t.

"In other words, is it true that a sign-extended wchar_t is a valid value?"

The C standard does not say or imply that. It does not even say whether wchar_t is a signed type (if not, then sign extension is meaningless for it). If it is a signed type, then there is no guarantee about whether sign-extending a value representing a character in some supported character set (which value could, in principle, be negative) will produce a value that also represents a character in that character set, or in any other supported character set. The same is true of adding 1 to a wchar_t value.
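Since the standard leaves the signedness of wchar_t open, code that cares has to ask the implementation. A minimal sketch, assuming only the WCHAR_MIN macro from <wchar.h>; the helper name wchar_is_signed is hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper: reports whether wchar_t is a signed type here. */
int wchar_is_signed(void)
{
    /* WCHAR_MIN is 0 for an unsigned wchar_t and negative for a signed one. */
    return WCHAR_MIN < 0;
}
```

Both outcomes occur in practice, so the answer should be treated as a property of the platform rather than of the language.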

"Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?"

It depends what you mean by "valid". The standard says that wint_t

"is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set."

(C2011, 7.29.1/2)

wchar_t must be able to hold any value corresponding to a member of the extended character set, in any supported locale. wint_t must be able to hold all of those values, too. It may be, however, that wchar_t is capable of representing values that do not correspond to any character in any supported character set. Such values are valid in the sense that the type can represent them. wint_t is not required to be able to represent such values.

For example, if the largest extended character set of any supported locale uses character codes up to but not exceeding 32767, then an implementation would be free to implement wchar_t as an unsigned 16-bit integer, and wint_t as a signed 16-bit integer. The values representable by wchar_t that do not correspond to extended characters are then not representable by wint_t (but wint_t still has many candidates for its required value that does not correspond to any character).
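The guarantee for actual characters can be checked with a small round-trip test. This is a sketch only; fits_in_wint is a hypothetical name, and the check says nothing about wchar_t values that do not correspond to any character.

```c
#include <wchar.h>

/* Hypothetical helper: true when wc survives a round trip through wint_t
   and does not collide with WEOF. */
int fits_in_wint(wchar_t wc)
{
    wint_t wi = (wint_t)wc;

    return wi != WEOF && (wchar_t)wi == wc;
}
```

For a member of the extended character set, such as L'a', this is expected to return 1 on any conforming implementation. For other wchar_t values it may return 0, which matches the point made above.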

With respect to the character and wide-character classification functions, the only answer is that the differences simply arise from different specifications. The char classification functions are defined to work with the same values that getchar() is defined to return: either EOF or a character value converted, if necessary, to unsigned char. The wide-character classification functions, on the other hand, accept arguments of type wint_t, which can represent the values of all wide characters unchanged, so there is no need for a conversion.
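To illustrate the wide-character side, here is a minimal sketch that passes wchar_t values straight to iswlower() without any cast. The helper count_lower_wide is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: counts the lowercase wide characters in s. */
size_t count_lower_wide(const wchar_t *s)
{
    size_t n = 0;

    for (; *s != L'\0'; ++s) {
        /* The wchar_t argument converts implicitly to wint_t; no cast. */
        if (iswlower(*s))
            ++n;
    }
    return n;
}
```

The narrow-character equivalent would need islower((unsigned char)*s), which is the asymmetry the question is about.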

You claim in this regard that

"We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type."

No and maybe. You do not need to convert the wchar_t argument to iswlower() to any other type, and in particular, you do not need to convert it to an explicitly unsigned type. The wide character classification functions are not analogous to the regular character classification functions in this respect, having been designed with the benefit of hindsight. As for unsigned wchar_t, C does not require such a type to exist, so portable code should not use it, but it may exist in some implementations.


Regarding the update appended to the question:

"Do the standards say that casting to unsigned int and to int in the following two programs is guaranteed to be correct? (I just replaced wint_t and wchar_t with their actual meaning in glibc)"

The standard says nothing of the sort about conforming implementations in general. I'll suppose, however, that you mean to ask specifically about conforming implementations for which wchar_t is int and wint_t is unsigned int.

On such an implementation, your first program is flawed because it does not account for the possibility that getwchar() returns WEOF. Converting WEOF to type wchar_t, if doing so does not cause a signal to be raised, is not guaranteed to produce a value that corresponds to any wide character. Passing the result of such a conversion to putwchar() therefore does not exhibit defined behavior. Moreover, if WEOF is defined with the same value as UINT_MAX (which is not representable by int) then the conversion of that value to int has implementation-defined behavior independently of the putwchar() call.

On the other hand, I think the key point you are struggling with is that if the value returned by getwchar() in the first program is not WEOF, then it is guaranteed to be one that is unchanged by conversion to wchar_t. Your first program will perform as appears to be intended in that case, but the cast to int (or wchar_t) is unnecessary.

Similarly, the second program is correct provided that the wide-character literal corresponds to a character in the applicable extended character set, but the cast is unnecessary and changes nothing. The wchar_t value of such a literal is guaranteed to be representable by type wint_t, so the cast changes the type of its operand, but not the value. (But if the literal does not correspond to a character in the extended character set then the behavior is implementation-defined.)

On the third hand, if your objective is to write strictly-conforming code then the right thing to do, and indeed the intended usage mode of these particular wide-character functions, would be this:

#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  wint_t wc = getwchar();
  if (wc != WEOF) {
    // No cast is necessary or desirable
    putwchar(wc);
  }
}

and this:

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  wchar_t wc = L'ÿ';
  // No cast is necessary or desirable
  if (iswlower(wc)) return 0;
  return 1;
}
