Is it safe to call the functions from <cctype> with char arguments?
Question
The C programming language says that the functions from <ctype.h> follow a common requirement:
ISO C99, 7.4p1:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
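To make the quoted requirement concrete, here is a minimal sketch of the range check it implies. The helper names is_valid_ctype_arg and promote_signed are hypothetical, introduced only for illustration; they are not part of any standard library.

```cpp
#include <climits>  // UCHAR_MAX
#include <cstdio>   // EOF

// Hypothetical helper: true when `value` is a legal argument for the
// <cctype> functions, i.e. it is representable as unsigned char or is EOF.
bool is_valid_ctype_arg(int value) {
    return value == EOF || (value >= 0 && value <= UCHAR_MAX);
}

// Hypothetical helper: shows what a negative character becomes when it is
// passed where an int is expected.
int promote_signed(signed char c) {
    return c;  // integral promotion preserves the negative value
}
```

On an implementation where char is signed, passing a char holding -95 behaves like promote_signed(-95): the function receives -95, which is outside the permitted range unless it happens to equal EOF. Converting to unsigned char first yields 161, which is in range.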
This means that the following code is unsafe:
#include <ctype.h>
#include <stddef.h>

int upper(const char *s, size_t index) {
    return toupper(s[index]); /* undefined if s[index] is negative */
}
If this code is executed on an implementation where char has the same value space as signed char and there is a character with a negative value in the string, this code invokes undefined behavior. The correct version is:
int upper(const char *s, size_t index) {
    return toupper((unsigned char) s[index]);
}
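In C++ code the same cast can be wrapped once and reused, for example with std::transform. This is only a sketch: to_upper_char and to_upper_copy are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical wrapper: convert to unsigned char before calling
// std::toupper, so the argument is always in the permitted range.
char to_upper_char(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: upper-case a copy of the string one char at a time.
std::string to_upper_copy(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(), to_upper_char);
    return text;
}
```

Passing to_upper_char to the algorithm, rather than std::toupper directly, keeps the cast in one place and avoids handing negative char values to the <cctype> function.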
Nevertheless, I see many examples in C++ that ignore this possibility of undefined behavior. So is there anything in the C++ standard that guarantees that the first version above will not lead to undefined behavior, or are all those examples wrong?
[Additional Keywords: ctype cctype isalnum isalpha isblank iscntrl isdigit isgraph islower isprint ispunct isspace isupper isxdigit tolower]
Recommended answer
For what it's worth, the Solaris Studio compilers (using stlport4) are one compiler suite that produces an unexpected result here. Compiling and running this:
#include <stdio.h>
#include <cctype>

int main() {
    char ch = '\xa1'; // '¡' in latin-1 locales + UTF-8
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}
gives me:
kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out
is whitespace: 8
For reference:
$ CC -V
CC: Studio 12.5 Sun C++ 5.14 SunOS_i386 2016/05/31
Of course, this is consistent with the C++ standard, which leaves the behavior undefined, but it's definitely surprising.
Since it was pointed out that the above version contained undefined behavior in the attempt to assign char ch = '\xa1' due to integer overflow, here's a version that avoids that and still produces the same output:
#include <stdio.h>
#include <cctype>

int main() {
    char ch = -95;
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}
And that still prints 8 on my Solaris VM:
kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out
is whitespace: 8
EDIT 2: And here's a program that might otherwise look sane but gives an unexpected result due to UB in the use of std::isspace():
#include <cstdio>
#include <cstring>
#include <cctype>

static int count_whitespace(const char* str, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (std::isspace(str[i])) // oops!
            count += 1;
    return count;
}

int main() {
    const char* batman = "I am batman\xa1";
    int n = std::strlen(batman);
    std::printf("%i\n", count_whitespace(batman, n));
    return 0;
}
And, on my Solaris machine:
kevin@solaris:~/scratch
$ CC whitespace.cpp && ./a.out
3
Note that, depending on how you rearrange this program, you will probably get the expected result of two whitespace characters; that is, there is almost certainly some compiler optimization kicking in that takes advantage of this UB to give you the wrong result faster.
You could imagine this biting you if you were, for example, attempting to tokenize a UTF-8 string by searching for (non-multibyte) whitespace characters in the string. Such a program would behave correctly when casting str[i] to unsigned char.
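As a sketch of that fix, the counting loop can convert each element to unsigned char before the call. The name count_whitespace_fixed is hypothetical, and the switch to std::size_t for the length is a choice made for this example, not something the original program does.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical corrected variant of the counting loop: each char is
// converted to unsigned char before it reaches std::isspace, so the
// argument is always within the range the standard requires.
std::size_t count_whitespace_fixed(const char* str, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; i++)
        if (std::isspace(static_cast<unsigned char>(str[i])))
            count += 1;
    return count;
}
```

Called as count_whitespace_fixed(batman, std::strlen(batman)) on the string "I am batman\xa1", this should report the two space characters in the default "C" locale, whether char is signed or unsigned on the platform.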