Is it safe to call the functions from <cctype> with char arguments?

Question

The C programming language says that the functions from <ctype.h> follow a common requirement:

ISO C99, 7.4p1:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

This means that the following code is unsafe:

int upper(const char *s, size_t index) {
  return toupper(s[index]);
}

If this code is executed on an implementation where char has the same value space as signed char and there is a character with a negative value in the string, this code invokes undefined behavior. The correct version is:

int upper(const char *s, size_t index) {
  return toupper((unsigned char) s[index]);
}

Nevertheless I see many examples in C++ that don't care about this possibility of undefined behavior. So is there anything in the C++ standard that guarantees that the above code will not lead to undefined behavior, or are all the examples wrong?

[Additional Keywords: ctype cctype isalnum isalpha isblank iscntrl isdigit isgraph islower isprint ispunct isspace isupper isxdigit tolower]

Recommended answer

For what it's worth, the Solaris Studio compilers (using stlport4) are one such compiler suite that produce an unexpected result here. Compiling and running this:

#include <stdio.h>
#include <cctype>

int main() {
    char ch = '\xa1'; // '¡' in latin-1 locales + UTF-8
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}

gives me:

kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out 
is whitespace: 8

For reference:

$ CC -V
CC: Studio 12.5 Sun C++ 5.14 SunOS_i386 2016/05/31

Of course, the C++ standard permits this result, since passing a negative char value here is undefined behavior, but it's definitely surprising.

EDIT: Since it was pointed out that the above version contained undefined behavior in the attempt to assign char ch = '\xa1' due to integer overflow, here's a version that avoids that and still retains the same output:

#include <stdio.h>
#include <cctype>

int main() {
    char ch = -95;
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}

And that does still print 8 on my Solaris VM:

kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out 
is whitespace: 8


EDIT 2: And here's a program that might otherwise look sane but gives an unexpected result due to UB in the use of std::isspace():

#include <cstdio>
#include <cstring>
#include <cctype>

static int count_whitespace(const char* str, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (std::isspace(str[i]))  // oops!
            count += 1;
    return count;
}

int main() {
    const char* batman = "I am batman\xa1";
    int n = std::strlen(batman);
    std::printf("%i\n", count_whitespace(batman, n));
    return 0;
}

And, on my Solaris machine:

kevin@solaris:~/scratch
$ CC whitespace.cpp && ./a.out
3

Note that depending on how you permute this program, you'll probably get the expected result of two whitespace characters; that is, there is almost certainly some compiler optimization kicking in that takes advantage of this UB to give you the wrong result faster.

You could imagine this biting you in the face if you were, for example, attempting to tokenize a UTF-8 string by searching for (non-multibyte) whitespace characters in the string. Such a program would behave correctly when casting str[i] to unsigned char.
