C / C++ UTF-8 upper/lower case conversions

Question

The Problem: There is a method with a corresponding test-case that works on one machine and fails on the other (details below). I assume there's something wrong with the code, causing it to work by chance on the one machine. Unfortunately I cannot find the problem.

Please note that the usage of std::string and utf-8 encoding are requirements I have no real influence on. Using C++ methods would be totally fine, but unfortunately I failed to find anything. Hence the use of C-functions.

The method:

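#include <cassert>
#include <cctype>    // toupper
#include <climits>   // MB_LEN_MAX
#include <cstdlib>   // MB_CUR_MAX
#include <cwchar>    // mbrtowc, wcrtomb, mbstate_t
#include <cwctype>   // towupper
#include <string>

std::string firstCharToUpperUtf8(const std::string& orig) {
  std::string retVal;
  retVal.reserve(orig.size());
  std::mbstate_t state = std::mbstate_t();
  // MB_CUR_MAX is not a compile-time constant, so size the buffer with MB_LEN_MAX.
  char buf[MB_LEN_MAX + 1];
  size_t i = 0;
  if (orig.size() > 0) {
    // Compare as unsigned char: whether plain char is signed is implementation-defined.
    const unsigned char first = static_cast<unsigned char>(orig[i]);
    if (first < 0x80) {
      // Single-byte (ASCII) character.
      retVal += static_cast<char>(toupper(first));
      ++i;
    } else {
      // Multi-byte character: decode, convert to upper case, encode again.
      wchar_t wChar;
      int len = static_cast<int>(mbrtowc(&wChar, &orig[i], MB_CUR_MAX, &state));
      // If this assertion fails, there is an invalid multi-byte character.
      // However, this usually means that the locale is not utf8.
      // Note that the default locale is always C. Main classes need to set it
      // to utf8, even if the system's default is utf8 already.
      assert(len > 0 && len <= static_cast<int>(MB_CUR_MAX));
      i += len;
      int ret = static_cast<int>(wcrtomb(buf, towupper(wChar), &state));
      assert(ret > 0 && ret <= static_cast<int>(MB_CUR_MAX));
      buf[ret] = 0;
      retVal += buf;
    }
  }
  // Copy the rest of the string unchanged.
  for (; i < orig.size(); ++i) {
    retVal += orig[i];
  }
  return retVal;
}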

The test:

TEST(StringUtilsTest, firstCharToUpperUtf8) {
  setlocale(LC_CTYPE, "en_US.utf8");
  ASSERT_EQ("Foo", firstCharToUpperUtf8("foo"));
  ASSERT_EQ("Foo", firstCharToUpperUtf8("Foo"));
  ASSERT_EQ("#foo", firstCharToUpperUtf8("#foo"));
  ASSERT_EQ("ßfoo", firstCharToUpperUtf8("ßfoo"));
  ASSERT_EQ("Éfoo", firstCharToUpperUtf8("éfoo"));
  ASSERT_EQ("Éfoo", firstCharToUpperUtf8("Éfoo"));
}

The failed test (only happens on one of two machines):

Failure
Value of: firstCharToUpperUtf8("ßfoo")
  Actual: "\xE1\xBA\x9E" "foo"
Expected: "ßfoo"
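
To see which character the failing machine actually produced, the three bytes in the "Actual" value can be decoded by hand. The following is only a minimal sketch: `decodeFirstCodePoint` is a hypothetical helper that is not part of the code above, and it does not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration: decodes the first UTF-8 sequence
// (1 to 4 bytes) of `s` into a Unicode code point. Returns 0xFFFD
// (the replacement character) for empty, truncated, or malformed input.
std::uint32_t decodeFirstCodePoint(const std::string& s) {
  if (s.empty()) return 0xFFFD;
  const unsigned char b0 = static_cast<unsigned char>(s[0]);
  std::size_t extra = 0;   // number of continuation bytes expected
  std::uint32_t cp = 0;    // code point being assembled
  if (b0 < 0x80) {
    extra = 0; cp = b0;                 // 0xxxxxxx
  } else if ((b0 & 0xE0) == 0xC0) {
    extra = 1; cp = b0 & 0x1F;          // 110xxxxx
  } else if ((b0 & 0xF0) == 0xE0) {
    extra = 2; cp = b0 & 0x0F;          // 1110xxxx
  } else if ((b0 & 0xF8) == 0xF0) {
    extra = 3; cp = b0 & 0x07;          // 11110xxx
  } else {
    return 0xFFFD;                      // stray continuation or invalid lead byte
  }
  if (s.size() < extra + 1) return 0xFFFD;
  for (std::size_t k = 1; k <= extra; ++k) {
    const unsigned char b = static_cast<unsigned char>(s[k]);
    if ((b & 0xC0) != 0x80) return 0xFFFD;  // continuation must be 10xxxxxx
    cp = (cp << 6) | (b & 0x3F);
  }
  return cp;
}
```

Calling it on `"\xE1\xBA\x9E" "foo"` yields 0x1E9E, whereas the expected "ß" is the two-byte sequence `"\xC3\x9F"`, which decodes to 0xDF. So the first character was changed to a different code point rather than left alone.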

Both machines have the locale en_US.utf8 installed. However, they use different versions of libc. The test passes on the machine with GLIBC_2.14, regardless of where the binary was compiled. It fails on the other machine, where only a locally compiled binary can run at all, because a binary built elsewhere lacks the proper libc version.

Either way, there is a machine that compiles this code, runs it, and fails the test. There has to be something wrong with the code, and I wonder what. Pointers to C++ methods (the STL in particular) would also be great. Boost and other libraries should be avoided due to other outside requirements.

Recommended answer

Lowercase sharp s: ß; uppercase sharp s: ẞ. Did you use the uppercase version in your assert? It seems that glibc 2.14 follows a pre-5.1 version of Unicode, which has no uppercase sharp s, while the libc on the other machine uses Unicode 5.1, where ẞ = U+1E9E ...
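
If the test should expect "ßfoo" on both machines, one possible workaround is to skip the library's mapping for ß and leave that character unchanged. This is only a sketch under stated assumptions: `upperKeepingSharpS` is a hypothetical wrapper that does not appear in the question, and comparing against 0x00DF assumes that `wchar_t` values are Unicode code points, which holds for glibc but is not guaranteed by the C or C++ standards.

```cpp
#include <cassert>
#include <cwctype>

// Hypothetical wrapper, not part of the original code.
// Assumption: wchar_t holds Unicode code points (true for glibc).
wchar_t upperKeepingSharpS(wchar_t c) {
  const wchar_t kSmallSharpS = 0x00DF;  // ß, LATIN SMALL LETTER SHARP S
  if (c == kSmallSharpS) {
    // Some libc versions map ß to U+1E9E (ẞ), others leave it as is.
    // Returning it unchanged makes the result the same on both.
    return c;
  }
  // Everything else goes through the locale-dependent library mapping.
  return static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(c)));
}
```

In `firstCharToUpperUtf8`, the call `towupper(wChar)` would then become `upperKeepingSharpS(wChar)`. Whether keeping ß or producing ẞ is the right result depends on the requirements; the alternative is to change the test so that it accepts whichever mapping the installed libc provides.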
