How to work with UTF-8 in C++, Conversion from other Encodings to UTF-8

Question

I don't know how to solve that:

Imagine, we have 4 websites:

  • A:UTF-8
  • B:ISO-8859-1
  • C:ASCII
  • D:UTF-16

My program, written in C++, does the following: it downloads a website and parses it. But it has to understand the content. My problem is not the parsing, which is done with ASCII characters like ">" or "<".

The problem is that the program should extract all words from the website's text. A word is any combination of alphanumeric characters. Then I send these words to a server. The database and the web frontend use UTF-8. So my questions are:

  • How can I convert "any" (or the most used) character encoding to UTF-8?
  • How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
  • Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?

Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.

I know about UTF8-CPP but it has no is*() functions. And as I read, it does not convert from other character encodings to UTF-8. Only from UTF-* to UTF-8.

I forgot to say, that the program has to be portable: Windows, Linux, ...

Recommended answer

How can I convert "any" (or the most used) character encoding to UTF-8?

ICU (International Components for Unicode) is the solution here. It is generally considered to be the last say in Unicode support. Even Boost.Locale and Boost.Regex use it when it comes to Unicode. See my comment on Dory Zidon's answer as to why I recommend using ICU directly, instead of wrappers (like Boost).

You create a converter for a given encoding...

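#include <unicode/ucnv.h>

UErrorCode err = U_ZERO_ERROR;
// Open a converter for the source encoding; err reports any failure.
UConverter * converter = ucnv_open( "ISO-8859-1", &err );
if ( U_SUCCESS( err ) )
{
    // ... convert with the ucnv_* functions ...
    ucnv_close( converter );
}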

...and then use the UnicodeString class as appropriate.
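
To make concrete what such a converter does, here is a minimal hand-written sketch for the simplest of the four cases, ISO-8859-1. It relies on the fact that every ISO-8859-1 byte value equals its Unicode code point (U+0000 to U+00FF). The function name latin1_to_utf8 is made up for this illustration and is not part of ICU; for other encodings a table-driven converter such as ICU's is the practical route.

```cpp
#include <string>

// Convert ISO-8859-1 bytes to UTF-8 (illustrative helper, not an ICU API).
// Each input byte is the code point U+0000..U+00FF, so it needs at most
// two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size());
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            // ASCII range: identical in UTF-8.
            utf8.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF: lead byte 110xxxxx, continuation 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return utf8;
}
```

ASCII input passes through unchanged, which is why website C needs no conversion at all.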

I think wchar_t does not work because it is 2 bytes long.

The size of wchar_t is implementation-defined. AFAICR, on Windows it is 2 bytes (UCS-2 / UTF-16, depending on Windows version), and on Linux it is 4 bytes (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.

Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?

Not in their UTF-8 encoding, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 are the better choice. The abovementioned functions do exist for Unicode code points (i.e., UChar32); ref. uchar.h.
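
As a rough illustration of why a fixed-width internal form is easier to work with, the following sketch decodes UTF-8 into a std::u32string using only the standard library. The name utf8_to_utf32 is invented for this example; it is not ICU's API, and it deliberately skips checks a production decoder needs (overlong forms, surrogates, values above U+10FFFF).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 into UTF-32 code points (illustrative helper).
// Throws on truncated input or invalid lead/continuation bytes.
// Sketch only: overlong encodings and surrogates are not rejected.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Once decoded, size() counts code points rather than bytes, and each element can be handed to a per-code-point classifier such as the ones in ICU's uchar.h.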

Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.

Check BreakIterator.
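
BreakIterator applies Unicode word-boundary rules. For the simpler definition used in the question (a word is any run of alphanumeric characters), the loop can be sketched as below. This is only an illustration under that assumption: extract_words is a hypothetical helper, and the classifier is passed in as a parameter so that the example does not depend on any particular library.

```cpp
#include <functional>
#include <string>
#include <vector>

// Split a code point sequence into maximal runs of characters accepted
// by is_word_char (illustrative helper; the predicate is caller-supplied).
std::vector<std::u32string> extract_words(
        const std::u32string& text,
        const std::function<bool(char32_t)>& is_word_char) {
    std::vector<std::u32string> words;
    std::u32string current;
    for (char32_t cp : text) {
        if (is_word_char(cp)) {
            current.push_back(cp);       // still inside a word
        } else if (!current.empty()) {
            words.push_back(current);    // a separator ends the word
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);  // trailing word
    return words;
}
```

With ICU available, the predicate could wrap u_isalnum() from unicode/uchar.h, for example a lambda that returns u_isalnum(static_cast<UChar32>(c)) != 0.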

I forgot to say, that the program has to be portable: Windows, Linux, ...

In case I haven't said it already, do use ICU, and save yourself tons of trouble. Even if it might seem a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (using it on Windows, Linux, and AIX myself), and you will use it again and again and again in projects to come, so time invested in learning its API is not wasted.
