C++ UTF-16 to char conversion (Linux/Ubuntu)

Question

I am trying to help a friend with a project that was supposed to take an hour and has now taken three days. Needless to say, I feel very frustrated and angry ;-) ooooouuuu... I breathe.

So the program, written in C++, just reads a bunch of files and processes them. The problem is that my program reads files that use UTF-16 encoding (because the files contain words written in different languages), and a simple use of ifstream just doesn't seem to work (it reads and outputs garbage). It took me a while to realise that this was because the files were in UTF-16.

Now I have spent literally the whole afternoon on the web trying to find info about READING UTF-16 files and converting the content of a UTF-16 line to char! I just can't seem to! It's a nightmare. I tried to learn about <locale> and <codecvt>, wstring, etc., which I have never used before (I specialise in graphics apps, not desktop apps). I just can't get it.

This is what I have done so far (but it doesn't work):

std::wifstream file2(fileFullPath);
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);
while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}



That's the most I could come up with, but it doesn't even work, and it doesn't do anything better. But the problem is that I don't understand what I am doing in the first place anyway.

SO PLEASE PLEASE HELP! It is really driving me crazy that I can't even read a G*** D*** text file.

On top of that, my friend uses Ubuntu (I use clang++), and this code needs -stdlib=libc++, which doesn't seem to be supported by gcc on his side (even though he uses a pretty advanced version of gcc, which is 4.6.3 I believe). So I am not even sure that using codecvt and locale is a good idea (as in "possible"). Would there be a better (or another) option?

If I convert all the files to utf-8 just from the command line (using a linux command) am I going to potentially lose information?

Thanks a lot, I will be forever grateful if you can help me with this.

Recommended answer


"If I convert all the files to utf-8 just from the command line (using a linux command) am I going to potentially lose information?"

No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.
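
To make the "lossless" claim concrete, here is a minimal sketch of the UTF-16 to UTF-8 mapping written by hand, without <codecvt>. This may also be an option on older compilers such as the gcc 4.6.3 mentioned in the question, assuming C++11 mode is available for std::u16string. The function name utf16_to_utf8 is made up for this illustration, and the input is assumed to already be a sequence of 16-bit code units in native byte order (no BOM handling, no file I/O).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the original answer): encodes UTF-16
// code units as UTF-8. Every valid code point has exactly one UTF-8 form,
// so no information is lost in the conversion.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed as well
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {                       // 1 byte (ASCII)
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, utf16_to_utf8(u"\u20AC") yields the three bytes E2 82 AC, and a surrogate pair such as D83D DE00 becomes the single four-byte sequence F0 9F 98 80.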

When wide characters were introduced they were intended to be a text representation used exclusively internal to a program, and never written to disk as wide characters. The wide streams reflect this by converting the wide characters you write out to narrow characters in the output file, and converting narrow characters in a file to wide characters in memory when reading.

std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).

std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ASCII in the file is converted to wide characters.

Of course the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use that facet to convert from wchar_t to char when writing, and from char to wchar_t when reading.
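
As a rough way to observe this, the sketch below writes wide text through a wofstream and then reads the file back as raw bytes. The function name and the scratch-file path are placeholders chosen for this example, and it assumes the default "C" locale and ASCII-only text.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes `text` through a wide stream, then returns
// the raw bytes that actually ended up in the file at `path`.
std::string bytes_written_by_wide_stream(const std::string& path,
                                         const std::wstring& text) {
    {
        std::wofstream wout(path.c_str());
        wout << text;  // wchar_t -> char via the stream's codecvt facet
    }                  // leaving the scope closes and flushes the stream
    std::ifstream raw(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(raw),
                       std::istreambuf_iterator<char>());
}
```

With the default locale, writing L"Hello" this way should leave five one-byte characters in the file rather than five wchar_t values.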

However, since some people started writing files out in UTF-16, other people have just had to deal with it. The way they do that with C++ streams is by creating codecvt facets that treat char as holding half a UTF-16 code unit, which is what codecvt_utf16 does.
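
The "half a code unit" idea can be illustrated by joining bytes by hand. This is only a sketch of the concept, not how codecvt_utf16 is implemented; bytes_to_code_units is a made-up name, and the caller is assumed to know the byte order already.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: joins pairs of bytes into UTF-16 code units.
// `little_endian` says whether the low byte comes first in each pair.
// A trailing odd byte, if any, is ignored.
std::u16string bytes_to_code_units(const std::string& bytes,
                                   bool little_endian) {
    std::u16string units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned char first = static_cast<unsigned char>(bytes[i]);
        unsigned char second = static_cast<unsigned char>(bytes[i + 1]);
        char16_t unit = little_endian
            ? static_cast<char16_t>(first | (second << 8))
            : static_cast<char16_t>((first << 8) | second);
        units += unit;
    }
    return units;
}
```

This also suggests why a plain ifstream shows garbage: for mostly-ASCII text stored as UTF-16, every other byte is a NUL.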

So with that explanation, here are the problems with your code:

std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}

Here's one way to rewrite the above:

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);

// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);

// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));

// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work

std::wstring line;
while (std::getline(file2, line)) {
  std::wcout << line << std::endl;
}
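
For reference, the reading part of the snippet above can be wrapped in a function with the headers it needs. This is a sketch under several assumptions: a standard library that ships <codecvt> (the header is deprecated since C++17), a file that starts with a BOM, and a 32-bit wchar_t as on Linux. The name read_utf16_lines and the stripping of a trailing '\r' (for files with Windows line endings) are additions for this example, not part of the answer.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical wrapper: reads a UTF-16 file (with BOM) into wide strings,
// one per line.
std::vector<std::wstring> read_utf16_lines(const std::string& path) {
    // UTF-16 must be read in binary mode.
    std::wifstream in(path.c_str(), std::ios::binary);

    // Imbue the file stream (not cout) so bytes are decoded as UTF-16;
    // consume_header skips the BOM and uses it to pick the byte order.
    in.imbue(std::locale(
        in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>));

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(in, line)) {
        // Assumption: drop the '\r' left over from "\r\n" line endings.
        if (!line.empty() && line.back() == L'\r') line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

The returned strings can then be printed through an imbued wcout as shown above, or converted to UTF-8 if narrow char data is what the rest of the program expects.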
