How to read utf-16 file into utf-8 std::string line by line


Problem description


I'm working with code that expects utf8-encoded std::string variables. I want to be able to handle a user-supplied file that potentially has utf-16 encoding (I don't know the encoding at design time, but eventually want to be able to deal with utf8/16/32), read it line-by-line, and forward each line to the rest of the code as a utf8-encoded std::string.

I have c++11 (really, the current MSVC subset of c++11) and boost 1.55.0 to work with. I'll need the code to work on both Linux and Windows variants eventually. For now, I'm just prototyping on Windows with Visual Studio 2013 Update 4, running on Windows 7. I'm open to additional dependencies, but they'd need to have an established cross-platform (meaning windows and *nix) track record, and shouldn't be GPL/LGPL.

I've been making assumptions that I don't seem to be able to find a way to validate, and I have code that is not working.

One assumption is that, since I ultimately want each line from these files in a std::string variable, I should be working with std::ifstream imbued with a properly-constructed codecvt such that the incoming utf16 stream can be converted to utf8.

Is this assumption realistic? The alternative, I thought, would be that I'd have to do some encoding checks on the text file, and then choose wifstream/wstring or ifstream/string based on the results, which seemed more unappealing than I'd like to start with. Of course, if that's the right (or the only realistic) path, I'm open to it.

I realize that I may likely need to do some encoding detection anyway, but for now, I am not so concerned about the encoding detection part, just focusing on getting utf16 file contents into utf8 std::string.

I have tried a variety of different combinations of locale and codecvt, none of which have worked. Below is the latest incarnation of what I thought might work, but doesn't:

void
SomeRandomClass::readUtf16LeFile( const std::string& theFileName )
{
    boost::locale::generator gen;
    std::ifstream file( theFileName );
    auto utf8Locale = gen.generate( "UTF-8" );
    std::locale cvtLocale( utf8Locale,
                           new std::codecvt_utf8_utf16<char>() );

    file.imbue( utf8Locale );
    std::string line;

    std::cout.imbue( utf8Locale );
    for ( int i = 0; i < 3; i++ )
    {
        std::getline( file, line );
        std::cout << line << std::endl;
    }
}

The behavior I see with this code is that the result of each call to getline() is an empty string, regardless of the file contents.

This same code works fine (meaning, each getline() call returns a correctly-encoded non-empty string) on a utf8-encoded version of the same file if I omit lines 3 and 5 of the above method.

For whatever reason, I could not find any examples anywhere here on SO or on http://en.cppreference.com/, or elsewhere in the wild, of anyone trying to do this same thing.

All ideas/suggestions (conformant to requirements above) welcome.

Solution

Reading UTF-16, writing UTF-8

The first question to clarify is which variant of UTF-16 you are reading:

  • is it UTF-16LE (e.g. as generated under Windows)?
  • is it UTF-16BE (as generated by a wstream by default)?
  • is it UTF-16 with a BOM?

The next question is whether you can really output UTF-8 or UTF-16 on the console at all, knowing that the default Windows console can cause real headaches there.

Step 1: Make sure the problem is not related to the Windows console

So here is a small piece of code that reads a UTF-16LE file and checks the content with a native Windows function (you just have to include <windows.h> in your console app):

    // needs <fstream>, <locale> and <codecvt>, plus using namespace std
    wifstream is16(filename);
    is16.imbue(locale(is16.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>()));
    wstring wtext, wline;
    for (int i = 0; i < 10 && getline(is16, wline); i++)
        wtext += wline + L"\n";
    MessageBoxW(NULL, wtext.c_str(), L"UTF16-Little Endian", MB_OK);

If your file is UTF-16 with a BOM, just replace little_endian with consume_header.

Step 2: Convert the utf16 string back into a utf8 string

You have to use a string converter:

    // needs <fstream>, <locale>, <codecvt>, <string> and <iostream>
    wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> converter;

    wifstream is16(filename);
    is16.imbue(locale(is16.getloc(), new codecvt_utf16<wchar_t, 0x10ffff, little_endian>()));
    wstring wline;
    string u8line;
    for (int i = 0; i < 10 && getline(is16, wline); i++) {
        u8line = converter.to_bytes(wline);
        cout << u8line << endl;
    }

This will show the ASCII characters correctly on the Windows console. However, all the multi-byte utf8 sequences will appear as garbage (unless you are more successful than I was at setting the console to display a Unicode font).
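The converter can also be exercised on its own, without a file or the console. A minimal sketch (the helper name `utf16_to_utf8` is mine, not from the question):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert a sequence of UTF-16 code units (one per wchar_t) into a
// UTF-8 encoded std::string; surrogate pairs are combined by the
// codecvt_utf8_utf16 facet.
std::string utf16_to_utf8(const std::wstring& w)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    return conv.to_bytes(w);
}
```

Note that wstring_convert and codecvt_utf8_utf16 were later deprecated in C++17, but they are part of the C++11 toolset the question is restricted to.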

Step 3: Check the utf8 encoding using a file

As the Windows console is pretty bad at this, the best thing is to write the strings you produced directly into a file, and open that file with a text editor (like Notepad++) that can show you the encoding.

Nota bene: all this was done using only the standard library (except for the intermediary MessageBoxW()) and its locales.

Further steps

If you want to detect the encoding, the first thing to do is check whether there is a BOM at the very beginning of your file (opened for binary input, with the default "C" locale):

unsigned char bom_utf8[]   { 0xEF, 0xBB, 0xBF };
unsigned char bom_utf16be[]{ 0xFE, 0xFF };
unsigned char bom_utf16le[]{ 0xFF, 0xFE };
unsigned char bom_utf32be[]{ 0, 0, 0xFE, 0xFF };
unsigned char bom_utf32le[]{ 0xFF, 0xFE, 0, 0 };

Just load the first few bytes, and compare with this data.

If you find one, you're done. If not, you'll have to iterate through the file's contents.
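The lookup can be sketched as a small helper (the `Bom` enum and the `detect_bom` name are mine). The order of the tests matters: a UTF-32LE BOM begins with the same two bytes (FF FE) as a UTF-16LE BOM, so the 4-byte patterns must be checked first:

```cpp
#include <cstddef>

enum class Bom { none, utf8, utf16be, utf16le, utf32be, utf32le };

// Compare the first bytes of a file against the known BOM patterns.
// The 4-byte UTF-32 patterns are tested before the 2-byte UTF-16
// ones, because FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
Bom detect_bom(const unsigned char* p, std::size_t n)
{
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Bom::utf32be;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Bom::utf32le;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Bom::utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Bom::utf16be;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Bom::utf16le;
    return Bom::none;
}
```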

A quick approximation, if you expect Western languages, is the following: if you find a lot of null bytes (more than 25% but less than 50%), it's probably utf16; if you find more than 50% nulls, it's probably utf32.
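That heuristic might look like this (the `Guess` enum, the `guess_by_null_ratio` name, and the exact thresholds are mine, tuned only for mostly-ASCII Western text):

```cpp
#include <cstddef>

enum class Guess { utf8, utf16, utf32, unknown };

// Rough heuristic for BOM-less files of mostly Western text: UTF-16
// encodes each Latin character with one null byte (~25-50% nulls
// overall), UTF-32 with three (>50% nulls); UTF-8 Western text has
// few or none.
Guess guess_by_null_ratio(const unsigned char* p, std::size_t n)
{
    if (n == 0) return Guess::unknown;
    std::size_t nulls = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == 0) ++nulls;
    double ratio = static_cast<double>(nulls) / static_cast<double>(n);
    if (ratio > 0.50) return Guess::utf32;
    if (ratio > 0.25) return Guess::utf16;
    return Guess::utf8;
}
```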

But a more precise approach can make sense. For instance, to verify that a file is UTF16, you just have to implement a small state machine that checks that any time a word has its high byte between 0xD8 and 0xDB (a high surrogate), the next word has its high byte between 0xDC and 0xDF (a low surrogate). Which byte is the high one depends, of course, on whether the file is little or big endian.
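Such a checker can be sketched as follows, assuming the 16-bit words have already been assembled in the correct byte order (the name `plausible_utf16` is mine):

```cpp
#include <cstddef>
#include <cstdint>

// Check that surrogates in a sequence of 16-bit words are correctly
// paired: every high surrogate (0xD800-0xDBFF) must be immediately
// followed by a low surrogate (0xDC00-0xDFFF), and no low surrogate
// may appear on its own.
bool plausible_utf16(const std::uint16_t* w, std::size_t n)
{
    bool expect_low = false;            // state: inside a surrogate pair?
    for (std::size_t i = 0; i < n; ++i) {
        bool is_high = w[i] >= 0xD800 && w[i] <= 0xDBFF;
        bool is_low  = w[i] >= 0xDC00 && w[i] <= 0xDFFF;
        if (expect_low) {
            if (!is_low) return false;  // lone high surrogate
            expect_low = false;
        } else {
            if (is_low) return false;   // lone low surrogate
            expect_low = is_high;
        }
    }
    return !expect_low;                 // must not end mid-pair
}
```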

For UTF8 it's a similar exercise, but the state machine is a little more complex, because the bit pattern of the first byte of a sequence defines how many continuation bytes must follow, and each continuation byte must match the bit pattern (c & 0xC0) == 0x80.
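A structural check along those lines (the name `plausible_utf8` is mine; over-long and out-of-range sequences are deliberately not rejected, to keep the state machine minimal):

```cpp
#include <cstddef>

// Structural UTF-8 check: the lead byte's bit pattern announces how
// many continuation bytes follow, and every continuation byte must
// match (c & 0xC0) == 0x80.
bool plausible_utf8(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        int follow;
        if      (p[i] < 0x80)           follow = 0;  // ASCII byte
        else if ((p[i] & 0xE0) == 0xC0) follow = 1;  // 2-byte sequence
        else if ((p[i] & 0xF0) == 0xE0) follow = 2;  // 3-byte sequence
        else if ((p[i] & 0xF8) == 0xF0) follow = 3;  // 4-byte sequence
        else return false;                           // stray continuation byte
        ++i;
        while (follow-- > 0) {
            if (i >= n || (p[i] & 0xC0) != 0x80)
                return false;                        // truncated or malformed
            ++i;
        }
    }
    return true;
}
```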
