Reading a UTF-16 file in C++


Question

I'm trying to read a file that is encoded as UTF-16LE with a BOM. I tried this code:

#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main() {

  std::wifstream fin("/home/asutp/test");
  fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
  if (!fin) {
    std::cout << "!fin" << std::endl;
    return 1;
  }
  if (fin.eof()) {
    std::cout << "fin.eof()" << std::endl;
    return 1;
  }
  std::wstring wstr;
  getline(fin, wstr);
  std::wcout << wstr << std::endl;

  if (wstr.find(L"Test") != std::string::npos) {
    std::cout << "Found" << std::endl;
  } else {
    std::cout << "Not found" << std::endl;
  }

  return 0;
}

The file can contain Latin and Cyrillic characters. I created the file with the string "Test тест". The code above prints:

/home/asutp/CLionProjects/untitled/cmake-build-debug/untitled

Not found

Process finished with exit code 0

I'm on Linux Mint 18.3 x64, using CLion 2018.1.

Tried with:

  • gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
  • clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
  • clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)

Recommended answer

Ideally you should save files in UTF-8, because Windows has much better UTF-8 support (aside from displaying Unicode in the console window), while POSIX has limited UTF-16 support. Even Microsoft products favor UTF-8 for saving files on Windows.
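To illustrate why UTF-8 is the easier path here, the following is a minimal sketch, assuming the file has been re-saved as UTF-8 without a BOM. The helper names `read_file_bytes` and `utf8_contains` are hypothetical and are not part of the original answer. Because UTF-8 is self-synchronizing, a plain byte-wise `std::string::find` locates a valid UTF-8 needle without any locale or `codecvt` setup.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: read a whole file as raw bytes.
// For UTF-8 input no decoding step is needed.
inline std::string read_file_bytes(const std::string& path) {
    std::ifstream fin(path, std::ios::binary);
    std::ostringstream ss;
    ss << fin.rdbuf();   // copies everything; yields "" if the file is missing or empty
    return ss.str();
}

// Hypothetical helper: byte-wise substring search over UTF-8 text.
// Assumes both strings are valid UTF-8.
inline bool utf8_contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

With these, a check such as `utf8_contains(read_file_bytes(path), "Test")` works the same way for Latin and Cyrillic needles, provided the source file itself is saved as UTF-8.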

As an alternative, you can read the UTF-16 file into a buffer and convert it to UTF-8:

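#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::ifstream fin("utf16.txt", std::ios::binary);
    fin.seekg(0, std::ios::end);
    size_t size = (size_t)fin.tellg();
    if (!fin || size < 2) return 1;

    // skip the 2-byte BOM (assumes the file starts with one)
    fin.seekg(2, std::ios::beg);
    size -= 2;

    // read the rest as UTF-16 code units (assumes a little-endian host)
    std::u16string u16(size / 2, u'\0');
    fin.read((char*)&u16[0], u16.size() * sizeof(char16_t));

    // <codecvt> and std::wstring_convert are deprecated since C++17
    std::string utf8 = std::wstring_convert<
        std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(u16);
    std::cout << utf8 << std::endl;
}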

Or:

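#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

int main() {
    std::ifstream fin("utf16.txt", std::ios::binary);

    // skip the 2-byte BOM (assumes the file starts with one)
    fin.seekg(2);

    // read the rest of the file as raw bytes
    std::stringstream ss;
    ss << fin.rdbuf();
    std::string bytes = ss.str();

    // make sure the length is divisible by 2
    size_t len = bytes.size();
    if (len % 2) len--;

    // combine each little-endian byte pair into one UTF-16 code unit
    std::wstring sw;
    for (size_t i = 0; i < len; i += 2) {
        int lo = bytes[i] & 0xFF;
        int hi = bytes[i + 1] & 0xFF;
        sw.push_back(static_cast<wchar_t>(hi << 8 | lo));
    }

    // treats each wchar_t as one UTF-16 code unit, whatever its width
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    std::string utf8 = convert.to_bytes(sw);
    std::cout << utf8 << std::endl;
}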
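
Both snippets skip the first two bytes unconditionally and rely on the deprecated `<codecvt>` facilities. As a hedged alternative, here is a minimal sketch that checks for the BOM instead of assuming it and decodes the bytes by hand, including surrogate pairs. All helper names (`append_utf8`, `read_unit`, `utf16_bytes_to_utf8`) are hypothetical, and the choices to default to little-endian when no BOM is present and to replace unpaired surrogates with U+FFFD are assumptions, not something the original answer specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// All names below are hypothetical helpers, not part of the original answer.

// Append one Unicode code point to `out` as UTF-8.
inline void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Read the 16-bit code unit at byte offset i, honoring the byte order.
inline std::uint32_t read_unit(const std::string& b, std::size_t i, bool big_endian) {
    const std::uint32_t first = static_cast<unsigned char>(b[i]);
    const std::uint32_t second = static_cast<unsigned char>(b[i + 1]);
    return big_endian ? (first << 8) | second : (second << 8) | first;
}

// Decode raw UTF-16 bytes to UTF-8. A leading BOM (FF FE or FE FF) selects the
// byte order and is dropped; without a BOM, little-endian is assumed.
// Unpaired surrogates become U+FFFD; a trailing odd byte is ignored.
inline std::string utf16_bytes_to_utf8(const std::string& bytes) {
    std::size_t i = 0;
    bool big_endian = false;
    if (bytes.size() >= 2) {
        const std::uint32_t bom = read_unit(bytes, 0, false);
        if (bom == 0xFEFF) {            // bytes FF FE: little-endian
            i = 2;
        } else if (bom == 0xFFFE) {     // bytes FE FF: big-endian
            i = 2;
            big_endian = true;
        }
    }
    std::string out;
    for (; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = read_unit(bytes, i, big_endian);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            const std::uint32_t low = read_unit(bytes, i + 2, big_endian);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;  // the low surrogate has been consumed as well
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;  // unpaired surrogate
        append_utf8(out, cp);
    }
    return out;
}
```

The input would be the raw bytes of the file, read in binary mode as in the second snippet but without the `seekg(2)` call. The returned `std::string` can then be searched with `find("Test")` or written to `std::cout` directly. Note that the byte sequence FF FE 00 00 is also the UTF-32LE BOM; this sketch does not try to tell the two apart.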
