C++: how to read from Unicode files while ignoring the first character of each line


Question

Consider a file containing Unicode words as follows:

آب
آباد
آبادان

Reading right to left, the first character of each line is "آ".

My first requirement is to read the file line by line. This is simple.

The second requirement is to read each line starting from its second character. The result must look like this:

ب
باد
بادان

As you know, there are solutions such as std::string::substr for the second requirement, but AFAIK std::string::substr does not work well with Unicode characters, because it counts char units (bytes in UTF-8) rather than characters.

I need something like this:

std::ifstream inFile(file_name);
//Solution for first requirement
std::string line;
if (!std::getline(inFile, line)) {
   std::cout << "failed to read file " << file_name << std::endl;
   inFile.close();
   break;
}
line.erase(line.find_last_not_of("\n\r") + 1);

std::string line2;
//what should be here to meet my second requirement?
//stay on current line      
//ignore first character and std::getline(inFile, line2)) 
line2.erase(line2.find_last_not_of("\n\r") + 1);

std::cout << "Line= " << line << std::endl;   // should print آب
std::cout << "Line2= " << line2 << std::endl; // should print ب

inFile.close();


Recommended answer

C++11 has Unicode conversion routines, but they are not very user-friendly. You can wrap them in friendlier functions like this:

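#include <codecvt>   // std::codecvt_utf8
#include <iostream>
#include <locale>    // std::wstring_convert
#include <stdexcept>
#include <string>

// These functions convert between UTF-8 and the platform's wide-character
// encoding (UTF-32 on Linux, UCS-2 on Windows).
// Note: std::wstring_convert and <codecvt> are deprecated since C++17.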
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

std::string remove_first_char(std::string const& utf8)
{
    std::wstring ws = utf8_to_ws(utf8);
    ws = ws.substr(1);
    return ws_to_utf8(ws);
}

int main()
{
    std::string utf8 = u8"آبادان";

    std::cout << remove_first_char(utf8) << '\n';
}

Output:

بادان

By converting to a fixed-width code-point encoding (UCS-2/UTF-32), you can process the string using the normal string functions. There is a caveat, though: UCS-2 does not cover all characters of all languages, so you may have to use std::u32string and write a conversion function between UTF-8 and UTF-32.

This answer has an example: https://stackoverflow.com/a/43302460/3807729
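
For illustration, here is a hand-written sketch of such a UTF-8/UTF-32 conversion pair. It is not the code from the linked answer, and the function names are hypothetical. It rejects truncated or malformed sequences but, to stay short, does not check for overlong encodings or surrogate code points, so treat it as a starting point rather than a complete validator.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32 code points (sketch; see caveats above).
std::u32string utf8_to_utf32(const std::string& utf8)
{
    std::u32string out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (extra > utf8.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Encodes UTF-32 code points as UTF-8.
std::string utf32_to_utf8(const std::u32string& s)
{
    std::string out;
    for (char32_t cp : s) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::runtime_error("code point out of range");
        }
    }
    return out;
}

// Removes the first code point by going through UTF-32.
std::string remove_first_code_point(const std::string& utf8)
{
    std::u32string s = utf8_to_utf32(utf8);
    return utf32_to_utf8(s.empty() ? s : s.substr(1));
}
```

Because every element of a std::u32string is one full code point on every platform, substr(1) here behaves the same on Windows and Linux, which is not the case for std::wstring.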
