How to read non-ASCII lines from file with std::ifstream on Linux?

Question

I am trying to read a plain text file. I need to read it line by line and process each line. I know C++ has wide-character facilities (the w-prefixed streams and strings) for reading wchar_t data, so I tried the following:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::wfstream file("file");       // the file contains: aaaàaaa
    std::wstring str;
    std::getline(file, str);
    std::wcout << str << std::endl;   // prints only: aaa
}

But as you can see, it did not read the full line. It stops when it reads "à", which is non-ASCII. How can I fix it?

Recommended answer

You will need to understand some basic concepts of encodings. I recommend reading this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Basically, you can't assume that every byte is a letter or that every letter fits in a char. Also, the system must know how to extract letters from the sequence of bytes stored in the file.
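To make "extracting letters from a sequence of bytes" concrete, here is a minimal sketch of a UTF-8 decoder, assuming the input is UTF-8 (the encoding the rest of this answer assumes). The name decode_utf8 is hypothetical and not part of any library. It checks only the basic byte structure and does not reject overlong forms or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns a UTF-8 byte string into Unicode code points.
// Only the lead/continuation byte structure is checked; overlong encodings
// and surrogates are not rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)  // continuation bytes look like 10xxxxxx
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this sketch, the three bytes of "ol" followed by the two bytes 195 161 decode to three code points, the last one being U+00E1 (á).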

Let's assume your file is encoded in UTF-8, which is likely given that you are on Linux. I'll assume your terminal also supports it. If you read directly into a std::string, using plain chars, everything appears to work. Look:

// olá
#include <iostream>
#include <fstream>
#include <string>
int main() {
    std::fstream file("test.cpp");   // this program reads its own source file
    std::string str;
    std::getline(file, str);         // reads the first line: "// olá"
    std::cout << str << std::endl;
}

The output is what you expect, but this is not really correct. Look at what is going on: the file is encoded in UTF-8, which means the first line is this byte sequence:

/   /   (space)  o    l    á
47  47  32       111  108  195 161

Note that á is encoded as two bytes. If you ask for the size of the string (str.size()), you get 7, which is the number of bytes, not the 6 characters you see. This happens because std::string treats every byte as one char. When you send it to std::cout, the bytes are passed unchanged to the terminal. And here is the magical part: the terminal works with UTF-8 by default, so it decodes the bytes as UTF-8 and correctly displays 6 characters.
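The difference between bytes and characters can be checked with a small sketch. The helper name count_code_points is hypothetical, and it assumes the string already holds valid UTF-8. It counts code points, which is not always the same as what a user perceives as one character (combining marks, for example).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Every code point has exactly one byte that is NOT a continuation byte
// (continuation bytes look like 10xxxxxx), so counting those is enough.
// Assumes valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the line "// olá", str.size() reports 7 bytes while count_code_points(str) reports 6.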

So it appears to work, but it is not really right. Any string operation that works on bytes (indexing, substr, reversing, and so on) can split a multi-byte sequence, break the UTF-8 encoding, and leave you with data that no longer prints correctly!
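As an illustration of how a byte-based operation can break the encoding, consider cutting "// olá" to 6 bytes with substr(0, 6): the result ends with the lone byte 195, which is half of á. The sketch below shows one way to cut on a code point boundary instead. The name truncate_utf8 is hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns at most max_bytes bytes of a UTF-8 string
// without splitting a multi-byte sequence. Assumes valid UTF-8 input.
std::string truncate_utf8(const std::string& utf8, std::size_t max_bytes) {
    if (utf8.size() <= max_bytes) {
        return utf8;
    }
    std::size_t end = max_bytes;
    // If the first byte we would drop is a continuation byte (10xxxxxx),
    // the cut is in the middle of a character: move it back to the lead byte.
    while (end > 0 &&
           (static_cast<unsigned char>(utf8[end]) & 0xC0) == 0x80) {
        --end;
    }
    return utf8.substr(0, end);
}
```

With a 6-byte limit this returns "// ol" (5 bytes) rather than a string that ends in a dangling lead byte.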

Let's go for wstrings. They store each letter in a wchar_t, which on Linux has 4 bytes. This is enough to hold any possible Unicode code point. But it will not work directly, because C++ by default uses the "C" locale. A locale is a specification of how to deal with various aspects of the system, like "how to print a date", "how to format a currency value", or even "how to decode text". The last one is the important one here, and the default "C" locale says: "Assume everything is ASCII". When the stream is reading the file and tries to decode a non-ASCII byte, the read simply stops without any visible error message, which is why the program in the question printed only aaa.

The correction is simple: use a UTF-8 locale. Look:

// olá
#include <iostream>
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::ios::sync_with_stdio(false); // must come before any other standard stream I/O

    // Throws std::runtime_error if this locale is not installed.
    std::locale loc("en_US.UTF-8"); // You can also use "" for the default system locale
    std::wcout.imbue(loc); // Use it for output

    std::wfstream file("test.cpp");
    file.imbue(loc); // Use it for file input
    std::wstring str;
    std::getline(file, str); // str.size() will be 6
    std::wcout << str << std::endl;
}

You may be asking what std::ios::sync_with_stdio(false); means. By default, the standard C++ streams are kept in sync with the C streams. This is good because it lets you use both cout and printf in the same program. We have to disable it here because, while synchronized, the output goes through the C streams, which on common implementations such as libstdc++ do not use the locale imbued on std::wcout. The result is that the UTF-8 encoding breaks and garbage is produced on the output. Call it before any other I/O on the standard streams.
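One practical caveat with the program above: constructing std::locale from a name throws std::runtime_error when that locale is not installed on the machine. A minimal sketch of a fallback, assuming you are happy to try several names in order; pick_locale is a hypothetical helper, and which locale names exist depends on the system.

```cpp
#include <initializer_list>
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns the first locale from `names` that can be
// constructed on this system, or the classic "C" locale if none can.
std::locale pick_locale(std::initializer_list<const char*> names) {
    for (const char* name : names) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // This locale is not available here; try the next candidate.
        }
    }
    return std::locale::classic();
}
```

A call such as pick_locale({"en_US.UTF-8", "C.UTF-8", ""}) could then replace the hard-coded name. If the result is the classic "C" locale, non-ASCII input will still fail to decode, so it is worth checking loc.name() and reporting the problem.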
