How to read a UTF-8 encoded file containing Chinese characters and output them correctly on the console?
Question
I am writing a web crawler to fetch some Chinese web pages. The fetched files are encoded in UTF-8, and I need to read them to do some parsing, such as extracting URLs and Chinese characters. However, when I read a file into a std::string variable and output it to the console, the Chinese characters turn into garbage. I also applied boost::regex to the std::string variable: it extracts all the URLs, but not the Chinese characters.
How can I solve these problems?
P.S. My .cpp files are encoded as ANSI by default, and the operating system is Windows 8 (Chinese language version).
Recommended answer
This code may help (it was compiled with VC++ 2010). I tested it with a UTF-8 file containing non-Latin characters and it seems to work, but I don't know whether it will work correctly with Chinese characters. For more information, see the documentation for _setmode and codecvt_utf8.
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>
#include <fcntl.h>  // _O_U16TEXT
#include <io.h>     // _setmode, _fileno (via <stdio.h>)
#include <stdio.h>
using namespace std; // Sorry for this!

// Reads a UTF-8 file line by line and prints each line with its number.
void read_all_lines(const wchar_t *filename)
{
    wifstream wifs;
    wstring txtline;
    int c = 0;

    wifs.open(filename); // wchar_t overload is a Microsoft extension
    if (!wifs.is_open())
    {
        wcerr << L"Unable to open file" << endl;
        return;
    }

    // We are going to read a UTF-8 file; consume_header skips a leading BOM
    wifs.imbue(locale(wifs.getloc(), new codecvt_utf8<wchar_t, 0x10ffff, consume_header>()));

    while (getline(wifs, txtline))
        wcout << ++c << L'\t' << txtline << L'\n';
    wcout << endl;
}

// wmain is used instead of _tmain so that argv is always wchar_t,
// matching read_all_lines regardless of the project's character set.
int wmain(int argc, wchar_t *argv[])
{
    // Console output will be UTF-16 characters
    _setmode(_fileno(stdout), _O_U16TEXT);

    if (argc < 2)
    {
        wcerr << L"Filename expected!" << endl;
        return 1;
    }

    read_all_lines(argv[1]);
    return 0;
}
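To make the conversion step less opaque, here is a minimal, portable sketch of roughly what a UTF-8 decoding facet with consume_header does: it skips an optional byte order mark and turns the byte sequences into code points. The function name decode_utf8 is hypothetical and not part of the answer above or of any library, and the sketch does not reject overlong encodings or surrogates, so treat it as an illustration rather than a replacement for codecvt_utf8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration): decodes UTF-8 bytes
// into Unicode code points, skipping a leading BOM if present.
// Malformed or truncated input is replaced with U+FFFD.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;

    // Skip the UTF-8 byte order mark (EF BB BF), as consume_header would.
    if (bytes.size() >= 3 &&
        static_cast<unsigned char>(bytes[0]) == 0xEF &&
        static_cast<unsigned char>(bytes[1]) == 0xBB &&
        static_cast<unsigned char>(bytes[2]) == 0xBF)
    {
        i = 3;
    }

    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; } // stray byte

        // A sequence cut off by the end of the input ends decoding.
        if (i + extra >= bytes.size()) { out.push_back(0xFFFD); break; }

        bool valid = true;
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { valid = false; break; }
            cp = (cp << 6) | (c & 0x3F); // append 6 payload bits
        }

        if (!valid) { out.push_back(0xFFFD); ++i; continue; }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Under these assumptions, decoding the bytes EF BB BF E4 B8 AD E6 96 87 drops the BOM and yields the two code points U+4E2D and U+6587.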
If Chinese characters don't look as expected, make sure the console is using a font that contains the required glyphs (i.e. a TrueType font that covers Chinese, not a bitmap/raster font).
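The question also asks about extracting the Chinese characters themselves from a std::string. One possible approach, sketched here as an assumption rather than as something the answer above tested, is to scan the UTF-8 bytes directly: code points U+4E00 to U+9FFF (the basic CJK Unified Ideographs block) are always encoded as three bytes from E4 B8 80 to E9 BF BF. The function name extract_cjk_runs is hypothetical, and the range deliberately ignores CJK extension blocks and punctuation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration): returns every maximal
// run of CJK Unified Ideographs (U+4E00..U+9FFF) found in UTF-8 text.
// Each run is returned as a UTF-8 encoded substring.
std::vector<std::string> extract_cjk_runs(const std::string& utf8)
{
    std::vector<std::string> runs;
    std::string current;
    std::size_t i = 0;

    while (i < utf8.size())
    {
        bool is_cjk = false;
        if (i + 2 < utf8.size())
        {
            const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
            const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(utf8[i + 2]);
            const bool continuation_ok =
                (b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80;
            // U+4E00..U+9FFF encodes as E4 B8 80 .. E9 BF BF.
            is_cjk = continuation_ok && b0 >= 0xE4 && b0 <= 0xE9 &&
                     (b0 != 0xE4 || b1 >= 0xB8);
        }

        if (is_cjk)
        {
            current.append(utf8, i, 3); // keep the whole 3-byte character
            i += 3;
        }
        else
        {
            if (!current.empty())
            {
                runs.push_back(current);
                current.clear();
            }
            ++i; // continuation bytes never look like a lead byte, so this is safe
        }
    }

    if (!current.empty())
        runs.push_back(current);
    return runs;
}
```

Because the function works on raw bytes, it can be applied to the same std::string that the URL regex already handles; printing the results on a Windows console still requires the wide-character output approach shown in the answer.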