Problem using getline with a Unicode file

Question

UPDATE: Thank you to @Potatoswatter and @Jonathan Leffler for the comments. Rather embarrassingly, I was caught out by the debugger tooltip not showing the value of a wstring correctly. However, it still isn't quite working for me, and I have updated the question below:

If I have a small multibyte file that I want to read into a string, I use the following trick: I call getline with a delimiter of '\0', e.g.

std::string contents_utf8;
std::ifstream inf1("utf8.txt");
getline(inf1, contents_utf8, '\0');

This reads in the entire file, including newlines.
However, if I try to do the same thing with a wide-character file, it doesn't work: my wstring only reads up to the end of the first line.

std::wstring contents_wide;
std::wifstream inf2(L"ucs2-be.txt");
getline( inf2, contents_wide, wchar_t(0) ); //doesn't work

For example, if my Unicode file contains the characters A and B separated by CRLF, the hex looks like this:

FE FF 00 41 00 0D 00 0A 00 42

Because getline with '\0' reads the entire multibyte file, I believed that getline( inf2, contents_wide, wchar_t(0) ) should read in the entire Unicode file. However, it doesn't: with the example above, my wide string would contain the following two wchar_t values: FF FF

(If I remove the wchar_t(0), it reads in the first line as expected, i.e. FE FF 00 41 00 0D 00.)

Why doesn't wchar_t(0) work as a delimiting wchar_t, so that getline stops on 00 00 (or reads to the end of the file, which is what I want)?
Thank you

Recommended answer

Your UCS-2 decoder is misbehaving. The result of getline( inf2, contents_wide ) on FE FF 00 41 00 0D 00 0A 00 42 should be 0041 0000 = L"A". Assuming you're on Windows, the line ending should be properly converted, and the byte-order mark shouldn't appear in the output.
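To make the expected decoding concrete, here is a minimal sketch that decodes big-endian UCS-2 bytes by hand, without relying on any locale. The helper name decode_ucs2_be is hypothetical and not part of the original post, and this is a swapped-in manual technique rather than what a wifstream does internally. It assumes the input is plain UCS-2 (no surrogate handling) and silently ignores a trailing odd byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an illustration, not from the original post):
// decodes big-endian UCS-2 bytes into 16-bit code units and drops a
// leading FE FF byte-order mark if one is present.
std::u16string decode_ucs2_be(const std::string& bytes) {
    std::u16string out;
    std::size_t i = 0;

    // Skip the big-endian BOM so it does not end up in the output.
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFE &&
        static_cast<unsigned char>(bytes[1]) == 0xFF) {
        i = 2;
    }

    // Combine each pair of bytes, high byte first. A trailing odd byte
    // is ignored (an assumption made to keep the sketch short).
    for (; i + 1 < bytes.size(); i += 2) {
        const auto hi = static_cast<unsigned char>(bytes[i]);
        const auto lo = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Applied to the ten bytes from the question, this yields the four code units 0041 000D 000A 0042. A correctly configured wide stream reading its first line would stop at 000A, and in text mode on Windows the 000D would be folded away as well, leaving just "A".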

I suggest double-checking your OS documentation for how to set the locale.

EDIT: Did you set the locale?

locale::global( locale( "something if your system supports UCS-2" ) );

locale::global( encoding_support::ucs2_bigendian_encoding );

where encoding_support is some library.
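As one possible stand-in for such a library, the standard header <codecvt> provides std::codecvt_utf16, which can be imbued on a single stream instead of changing the global locale. The sketch below is an illustration under assumptions, not the answerer's own method: the helper name read_utf16_file is hypothetical, <codecvt> has been deprecated since C++17 (though it is still widely shipped), and big-endian byte order is the facet's default when the file has no byte-order mark. On Windows, where wchar_t is 16 bits wide, the result is effectively UCS-2.

```cpp
#include <codecvt>   // std::codecvt_utf16 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original post): reads a whole
// UTF-16/UCS-2 file into a wide string.
std::wstring read_utf16_file(const char* path) {
    std::wifstream in;

    // Imbue before opening so the facet is in place for the first read.
    // consume_header makes the facet read and skip the BOM.
    // The locale takes ownership of the facet and deletes it.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

    // Binary mode stops the stream from rewriting CR LF at the byte
    // level, which would split two-byte code units.
    in.open(path, std::ios::binary);

    // Copy the whole stream buffer, so no delimiter trick is needed.
    std::wstringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

With the ten-byte example file from the question, read_utf16_file("ucs2-be.txt") should return the four characters A, CR, LF, B with no byte-order mark in front. Because the file is opened in binary mode, the CR is preserved rather than folded into the line ending.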
