Problem using getline with a Unicode file
Problem description
UPDATE: Thank you to @Potatoswatter and @Jonathan Leffler for the comments. Rather embarrassingly, I was caught out by the debugger tooltip not showing the value of a wstring correctly. However, it still isn't quite working for me, and I have updated the question below:
If I have a small multibyte file that I want to read into a string, I use the following trick: I call getline with a delimiter of '\0', e.g.
std::string contents_utf8;
std::ifstream inf1("utf8.txt");
getline(inf1, contents_utf8, '\0');
This reads in the entire file, including newlines.
However, if I try to do the same thing with a wide-character file, it doesn't work: my wstring only reads to the first line.
std::wstring contents_wide;
std::wifstream inf2(L"ucs2-be.txt");
getline( inf2, contents_wide, wchar_t(0) ); //doesn't work
For example, if my Unicode file contains the characters A and B separated by CRLF, the hex looks like this:
FE FF 00 41 00 0D 00 0A 00 42
Since getline with '\0' reads the entire file when the file is multibyte, I believed that getline( inf2, contents_wide, wchar_t(0) ) should read in the entire Unicode file. However, it doesn't: with the example above, my wide string contains the following two wchar_ts: FF FF
(If I remove the wchar_t(0), it reads in the first line as expected, i.e. FE FF 00 41 00 0D 00.)
Why doesn't wchar_t(0) work as a delimiting wchar_t, so that getline stops on 00 00 (or reads to the end of the file, which is what I want)?
Thank you
Recommended answer
Your UCS-2 decoder is misbehaving. The result of getline( inf2, contents_wide ) on FE FF 00 41 00 0D 00 0A 00 42 should be 0041 0000 = L"A". Assuming you're on Windows, the line ending should be properly converted, and the byte-order mark shouldn't appear in the output.
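To make the expected decoding concrete, here is a minimal sketch that decodes the raw bytes by hand, without any locale machinery. The function name decode_ucs2_be is hypothetical and not part of the question or any library; the sketch assumes big-endian input, skips a leading FE FF byte-order mark, and silently ignores a trailing odd byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode big-endian UCS-2/UTF-16 bytes into 16-bit
// code units. A leading FE FF byte-order mark is skipped, not emitted.
std::u16string decode_ucs2_be(const std::string& bytes) {
    std::size_t i = 0;
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFE &&
        static_cast<unsigned char>(bytes[1]) == 0xFF) {
        i = 2;  // skip the byte-order mark
    }
    std::u16string out;
    // Combine each pair of bytes, high byte first. A trailing odd byte
    // (malformed input) is ignored in this sketch.
    for (; i + 1 < bytes.size(); i += 2) {
        const auto hi = static_cast<unsigned char>(bytes[i]);
        const auto lo = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Applied to the ten bytes from the question, this yields the code units 0041 000D 000A 0042, so the first line, up to the line ending, is just 0041.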
I suggest double-checking your OS documentation for how to set the locale.
EDIT: Did you set the locale?
locale::global( locale( "something if your system supports UCS-2" ) );
or
locale::global( encoding_support::ucs2_bigendian_encoding );
where encoding_support is some library.
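As one possible alternative to changing the global locale, the conversion facet can be imbued on the stream itself. The sketch below uses the standard std::codecvt_utf16 facet from <codecvt>, which is not the answerer's encoding_support library; note that it is deprecated since C++17 and removed in C++26, so treat this as an assumption about your toolchain. The function name read_utf16_file is hypothetical. The file is opened in binary mode, so CRLF is not translated.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: read a whole UTF-16 file into a std::wstring.
// std::consume_header makes the facet read and drop a leading BOM;
// without a BOM, std::codecvt_utf16 defaults to big-endian.
std::wstring read_utf16_file(const std::string& path) {
    std::wifstream in;
    // Imbue before opening so the conversion applies from the first byte.
    // The locale takes ownership of the facet allocated with new.
    in.imbue(std::locale(
        in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    in.open(path, std::ios::binary);
    // Read to end of file; this does not depend on any delimiter.
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

Reading through istreambuf_iterator rather than getline with wchar_t(0) avoids relying on the file containing no U+0000 characters.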