Correctly reading a utf-16 text file into a string without external libraries?


Question

I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:

I have a pretty standard utf-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read, it will always be UTF16, it has a BOM and everything, and it can stay that way.

I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution, just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy with a solution that used the stl while relying on assumptions about Windows architecture, or even solutions that involved win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?

edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.

edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them ':' (FULLWIDTH COLON, U+FF1A) and ')' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters?

edit 3 (and the answer!): the original code I had been using did, in fact, mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.

Answer

When you open a file for UTF-16, you must open it in binary mode. This is because in text mode, certain characters are interpreted specially - specifically, 0x0d is filtered out completely and 0x1a marks the end of the file. Some UTF-16 characters have one of those bytes as half of their character code, which will mess up the reading of the file. This is not a bug; it is intentional behavior, and it is the sole reason for having separate text and binary modes.
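To illustrate, here is a minimal sketch of reading a UTF-16LE file through a binary-mode ifstream. It assumes the file starts with a 0xFF 0xFE BOM (as the question states) and a 16-bit code unit fits in wchar_t, as on Windows; the function name and the bare-bones error handling are illustrative, not from the original answer. Note that FULLWIDTH COLON (U+FF1A), one of the characters the questioner hit, encodes little-endian as the bytes 0x1A 0xFF - exactly the 0x1A byte that text mode treats as end-of-file.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Minimal sketch (hypothetical helper, not the answerer's code):
// read a UTF-16LE file with a BOM into a std::wstring.
std::wstring read_utf16le_file(const char* path)
{
    // Binary mode: no 0x0d stripping, no 0x1a end-of-file handling.
    std::ifstream file(path, std::ios::binary);
    std::vector<char> bytes((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    // Skip the 0xFF 0xFE BOM if present.
    size_t start = (bytes.size() >= 2 &&
                    (unsigned char)bytes[0] == 0xFF &&
                    (unsigned char)bytes[1] == 0xFE) ? 2 : 0;

    // Pair up bytes little-endian into 16-bit code units.
    std::wstring text;
    for (size_t i = start; i + 1 < bytes.size(); i += 2) {
        wchar_t ch = (wchar_t)((unsigned char)bytes[i] |
                               ((unsigned char)bytes[i + 1] << 8));
        text += ch;
    }
    return text;
}
```

Had the stream been opened in text mode instead, the 0x1A low byte of U+FF1A would have ended the read early, reproducing the questioner's symptom.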

For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.

