Correctly reading a utf-16 text file into a string without external libraries?


Problem Description


I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:

I have a pretty standard utf-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read, it will always be UTF16, it has a BOM and everything, and it can stay that way.

I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy with a solution that used the STL while relying on assumptions about Windows architecture, or even solutions that involved win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?

edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.

edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them '：' (FULLWIDTH COLON, U+FF1A) and '）' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters?

edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.

Solution

When you open a file for UTF-16, you must open it in binary mode. This is because in text mode, certain characters are interpreted specially - specifically, 0x0d is filtered out completely and 0x1a marks the end of the file. Some UTF-16 characters will have one of those bytes as half of their character code - for example, FULLWIDTH COLON (U+FF1A) is encoded in little-endian UTF-16 as the bytes 0x1A 0xFF, which is exactly why it stopped the read - and will mess up the reading of the file. This is not a bug, it is intentional behavior, and is the sole reason for having separate text and binary modes.

For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
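The approach can be sketched as follows. This is a minimal illustration, not the exact code from the linked answer: it assumes a UTF-16LE file with an optional BOM, reads the raw bytes in binary mode, and assembles them into 16-bit code units (surrogate pairs are passed through unchanged as two code units):

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Minimal sketch: read a UTF-16LE file into a std::wstring.
// The file MUST be opened in binary mode; in text mode on Windows,
// a 0x1A byte (e.g. the low byte of U+FF1A) would end the read early.
std::wstring read_utf16le_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open file");

    // Slurp the raw bytes without any text-mode translation.
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    std::size_t i = 0;
    // Skip a UTF-16LE BOM (0xFF 0xFE) if present.
    if (bytes.size() >= 2 &&
        (unsigned char)bytes[0] == 0xFF &&
        (unsigned char)bytes[1] == 0xFE)
        i = 2;

    // Combine each little-endian byte pair into one 16-bit code unit.
    std::wstring result;
    for (; i + 1 < bytes.size(); i += 2) {
        wchar_t ch = (wchar_t)((unsigned char)bytes[i] |
                               ((unsigned char)bytes[i + 1] << 8));
        result.push_back(ch);
    }
    return result;
}
```

On Windows, opening the same file without std::ios::binary would stop reading at the 0x1A byte inside the fullwidth colon - exactly the symptom described in the question.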
