Reading a Unicode file saving only the characters.


Problem Description


I'm trying to read a Unicode file; it adds the characters to the CString szData but does not add the carriage return/linefeed pair.

BYTE buffer[3];
BYTE* pBuf = buffer;
BOOL bRead = TRUE;
CString szData;
FILE *fp = NULL;
_wfopen_s(&fp, (LPCTSTR)szFileName, _T("rb"));
while(bRead)
{
    bRead = fread(pBuf, sizeof(BYTE), 2, fp);
    if(*pBuf == '\n')
    {
        szData += "\r\n";
    }
    else
        szData += *pBuf;
}
fclose(fp);
szaText.Add(szData);

Recommended Answer

I think you are on the confused side about Unicode, like many other programmers. The Unicode character set is nothing more than a table consisting of about 1 million characters. Since the range of a char is just 0..255 and the range of a wchar_t is 0..65535, it's obvious that you can store a Unicode character neither in a char nor in a wchar_t. You need at least 32 bits to have the range to actually encode one Unicode character (codepoint) with one integer. For this reason, if you want to use one integer to store any Unicode character, you have to use UTF-32, which is an encoding that uses no tricks: in UTF-32, one uint32 is one index into the Unicode table. Period.

However, in practice UTF-32 is rarely used, because it is memory intensive and wastes a lot of space, especially with languages that use a lot of ASCII characters. Because of this, UTF-8 and UTF-16 are more widespread than UTF-32, but in UTF-8 and UTF-16 one integer (a uint8 or uint16) alone isn't necessarily an index into the Unicode table. For example, in UTF-8 any byte bigger than 127 means that this byte and the next few bytes (4 bytes at most) together store the bits that form an index into the large Unicode table (http://en.wikipedia.org/wiki/UTF-8). In UTF-16 it is likewise possible that two wchar_ts together form one index (high and low surrogate pairs: 0xD800-0xDFFF, https://en.wikipedia.org/wiki/UTF-16).

For this reason some operations on UTF-8 and UTF-16 encoded strings don't do what you might expect. For example, strlen() and wcslen() return the number of chars or wchar_ts in the string instead of the actual number of Unicode characters (which can be less than the number of chars or wchar_ts because of the tricks I mentioned). Indexing the n-th Unicode character in the string is also not a simple array lookup. However, in many cases these operations are not required, and some other operations, for example concatenation, work fine with these UTF encodings as well.
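To make the unit counting concrete, here is a small sketch (my own example, not from the original answer) showing one BMP character and one character outside the BMP in all three encodings; the byte values follow the UTF-8 and UTF-16 rules linked above:

#include <cstdint>
#include <cstdio>

int main()
{
    // U+20AC (euro sign): one unit in utf-16, three bytes in utf-8.
    const uint32_t euroUtf32   = 0x000020AC;           // utf-32: always 1 uint32
    const uint16_t euroUtf16[] = { 0x20AC };           // utf-16: 1 unit here
    const uint8_t  euroUtf8[]  = { 0xE2, 0x82, 0xAC }; // utf-8: 3 bytes

    // U+1F600 (outside the BMP): a surrogate pair in utf-16 and 4 bytes in
    // utf-8, so "number of units" is not "number of characters".
    const uint32_t smileyUtf32   = 0x0001F600;
    const uint16_t smileyUtf16[] = { 0xD83D, 0xDE00 };         // high + low surrogate
    const uint8_t  smileyUtf8[]  = { 0xF0, 0x9F, 0x98, 0x80 };

    printf("U+20AC : utf-16 units=%u, utf-8 bytes=%u\n",
           (unsigned)(sizeof(euroUtf16) / sizeof(euroUtf16[0])), (unsigned)sizeof(euroUtf8));
    printf("U+1F600: utf-16 units=%u, utf-8 bytes=%u\n",
           (unsigned)(sizeof(smileyUtf16) / sizeof(smileyUtf16[0])), (unsigned)sizeof(smileyUtf8));
    (void)euroUtf32; (void)smileyUtf32;
    return 0;
}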

Often you are not really interested in the encoding of the string or the Unicode characters in it, so you can handle the string as one big bunch of binary data. In fact, many programs just load strings from some localization database/file and use them to display text on the screen. Only the text renderer/drawer method has to be able to decode the UTF-encoded binary data (the string) into a sequence of Unicode characters, and the text drawer needs just a simple iterator that retrieves the Unicode characters from the UTF data from left to right. That can be done efficiently with both UTF-8 and UTF-16, and you don't even have to care about it if you are using, for example, Windows' DrawText().

Of course you may want to "procedurally" generate strings in the program, but that is an easier task. Many operations allow you to treat the string as a sequence of chars or wchar_ts, which makes your work easier. For example, if you are searching for the next newline in a UTF-8 string, you can simply process the string as a sequence of chars, because all bytes of a multi-byte UTF-8 character sequence are bigger than 127, so you can safely search for the next chr(10) without actually interpreting the Unicode characters (the special multi-byte and multi-wchar_t UTF-8/16 stuff) in the encoded string. The same is true for all ASCII characters (<128); this comes in handy, for example, with an XML parser, whose special characters (<>&") are all ASCII.
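As a minimal sketch of that newline search (the function name and the CRLF handling are my own choices), splitting a UTF-8 buffer into lines byte by byte is safe precisely because every byte of a multi-byte sequence is >= 0x80:

#include <string>
#include <vector>

// Splits utf-8 data into lines by scanning raw bytes. A 0x0A byte can
// only ever be a real newline, never part of a multi-byte character.
std::vector<std::string> SplitUtf8Lines(const std::string& utf8)
{
    std::vector<std::string> lines;
    size_t start = 0;
    for (size_t i = 0; i < utf8.size(); ++i)
    {
        if (utf8[i] == '\n')
        {
            size_t end = i;
            if (end > start && utf8[end - 1] == '\r')
                --end;                  // tolerate CRLF line endings
            lines.push_back(utf8.substr(start, end - start));
            start = i + 1;
        }
    }
    if (start < utf8.size())
        lines.push_back(utf8.substr(start));  // final line with no trailing '\n'
    return lines;
}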

UTF-16 or UTF-8? You can hide this as an implementation detail in your own string class and easily change it later, or you can make it platform dependent. On Linux UTF-8 is the way to go, but you can use UTF-8 even on Windows to store data in memory and convert to UTF-16 on the fly when you call a Windows function that requires a UTF-16 string. Many people make the mistake of calling ANSI Windows functions with UTF-8 data. You know: almost every Windows function that receives a string parameter has 3 names, e.g. DrawTextA(), DrawTextW(), and DrawText(), which is just a macro defined to either DrawTextA or DrawTextW. On WinNT the A functions just convert the input string to UTF-16 using the current locale of Windows and then call the W version of the function, so don't make the mistake of calling A functions with UTF-8 strings. It will work if the string contains just ASCII characters (<128), but it won't work with any special chars! On Windows, always call the W functions directly with UTF-16 strings: either store the strings as UTF-16 with a terminating null, or store UTF-8 and write a UTF-16 converter method for your string class that returns a temporary UTF-16 converted string!
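A minimal sketch of such a converter method, using the Win32 MultiByteToWideChar() API with CP_UTF8 (the helper name is my own, and error handling is trimmed for brevity):

#include <windows.h>
#include <string>

// Converts an in-memory utf-8 string to a temporary utf-16 string
// suitable for passing to the W functions.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}

// Usage: always call the W function directly with the converted string, e.g.
//   DrawTextW(hdc, Utf8ToUtf16(s).c_str(), -1, &rc, DT_LEFT);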

The conclusion is that you can simply read/write text from/to files as binary data; the encoding matters only when someone starts processing the binary data as a sequence of Unicode characters. Even if you read in the text file as one big chunk of UTF-encoded binary data, you can easily split it into lines (along chr(10) bytes) without processing the actual Unicode characters on the lines, or you can easily process a localization text file whose lines hold key=value pairs without caring about UTF at all, because all you have to do is split each line into two parts along an ASCII character ('=').
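For example, a key=value line can be split without decoding anything, because '=' (0x3D) can never appear inside a multi-byte UTF-8 sequence (a sketch with my own helper name):

#include <string>
#include <utility>

// Splits "key=value" at the first '=' without decoding the utf-8 payload.
std::pair<std::string, std::string> SplitKeyValue(const std::string& line)
{
    size_t pos = line.find('=');
    if (pos == std::string::npos)
        return std::make_pair(line, std::string());  // no '=': whole line is the key
    return std::make_pair(line.substr(0, pos), line.substr(pos + 1));
}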

Another interesting thing is that not all byte sequences (binary data) can be interpreted as a valid UTF-8 or UTF-16 string! It is worth validating a string when you read it from a file; I usually validate strings at runtime only in debug builds, to keep the release builds fast. In some cases you may need runtime validation even in release builds, but that is rare.
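Here is a minimal validation sketch for UTF-8 (my own helper, not from the answer); note that a production validator would also reject overlong encodings and surrogate codepoints, which this simplified check does not:

#include <cassert>
#include <cstddef>
#include <cstdint>

// Checks utf-8 sequence lengths and continuation bytes.
bool IsValidUtf8(const uint8_t* p, size_t n)
{
    size_t i = 0;
    while (i < n)
    {
        uint8_t b = p[i];
        size_t extra;
        if      (b < 0x80)           extra = 0;     // ascii
        else if ((b & 0xE0) == 0xC0) extra = 1;     // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2;     // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3;     // 4-byte sequence
        else                         return false;  // stray continuation byte
        if (i + extra >= n)          return false;  // truncated sequence
        for (size_t k = 1; k <= extra; ++k)
            if ((p[i + k] & 0xC0) != 0x80)
                return false;                       // bad continuation byte
        i += extra + 1;
    }
    return true;
}

// As suggested above, run it only in debug builds:
//   assert(IsValidUtf8(data, size));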

EDIT: Of course, if you want to use the standard library to detect the actual encoding of the file and convert it to the format your program uses (for example UTF-16), then my comments are just details that help you understand what's going on. A text file can store text in several formats. Usually the first few (2-5) bytes of a text file are a special sequence that indicates the encoding of the text that follows; this is called the BOM (Byte Order Mark), and it isn't shown by modern text editors (use a hex editor to check this): http://en.wikipedia.org/wiki/Byte_order_mark
Note that a BOM at the beginning of the file isn't required, but without one a text editor may have a hard time guessing the format (sometimes it's impossible).
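A sketch of such a BOM check (the enum and function are my own; note that FF FE is also the start of a UTF-32LE BOM, which this simplified version ignores):

#include <cstdint>
#include <cstdio>

enum class TextEncoding { Unknown, Utf8, Utf16LE, Utf16BE };

// Peeks at the first bytes of an open file and reports the BOM, if any.
// The caller decides whether to skip bomSize bytes before reading text.
TextEncoding DetectBom(FILE* fp, size_t* bomSize)
{
    uint8_t b[3] = { 0, 0, 0 };
    size_t got = fread(b, 1, 3, fp);
    rewind(fp);
    *bomSize = 0;
    if (got >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) { *bomSize = 3; return TextEncoding::Utf8; }
    if (got >= 2 && b[0] == 0xFF && b[1] == 0xFE)                 { *bomSize = 2; return TextEncoding::Utf16LE; }
    if (got >= 2 && b[0] == 0xFE && b[1] == 0xFF)                 { *bomSize = 2; return TextEncoding::Utf16BE; }
    return TextEncoding::Unknown;   // no BOM: the format has to be guessed
}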

If you create the data files for your program yourself, then you can use a fixed format even without a BOM. We often use UTF-8 without a BOM here, and our program allows no other format.


"\r\n" is an ASCII string, not Unicode. Also why are you reading BYTEs rather than WCHARs?

You should use fgetws() to read Unicode.
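Applied to the code in the question, a sketch might look like this, assuming a Unicode MFC build and a UTF-16LE file; the "ccs=UTF-16LE" mode flag asks the Microsoft CRT to decode the file (and skip a BOM) for you, and szFileName/szaText are the question's own variables:

#include <afx.h>     // CString, CStringArray (MFC)
#include <cstdio>

void ReadUnicodeFile(const CString& szFileName, CStringArray& szaText)
{
    FILE* fp = NULL;
    // "rt" opens in translated text mode; "ccs=UTF-16LE" makes the CRT
    // decode utf-16 so fgetws() returns whole WCHARs.
    if (_wfopen_s(&fp, szFileName, L"rt, ccs=UTF-16LE") != 0 || fp == NULL)
        return;
    WCHAR line[4096];
    CString szData;
    while (fgetws(line, _countof(line), fp) != NULL)
    {
        CString s(line);
        s.Replace(L"\n", L"\r\n");  // text mode stripped '\r'; restore the pair
        szData += s;
    }
    fclose(fp);
    szaText.Add(szData);
}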

See also Handling simple text files in C/C++.


Check the encoding of your file. First, is it really UTF-16? UTF-8 is more common.
If it is UTF-16, then you should be reading wide characters from the file, as Richard has said, comparing each one with the 16-bit value of '\n' (_T('\n') or 0x000A), and adding the wide characters to your string.
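A sketch of that wide-character loop, keeping the shape of the question's code but reading whole WCHARs instead of single bytes (Unicode MFC build and UTF-16LE file assumed; szFileName/szaText are the question's variables):

#include <afx.h>     // CString, CStringArray (MFC)
#include <cstdio>

void ReadUtf16File(const CString& szFileName, CStringArray& szaText)
{
    FILE* fp = NULL;
    if (_wfopen_s(&fp, szFileName, L"rb") != 0 || fp == NULL)
        return;
    CString szData;
    WCHAR wc;
    while (fread(&wc, sizeof(WCHAR), 1, fp) == 1)   // one 16-bit unit at a time
    {
        if (wc == L'\n')                       // compare against the 16-bit 0x000A
            szData += L"\r\n";
        else if (wc != L'\r' && wc != 0xFEFF)  // drop CR and a leading BOM
            szData += wc;
    }
    fclose(fp);
    szaText.Add(szData);
}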

If, on the other hand, it is UTF-8, then certain byte sequences will need to be converted using one of the multi-byte to wide-character functions, specifying an encoding of UTF-8.
In this case it may be that you just happen to be missing the '\n' because it happens to be in the second byte you read.
Also, even if you do see the '\n', you would lose information, because you ignore the second byte you are reading.
Read one byte at a time and add each to a normal 8-bit character string; then, when you find the end of a line, add the "\r\n" and convert the string to wide characters using a conversion function (such as mbstowcs).
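A sketch of that UTF-8 path (on Windows I use MultiByteToWideChar via the Utf8ToUtf16 helper sketched in the first answer rather than mbstowcs, since mbstowcs depends on the current locale):

#include <cstdio>
#include <string>

std::wstring Utf8ToUtf16(const std::string& utf8);  // sketched in the first answer

std::wstring ReadUtf8AsWide(FILE* fp)
{
    std::string bytes;
    int c;
    while ((c = fgetc(fp)) != EOF)
    {
        if (c == '\n')
            bytes += "\r\n";     // safe: 0x0A never occurs inside a multi-byte char
        else if (c != '\r')
            bytes += (char)c;
    }
    return Utf8ToUtf16(bytes);   // convert the whole buffer once at the end
}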

In summary: you need to find out how the file is actually encoded. Saying it is "Unicode" doesn't mean anything, as there are several ways to encode Unicode in a file. UTF-8 is the most common encoding (e.g. on websites, etc.). One reason for this is that the file size is likely to be much smaller (and definitely will be if all the characters are below U+0800, which includes all European scripts).

Regards,
Ian.

