Reading a file with Cyrillic text


Problem description

I have to open a file that contains Cyrillic characters. I've encoded the file as UTF-8. Here is an example:

en: Couldn't your family afford a costume for you
  ru: Не ваша семья позволить себе костюм для вас

This is how I open the file:

ifstream readFile(fileData.c_str());
while (!readFile.eof())
{
  std::getline(readFile, buffer);
  ...
}

The first problem: there is some symbol before text 'en' (I saw this in the debugger):

"ï»¿en: least"

The other problem is the Cyrillic characters:

" ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹"

What's wrong?

Solution

there is some symbol before text 'en'

That's a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.

Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) writes it nonetheless. Load the messages file into a text editor and save it back out again as UTF-8, using a "UTF-8 without BOM" encoding if one is specifically listed.

ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹

That's what you get when you've got a UTF-8 byte string (representing наименьший) and you print it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read in the string OK and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.
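
To see that the garbage really is intact UTF-8 viewed through the wrong code page, the misreading can be reproduced on purpose. This is an illustrative sketch only: misreadAsCp1252 and appendUtf8 are hypothetical names, and the unassigned cp1252 slots (0x81, 0x8D, 0x8F, 0x90, 0x9D) are assumed to map to the control character of the same value.

```cpp
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of code point `cp` (assumed <= U+FFFF,
// which covers every character Code Page 1252 can produce).
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: treats every byte of `bytes` as one cp1252
// character and returns that (wrong) text encoded as UTF-8, i.e. what
// a cp1252 viewer shows for the byte string.
std::string misreadAsCp1252(const std::string& bytes)
{
    // cp1252 differs from Latin-1 only in the range 0x80..0x9F.
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    std::string out;
    for (unsigned char b : bytes) {
        std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? high[b - 0x80] : b;
        appendUtf8(out, cp);
    }
    return out;
}
```

Each Cyrillic letter is two bytes in UTF-8, so it turns into two Western characters: the byte pair D0 BD for "н" shows up as "Ð½". ASCII bytes pass through unchanged, which is why the "ru: " prefix still looks right.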

If you're just printing it to the console, this is to be expected, as the console uses a legacy system code page by default, not UTF-8 (a debugger likewise tends to show char strings in the system ANSI code page, which is 1252 on a Western Windows install). If you need to send Unicode to the console you'll have to convert the bytes to native-Unicode wchar_t strings and write them from there. I don't know what the final destination for your strings is, though. If you're just going to write them to another file or something similar, you could keep them as bytes and not care about what encoding they're in.
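
A rough sketch of that conversion, under stated assumptions: utf8ToWide and writeToConsole are hypothetical names, the input is assumed to be valid UTF-8 (no error handling), and the decoder is hand-written only to keep the example portable. On Windows the usual API for the conversion is MultiByteToWideChar with CP_UTF8.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decodes UTF-8 into wchar_t units. Assumes the
// input is well-formed. Where wchar_t is 16 bits (Windows), code
// points above U+FFFF are written as a surrogate pair.
std::wstring utf8ToWide(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes to read
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < utf8.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}

// Hypothetical helper: writes UTF-8 text to the console.
void writeToConsole(const std::string& utf8)
{
#ifdef _WIN32
    // WriteConsoleW takes UTF-16 directly and bypasses the console
    // code page. It only works on a real console handle, not when
    // output is redirected to a file or pipe.
    const std::wstring wide = utf8ToWide(utf8);
    DWORD written = 0;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wide.c_str(),
                  static_cast<DWORD>(wide.size()), &written, nullptr);
#else
    // Assumption: the terminal already expects UTF-8.
    std::cout << utf8;
#endif
}
```

Whether the characters then display correctly still depends on the console font having Cyrillic glyphs.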
