Reading and writing files in Cyrillic in C++
Question
I first have to read a file containing Cyrillic text, then pick a random number of lines at random and write the modified text to a different file. There is no problem with Latin letters, but with Cyrillic text I get garbage. This is what I tried.
Say the file input.txt contains:
ааааааа
ббббббб
ввввввв
I have to read it and put every line into a vector:
vector<wstring> inputVector;
wstring inputString, result;
wifstream inputStream;
inputStream.open("input.txt");
while(!inputStream.eof())
{
    getline(inputStream, inputString);
    inputVector.push_back(inputString);
}
inputStream.close();

srand(time(NULL));
int numLines = rand() % inputVector.size();
for(int i = 0; i < numLines; i++)
{
    int randomLine = rand() % inputVector.size();
    result += inputVector[randomLine];
}

wofstream resultStream;
resultStream.open("result.txt");
resultStream << result;
resultStream.close();
So how can I work with Cyrillic text so that the output is readable, not just symbols?
Recommended answer
Because you saw something like ■a a a a a a a 1♦1♦1♦1♦1♦1♦1♦ 2♦2♦2♦2♦2♦2♦2♦ printed to the console, it appears that input.txt is encoded in UTF-16, probably UTF-16 LE with a BOM. You can use your original code if you change the encoding of the file to UTF-8.
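One way to confirm the diagnosis above is to look at the first bytes of the file, since a UTF-16 LE byte order mark is the byte pair FF FE. The following is a minimal sketch, not part of the original answer; the names `Bom` and `detectBom` are hypothetical, and the check deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```cpp
#include <fstream>
#include <string>

// Hypothetical helper type: the byte order marks this sketch recognises.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Reads up to three bytes from the start of the file and compares them
// against the well-known BOM byte sequences. A file that cannot be opened
// or that has no BOM yields Bom::None.
Bom detectBom(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    const std::streamsize n = in.gcount();  // bytes actually read

    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return Bom::Utf8;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Bom::Utf16LE;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}
```

If `detectBom("input.txt")` returns `Bom::Utf16LE`, the file needs to be re-encoded before the wide-stream code in the question can read it as intended. A file without a BOM can still be UTF-16, so `Bom::None` is not proof of UTF-8.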
The reason for using UTF-8 is that, regardless of the character type of the file stream, the basic_filebuf underlying basic_fstream uses a codecvt object to convert between the stream of char objects stored in the file and the stream's own character type. When reading, the char stream read from the file is converted to a wchar_t stream; when writing, the wchar_t stream is converted to a char stream that is then written to the file. In the case of std::wifstream, the codecvt object is an instance of the standard std::codecvt<wchar_t, char, mbstate_t>, which generally converts UTF-8 to UTF-16 (the exact narrow and wide encodings depend on the platform and on the locale imbued in the stream).
As explained on the MSDN documentation page for basic_filebuf:
Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer.
Similarly, when reading a Unicode string (containing wchar_t characters), the basic_filebuf converts the ANSI string read from the file into the wchar_t string returned to getline and other read operations.
If you change the encoding of input.txt to UTF-8, your original program should work correctly.
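Re-encoding the file is usually done in a text editor, but it can also be done in code. The following is a rough sketch of a hand-written UTF-16 LE to UTF-8 conversion that uses only plain byte streams; the function names (`appendUtf8`, `utf16leToUtf8`, `convertFile`) are made up for this illustration, and unpaired surrogates are passed through without validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// Append one Unicode code point to `out` as a UTF-8 byte sequence.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert raw UTF-16 LE bytes to UTF-8, dropping a leading BOM if present.
std::string utf16leToUtf8(const std::string& bytes)
{
    // Assemble the 16-bit code unit stored little-endian at `pos`.
    auto unit = [&bytes](std::size_t pos) -> std::uint32_t {
        const std::uint32_t lo = static_cast<unsigned char>(bytes[pos]);
        const std::uint32_t hi = static_cast<unsigned char>(bytes[pos + 1]);
        return lo | (hi << 8);
    };

    std::string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF) i = 2;  // skip the BOM

    for (; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            const std::uint32_t low = unit(i + 2);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        appendUtf8(out, cp);
    }
    return out;
}

// Read inPath as UTF-16 LE and write its UTF-8 equivalent to outPath.
bool convertFile(const std::string& inPath, const std::string& outPath)
{
    std::ifstream in(inPath, std::ios::binary);
    if (!in) return false;
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    std::ofstream out(outPath, std::ios::binary);
    out << utf16leToUtf8(bytes);
    return static_cast<bool>(out);
}
```

For example, `convertFile("input.txt", "input-utf8.txt")` would produce a copy that the program can then open instead of the original; the output file name here is only a placeholder.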
For reference, this works for me:
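#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    using namespace std;

    // Read every line of the UTF-8 encoded input file.
    vector<wstring> inputVector;
    wstring inputString, result;
    wifstream inputStream;
    inputStream.open("input.txt");
    // Loop on getline itself so a failed read is not stored as an extra line.
    while (getline(inputStream, inputString))
    {
        inputVector.push_back(inputString);
    }
    inputStream.close();

    // Nothing to pick from; this also avoids a modulo by zero below.
    if (inputVector.empty())
        return EXIT_FAILURE;

    // Append a random number of randomly chosen lines to the result.
    srand(static_cast<unsigned>(time(NULL)));
    int numLines = rand() % inputVector.size();
    for (int i = 0; i < numLines; i++)
    {
        int randomLine = rand() % inputVector.size();
        result += inputVector[randomLine];
    }

    // Write the result; the stream converts wchar_t back to char on output.
    wofstream resultStream;
    resultStream.open("result.txt");
    resultStream << result;
    resultStream.close();

    return EXIT_SUCCESS;
}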
Note that the encoding of result.txt will also be UTF-8 (generally).
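Since both files end up as UTF-8, the conversion to wchar_t can also be skipped altogether. As a sketch of that alternative (not what the answer above does): the program only moves whole lines around, and in UTF-8 the newline byte never occurs inside a multi-byte character, so the lines can be handled as opaque byte strings with narrow streams. The names `readLines` and `pickRandomLines` are hypothetical, and unlike the original code this version appends a newline after each picked line.

```cpp
#include <cstddef>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Read a UTF-8 file line by line without decoding it. Binary mode keeps the
// bytes untouched; a trailing '\r' from Windows line endings is removed.
std::vector<std::string> readLines(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        lines.push_back(line);
    }
    return lines;
}

// Concatenate a random number (0 .. size-1) of randomly chosen lines,
// mirroring the selection logic of the original program.
std::string pickRandomLines(const std::vector<std::string>& lines,
                            std::mt19937& rng)
{
    std::string result;
    if (lines.empty()) return result;

    std::uniform_int_distribution<std::size_t> dist(0, lines.size() - 1);
    const std::size_t count = dist(rng);
    for (std::size_t i = 0; i < count; ++i) {
        result += lines[dist(rng)];
        result += '\n';  // assumption: one picked line per output line
    }
    return result;
}
```

A caller would seed a `std::mt19937` (for instance from `std::random_device`), pass the result of `readLines("input.txt")` to `pickRandomLines`, and write the returned string to result.txt through a `std::ofstream` opened in binary mode. If the input starts with a UTF-8 BOM, those three bytes stay attached to the first line.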