Reading and writing files in Cyrillic in C++

Question

I have to first read a file in Cyrillic, then pick a random number of lines at random and write the modified text to a different file. There is no problem with Latin letters, but I run into a problem with Cyrillic text, because I get rubbish. This is how I tried to do it.

Say the file input.txt contains:

ааааааа
ббббббб
ввввввв

I have to read it and put every line into a vector:

vector<wstring> inputVector;
wstring inputString, result;
wifstream inputStream;
inputStream.open("input.txt");
while(!inputStream.eof())
{
    getline(inputStream, inputString);              
    inputVector.push_back(inputString);
}
inputStream.close();    

srand(time(NULL));
int numLines = rand() % inputVector.size();
for(int i = 0; i < numLines; i++)
{
    int randomLine = rand() % inputVector.size();
    result += inputVector[randomLine];
}

wofstream resultStream;
resultStream.open("result.txt");
resultStream << result;
resultStream.close();

So how can I work with Cyrillic so that it produces readable text, not just symbols?

Recommended answer

Because you saw something like ■a a a a a a a 1♦1♦1♦1♦1♦1♦1♦ 2♦2♦2♦2♦2♦2♦2♦ printed to the console, it appears that input.txt is encoded in UTF-16, probably UTF-16 LE with a BOM. You can use your original code if you change the encoding of the file to UTF-8.
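One way to check that guess is to look at the first bytes of the file for a byte-order mark. The following is a minimal sketch; the helper names detect_bom and read_prefix are made up for illustration and are not part of the original answer.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: names the encoding suggested by a leading byte-order mark.
// Note: a UTF-32 LE BOM (FF FE 00 00) also starts with FF FE and is not
// distinguished here.
std::string detect_bom(const std::string& bytes)
{
    auto starts_with = [&bytes](const std::string& prefix) {
        return bytes.compare(0, prefix.size(), prefix) == 0;
    };
    if (starts_with("\xEF\xBB\xBF")) return "UTF-8 with BOM";
    if (starts_with("\xFF\xFE"))     return "UTF-16 LE";
    if (starts_with("\xFE\xFF"))     return "UTF-16 BE";
    return "no BOM";
}

// Hypothetical helper: reads up to `count` raw bytes from the start of a file.
// Returns an empty string if the file cannot be opened.
std::string read_prefix(const char* path, std::size_t count = 4)
{
    std::ifstream in(path, std::ios::binary);
    std::string prefix(count, '\0');
    in.read(&prefix[0], static_cast<std::streamsize>(count));
    prefix.resize(static_cast<std::size_t>(in.gcount()));
    return prefix;
}
```

Calling detect_bom(read_prefix("input.txt")) would then report "UTF-16 LE" for a file like the one described. A file without a BOM can still be UTF-16, so "no BOM" is not proof of UTF-8.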

The reason for using UTF-8 is that, regardless of the character type of the file stream, basic_fstream's underlying basic_filebuf uses a codecvt object to convert between the stream of char objects stored in the file and the stream's own character type; i.e. when reading, the char stream read from the file is converted to a wchar_t stream, and when writing, the wchar_t stream is converted to a char stream that is then written to the file. In the case of std::wifstream, the codecvt object is an instance of the standard std::codecvt<wchar_t, char, mbstate_t>, which generally converts between a multibyte narrow encoding such as UTF-8 and the platform's wide encoding (UTF-16 on Windows), depending on the stream's locale.
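Because the conversion depends on the stream's locale, it can help to install a UTF-8 facet explicitly rather than rely on the default. The sketch below is one possible approach, not the answer's own method: it uses std::codecvt_utf8 from <codecvt> (deprecated since C++17 but still shipped by the major standard libraries), and the helper names read_utf8_lines and write_utf8_lines are made up for illustration.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: reads all lines of a UTF-8 file into wide strings.
std::vector<std::wstring> read_utf8_lines(const char* path)
{
    std::wifstream in;
    // Imbue before opening so the facet is in place for the first read.
    // The locale takes ownership of the facet allocated with new.
    in.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    in.open(path);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Hypothetical helper: writes wide strings to a file as UTF-8, one per line.
void write_utf8_lines(const char* path, const std::vector<std::wstring>& lines)
{
    std::wofstream out;
    out.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    out.open(path);
    for (const std::wstring& line : lines)
        out << line << L'\n';
}
```

With these helpers, the vector in the question could be filled by read_utf8_lines("input.txt") and the result written with write_utf8_lines("result.txt", ...). Where wchar_t is 16 bits, codecvt_utf8<wchar_t> only covers the Basic Multilingual Plane, which is enough for Cyrillic.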

From the MSDN documentation page for basic_filebuf:

Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer.

Similarly, when reading a Unicode string (containing wchar_t characters), the basic_filebuf converts the ANSI string read from the file to the wchar_t string returned to getline and other read operations.

If you change the encoding of input.txt to UTF-8, your original program should work correctly.
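A text editor can usually re-save the file as UTF-8. If the conversion has to happen in code, the sketch below shows one way to turn UTF-16 LE bytes into UTF-8 by hand. The helper names append_utf8 and utf16le_to_utf8 are made up for illustration, and the code assumes well-formed input (it does not validate surrogates).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: appends one Unicode code point to a UTF-8 string.
void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts UTF-16 LE bytes (with or without a BOM) to UTF-8.
// Unpaired surrogates are passed through unchecked; a sketch, not a validator.
std::string utf16le_to_utf8(const std::string& bytes)
{
    // Reads the 16-bit little-endian code unit starting at byte offset pos.
    auto unit = [&bytes](std::size_t pos) -> std::uint32_t {
        return static_cast<unsigned char>(bytes[pos]) |
               (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[pos + 1])) << 8);
    };

    std::string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF)
        i = 2;  // skip the byte-order mark

    while (i + 1 < bytes.size()) {
        std::uint32_t cp = unit(i);
        i += 2;
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size()) {
            std::uint32_t low = unit(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The input would be the raw contents of input.txt read through a std::ifstream opened with std::ios::binary, and the returned string would be written back the same way.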

For reference, this works for me:

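#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    using namespace std;

    // Read every line of input.txt (expected to be UTF-8) into a vector.
    vector<wstring> inputVector;
    wstring inputString, result;
    wifstream inputStream;
    inputStream.open("input.txt");
    // Test getline itself; looping on eof() can append a spurious empty line.
    while(getline(inputStream, inputString))
    {
        inputVector.push_back(inputString);
    }
    inputStream.close();

    // rand() % 0 is undefined, so stop if nothing was read.
    if(inputVector.empty())
    {
        return EXIT_FAILURE;
    }

    // Concatenate a random number of randomly chosen lines.
    srand(static_cast<unsigned>(time(NULL)));
    int numLines = rand() % inputVector.size();
    for(int i = 0; i < numLines; i++)
    {
        int randomLine = rand() % inputVector.size();
        result += inputVector[randomLine];
    }

    // Write the result; the stream converts wchar_t back to char on output.
    wofstream resultStream;
    resultStream.open("result.txt");
    resultStream << result;
    resultStream.close();

    return EXIT_SUCCESS;
}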

Note that the encoding of result.txt will also be UTF-8 (generally).
