What's the most efficient way to read a file into a std::string?


Question

I currently do this, and the conversion to std::string at the end takes 98% of the execution time. There must be a better way!

#include <fstream>
#include <string>
#include <vector>

std::string
file2string(std::string filename)
{
    std::ifstream file(filename.c_str());
    if(!file.is_open()){
        // If they passed a bad file name, or one we have no read access to,
        // we pass back an empty string.
        return "";
    }
    // find out how much data there is
    file.seekg(0,std::ios::end);
    std::streampos length = file.tellg();
    file.seekg(0,std::ios::beg);
    // Get a vector of that size to use as the buffer
    std::vector<char> buf(length);
    // Fill the buffer with the file's contents
    file.read(&buf[0],length);
    file.close();
    // return the buffer as a string (this copy is the slow step in question)
    std::string s(buf.begin(),buf.end());
    return s;
}


Recommended answer

Being a big fan of the C++ iterator abstraction and the algorithms, I would love the following to be the fastest way to read a file (or any other input stream) into a std::string (and then print the content):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    std::string s(std::istreambuf_iterator<char>(std::ifstream("file")
                                                 >> std::skipws),
                  std::istreambuf_iterator<char>());
    std::cout << "file='" << s << "'\n";
}

This certainly is fast for my own implementation of IOStreams, but it requires a lot of trickery to actually make it fast. Primarily, it requires optimizing the algorithms to cope with segmented sequences: a stream can be seen as a sequence of input buffers. I'm not aware of any STL implementation that does this optimization consistently. The odd use of std::skipws is just to get a reference to the just-created stream: std::istreambuf_iterator<char> expects a reference, to which the temporary file stream wouldn't bind.

Since this probably isn't the fastest approach, I would be inclined to use std::getline() with a particular "newline" character, i.e. one which isn't in the file:

std::string s;
// optionally reserve space, although I wouldn't be too fussed about the
// reallocations because the reads probably dominate the performance
// ('\0' rather than 0: the delimiter must be a char for template deduction)
std::getline(std::ifstream("file") >> std::skipws, s, '\0');

This assumes that the file doesn't contain a null character. Any other character would do as well. Unfortunately, std::getline() takes a char_type as the delimiting argument, rather than an int_type, which is what the member std::istream::getline() takes for the delimiter: in that case you could use eof() as a character which never occurs (char_type, int_type, and eof() refer to the respective members of char_traits<char>). The member version, in turn, can't be used because you would need to know ahead of time how many characters are in the file.
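
To make the delimiter assumption concrete, here is a minimal sketch. The helper name read_until is hypothetical and exists only for this illustration; it wraps the non-member std::getline() so the behavior can be tried on an in-memory stream. If the delimiter never occurs, the whole stream ends up in the string; if it does occur, everything after it is left unread.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper for illustration: reads everything up to the first
// occurrence of `delim`, or up to the end of the stream if `delim` never occurs.
std::string read_until(std::istream& in, char delim)
{
    std::string s;
    std::getline(in, s, delim);  // the delimiter itself is extracted and discarded
    return s;
}
```

With a stream holding "line1\nline2\n" and '\0' as the delimiter, the newlines are kept and the full content is returned. With a stream holding the five characters a, b, '\0', c, d, only "ab" comes back, which is exactly the failure mode to watch for with binary files.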

BTW, I saw some attempts to use seeking to determine the size of the file. This is bound not to work too well. The problem is that the code conversion done in std::ifstream (well, actually in std::filebuf) can produce a different number of characters than there are bytes in the file. Admittedly, this isn't the case when using the default C locale, and it is possible to detect that the locale doesn't do any conversion. Otherwise, the best bet for the stream would be to run over the file and determine the number of characters being produced. I actually think this is what would need to be done when the code conversion could do something interesting, although I don't think it actually is done. However, none of the examples explicitly sets up the C locale, using e.g. std::locale::global(std::locale("C"));. Even with this, it is also necessary to open the file in std::ios_base::binary mode, because otherwise end-of-line sequences may be replaced by a single character when reading. Admittedly, this would only make the result shorter, never longer.
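
For completeness, a seek-based read under the conditions just described might look like the sketch below. It is not a recommendation from the answer: it assumes the file is opened in binary mode and that the stream's locale performs no code conversion, so that the byte count from tellg() matches the number of characters read. The function name read_file_binary is made up for this example. It reads directly into the string, so there is no intermediate vector to copy from.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads a whole file into a std::string.
// Assumption: binary mode and no code conversion, so bytes == characters.
std::string read_file_binary(const std::string& filename)
{
    std::ifstream file(filename.c_str(),
                       std::ios_base::in | std::ios_base::binary);
    if (!file) {
        return std::string();  // could not open: return an empty string
    }
    file.seekg(0, std::ios_base::end);
    const std::streamoff length = file.tellg();  // -1 if the stream failed
    if (length <= 0) {
        return std::string();  // empty file or size unavailable
    }
    file.seekg(0, std::ios_base::beg);
    // Size the string up front and read straight into its storage.
    std::string s(static_cast<std::string::size_type>(length), '\0');
    file.read(&s[0], static_cast<std::streamsize>(length));
    // Trim to what was actually read, in case the read came up short.
    s.resize(static_cast<std::string::size_type>(file.gcount()));
    return s;
}
```

The resize after the read keeps the result consistent even if fewer characters arrive than the seek suggested.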

The other approaches using extraction from a std::streambuf* (i.e. those involving rdbuf()) all require that the resulting content is copied at some point. Given that the file may actually be very large, this may not be an option. Without the copy, this could very well be the fastest approach, however. To avoid the copy, it would be possible to create a simple custom stream buffer which takes a reference to a std::string as constructor argument and directly appends to this std::string:

#include <fstream>
#include <iostream>
#include <string>

class custombuf:
    public std::streambuf
{
public:
    custombuf(std::string& target): target_(target) {
        this->setp(this->buffer_, this->buffer_ + bufsize - 1);
    }

private:
    std::string& target_;
    enum { bufsize = 8192 };
    char buffer_[bufsize];
    int overflow(int c) {
        if (!traits_type::eq_int_type(c, traits_type::eof()))
        {
            *this->pptr() = traits_type::to_char_type(c);
            this->pbump(1);
        }
        this->target_.append(this->pbase(), this->pptr() - this->pbase());
        this->setp(this->buffer_, this->buffer_ + bufsize - 1);
        return traits_type::not_eof(c);
    }
    int sync() { this->overflow(traits_type::eof()); return 0; }
};

int main()
{
    std::string s;
    custombuf   sbuf(s);
    if (std::ostream(&sbuf)
        << std::ifstream("readfile.cpp").rdbuf()
        << std::flush) {
        std::cout << "file='" << s << "'\n";
    }
    else {
        std::cout << "failed to read file\n";
    }
}

At least with a suitably chosen buffer size, I would expect this version to be fairly fast. Which version is the fastest will certainly depend on the system, the standard C++ library being used, and probably a number of other factors, i.e. you want to measure the performance.
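
As a starting point for such a measurement, a small timing helper could look like the following sketch. It assumes a C++11 compiler for <chrono>; the names time_read_ms and read_with_iterators are hypothetical. A single run like this is noisy (file-system caching in particular will distort it), so a real comparison would repeat each candidate several times on files of realistic size.

```cpp
#include <chrono>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: calls reader(filename) once, stores the result in
// `out`, and returns the elapsed wall-clock time in milliseconds.
template <typename Reader>
double time_read_ms(Reader reader, const std::string& filename, std::string& out)
{
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    out = reader(filename);
    const std::chrono::steady_clock::time_point stop =
        std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// One candidate to measure: the iterator-based approach, using a named
// stream opened in binary mode.
std::string read_with_iterators(const std::string& filename)
{
    std::ifstream in(filename.c_str(), std::ios_base::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Any function taking a file name and returning a std::string can be passed as the reader, so the different versions can be timed with the same harness.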
