Does the C++ standard mandate poor performance for iostreams, or am I just dealing with a poor implementation?


Question


Every time I mention slow performance of C++ standard library iostreams, I get met with a wave of disbelief. Yet I have profiler results showing large amounts of time spent in iostream library code (full compiler optimizations), and switching from iostreams to OS-specific I/O APIs and custom buffer management does give an order of magnitude improvement.

What extra work is the C++ standard library doing, is it required by the standard, and is it useful in practice? Or do some compilers provide implementations of iostreams that are competitive with manual buffer management?

Benchmarks

To get matters moving, I've written a couple of short programs to exercise the iostreams internal buffering:

Note that the ostringstream and stringbuf versions run fewer iterations because they are so much slower.

On ideone, the ostringstream is about 3 times slower than std::copy + back_inserter + std::vector, and about 15 times slower than memcpy into a raw buffer. This is consistent with before-and-after profiling when I switched my real application to custom buffering.

These are all in-memory buffers, so the slowness of iostreams can't be blamed on slow disk I/O, too much flushing, synchronization with stdio, or any of the other things people use to excuse observed slowness of the C++ standard library iostream.

It would be nice to see benchmarks on other systems and commentary on things common implementations do (such as gcc's libstdc++, Visual C++, Intel C++) and how much of the overhead is mandated by the standard.

Rationale for this test

A number of people have correctly pointed out that iostreams are more commonly used for formatted output. However, they are also the only modern API provided by the C++ standard for binary file access. But the real reason for doing performance tests on the internal buffering applies to the typical formatted I/O: if iostreams can't keep the disk controller supplied with raw data, how can they possibly keep up when they are responsible for formatting as well?

Benchmark Timing

All these are per iteration of the outer (k) loop.

On ideone (gcc-4.3.4, unknown OS and hardware):

  • ostringstream: 53 milliseconds
  • stringbuf: 27 ms
  • vector<char> and back_inserter: 17.6 ms
  • vector<char> with ordinary iterator: 10.6 ms
  • vector<char> iterator and bounds check: 11.4 ms
  • char[]: 3.7 ms

On my laptop (Visual C++ 2010 x86, cl /Ox /EHsc, Windows 7 Ultimate 64-bit, Intel Core i7, 8 GB RAM):

  • ostringstream: 73.4 milliseconds, 71.6 ms
  • stringbuf: 21.7 ms, 21.3 ms
  • vector<char> and back_inserter: 34.6 ms, 34.4 ms
  • vector<char> with ordinary iterator: 1.10 ms, 1.04 ms
  • vector<char> iterator and bounds check: 1.11 ms, 0.87 ms, 1.12 ms, 0.89 ms, 1.02 ms, 1.14 ms
  • char[]: 1.48 ms, 1.57 ms

Visual C++ 2010 x86, with Profile-Guided Optimization cl /Ox /EHsc /GL /c, link /ltcg:pgi, run, link /ltcg:pgo, measure:

  • ostringstream: 61.2 ms, 60.5 ms
  • vector<char> with ordinary iterator: 1.04 ms, 1.03 ms

Same laptop, same OS, using cygwin gcc 4.3.4 g++ -O3:

  • ostringstream: 62.7 ms, 60.5 ms
  • stringbuf: 44.4 ms, 44.5 ms
  • vector<char> and back_inserter: 13.5 ms, 13.6 ms
  • vector<char> with ordinary iterator: 4.1 ms, 3.9 ms
  • vector<char> iterator and bounds check: 4.0 ms, 4.0 ms
  • char[]: 3.57 ms, 3.75 ms

Same laptop, Visual C++ 2008 SP1, cl /Ox /EHsc:

  • ostringstream: 88.7 ms, 87.6 ms
  • stringbuf: 23.3 ms, 23.4 ms
  • vector<char> and back_inserter: 26.1 ms, 24.5 ms
  • vector<char> with ordinary iterator: 3.13 ms, 2.48 ms
  • vector<char> iterator and bounds check: 2.97 ms, 2.53 ms
  • char[]: 1.52 ms, 1.25 ms

Same laptop, Visual C++ 2010 64-bit compiler:

  • ostringstream: 48.6 ms, 45.0 ms
  • stringbuf: 16.2 ms, 16.0 ms
  • vector<char> and back_inserter: 26.3 ms, 26.5 ms
  • vector<char> with ordinary iterator: 0.87 ms, 0.89 ms
  • vector<char> iterator and bounds check: 0.99 ms, 0.99 ms
  • char[]: 1.25 ms, 1.24 ms

EDIT: Ran all twice to see how consistent the results were. Pretty consistent IMO.

NOTE: On my laptop, since I can spare more CPU time than ideone allows, I set the number of iterations to 1000 for all methods. This means that ostringstream and vector reallocation, which takes place only on the first pass, should have little impact on the final results.

EDIT: Oops, found a bug in the vector-with-ordinary-iterator, the iterator wasn't being advanced and therefore there were too many cache hits. I was wondering how vector<char> was outperforming char[]. It didn't make much difference though, vector<char> is still faster than char[] under VC++ 2010.

Conclusions

Buffering of output streams requires three steps each time data is appended:

  • Check that the incoming block fits the available buffer space.
  • Copy the incoming block.
  • Update the end-of-data pointer.
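The three steps above can be sketched over a raw buffer in a few lines (an illustrative type, not code from the benchmark):

```cpp
#include <cstddef>
#include <cstring>

// Minimal append-only output buffer showing the three steps:
// check space, copy, advance the end-of-data pointer.
struct OutBuffer {
    char*       data;  // start of buffer storage
    std::size_t size;  // total capacity
    std::size_t used;  // current end-of-data offset

    // Returns false if the block does not fit; a real stream would
    // grow the buffer or flush it here instead of failing.
    bool append(const char* block, std::size_t n) {
        if (used + n > size)                 // 1. check available space
            return false;
        std::memcpy(data + used, block, n);  // 2. copy the incoming block
        used += n;                           // 3. advance end-of-data
        return true;
    }
};
```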

The latest code snippet I posted, "vector<char> simple iterator plus bounds check" not only does this, it also allocates additional space and moves the existing data when the incoming block doesn't fit. As Clifford pointed out, buffering in a file I/O class wouldn't have to do that, it would just flush the current buffer and reuse it. So this should be an upper bound on the cost of buffering output. And it's exactly what is needed to make a working in-memory buffer.
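Clifford's flush-and-reuse scheme can be sketched like this (a hypothetical class; the sink callback stands in for a `write()` system call):

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>

// Fixed-size output buffer that flushes to a sink instead of
// reallocating, the way a file I/O buffer would.
class FlushingBuffer {
    char buf_[4096];
    std::size_t used_ = 0;
    std::function<void(const char*, std::size_t)> sink_;
public:
    explicit FlushingBuffer(std::function<void(const char*, std::size_t)> sink)
        : sink_(std::move(sink)) {}

    void append(const char* block, std::size_t n) {
        if (used_ + n > sizeof buf_)
            flush();                      // make room by flushing, not growing
        if (n > sizeof buf_) {            // oversized block: bypass the buffer
            sink_(block, n);
            return;
        }
        std::memcpy(buf_ + used_, block, n);
        used_ += n;
    }

    void flush() {
        if (used_) {
            sink_(buf_, used_);
            used_ = 0;                    // reuse the same storage
        }
    }
};
```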

So why is stringbuf 2.5x slower on ideone, and at least 10 times slower when I test it? It isn't being used polymorphically in this simple micro-benchmark, so that doesn't explain it.

Solution

Not answering the specifics of your question so much as the title: the 2006 Technical Report on C++ Performance has an interesting section on IOStreams (p.68). Most relevant to your question is Section 6.1.2 ("Execution Speed"):

Since certain aspects of IOStreams processing are distributed over multiple facets, it appears that the Standard mandates an inefficient implementation. But this is not the case — by using some form of preprocessing, much of the work can be avoided. With a slightly smarter linker than is typically used, it is possible to remove some of these inefficiencies. This is discussed in §6.2.3 and §6.2.5.

Since the report was written in 2006 one would hope that many of the recommendations would have been incorporated into current compilers, but perhaps this is not the case.

As you mention, facets may not feature in write() (but I wouldn't assume that blindly). So what does feature? Running GProf on your ostringstream code compiled with GCC gives the following breakdown:

  • 44.23% in std::basic_streambuf<char>::xsputn(char const*, int)
  • 34.62% in std::ostream::write(char const*, int)
  • 12.50% in main
  • 6.73% in std::ostream::sentry::sentry(std::ostream&)
  • 0.96% in std::string::_M_replace_safe(unsigned int, unsigned int, char const*, unsigned int)
  • 0.96% in std::basic_ostringstream<char>::basic_ostringstream(std::_Ios_Openmode)
  • 0.00% in std::fpos<int>::fpos(long long)

So the bulk of the time is spent in xsputn, which eventually calls std::copy() after lots of checking and updating of cursor positions and buffers (have a look in c++\bits\streambuf.tcc for the details).
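For illustration, the overall shape of that code path is roughly as follows. This is a simplified sketch, not the actual libstdc++ source: copy what fits into the put area, fall back to `overflow()` for the rest.

```cpp
#include <algorithm>
#include <cstddef>

// Simplified sketch of what a streambuf-style xsputn does.
// Not the actual library implementation.
struct MiniStreambuf {
    char* pbase;  // start of put area
    char* pptr;   // current put position
    char* epptr;  // end of put area

    MiniStreambuf(char* begin, char* end)
        : pbase(begin), pptr(begin), epptr(end) {}
    virtual ~MiniStreambuf() = default;

    // A real streambuf would flush or grow the buffer here;
    // this stand-in just reports failure.
    virtual int overflow(int /*c*/) { return -1; }

    std::ptrdiff_t xsputn(const char* s, std::ptrdiff_t n) {
        std::ptrdiff_t written = 0;
        while (written < n) {
            std::ptrdiff_t avail = epptr - pptr;
            if (avail > 0) {
                // Copy as much as fits, then advance the cursor.
                std::ptrdiff_t chunk = std::min(avail, n - written);
                std::copy(s + written, s + written + chunk, pptr);
                pptr += chunk;
                written += chunk;
            } else if (overflow(static_cast<unsigned char>(s[written])) == -1) {
                break;  // no room and overflow failed: stop early
            } else {
                ++written;
            }
        }
        return written;
    }
};
```

The per-call bookkeeping (the space check, the cursor updates, the virtual `overflow` fallback) is exactly the overhead the profile attributes to `xsputn`.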

My take on this is that you've focused on the worst-case situation. All the checking that is performed would be a small fraction of the total work done if you were dealing with reasonably large chunks of data. But your code is shifting data four bytes at a time, incurring all the extra costs each time. Clearly one would avoid doing so in a real-life situation - consider how negligible the penalty would have been if write had been called on an array of 1m ints instead of 1m times on one int. And in a real-life situation one would really appreciate the important features of IOStreams, namely their memory-safe and type-safe design. Such benefits come at a price, and you've written a test which makes these costs dominate the execution time.
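The two call patterns can be made concrete. Both functions below produce identical bytes, but the first pays the per-call overhead (sentry construction, space checks) once, while the second pays it once per element (the function names are illustrative):

```cpp
#include <sstream>
#include <string>
#include <vector>

// One bulk write of the whole array: per-call overhead paid once.
std::string write_bulk(const std::vector<int>& v) {
    std::ostringstream os;
    os.write(reinterpret_cast<const char*>(v.data()),
             static_cast<std::streamsize>(v.size() * sizeof(int)));
    return os.str();
}

// One write per element: per-call overhead paid v.size() times.
std::string write_per_element(const std::vector<int>& v) {
    std::ostringstream os;
    for (int x : v)
        os.write(reinterpret_cast<const char*>(&x), sizeof x);
    return os.str();
}
```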
