S3 parallel read and write performance?


Question

Consider a scenario where Spark (or any other Hadoop framework) reads a large (say 1 TB) file from S3. How do multiple Spark executors read such a very large file in parallel from S3? In HDFS this very large file would be distributed across multiple nodes, with each node holding a block of data. In object storage I presume the entire file will sit on a single node (ignoring replicas). This should drastically reduce the read throughput/performance.

Similarly, large file writes should also be much faster in HDFS than in S3, because writes in HDFS would be spread across multiple hosts, whereas all the data has to go through one host in S3 (ignoring replication for brevity).

So does this mean the performance of S3 is significantly worse than HDFS in the big data world?

Answer

Yes, S3 is slower than HDFS. But it's interesting to look at why, and how to mitigate the impact. Key thing: if you are reading a lot more data than writing, then read performance is critical; the S3A connector in Hadoop 2.8+ really helps there, as it was tuned for reading Parquet/ORC files based on traces of real benchmarks. Write performance also suffers, and the more data you generate, the worse it gets. People complain about that when they should really be worrying about the fact that, without special effort, you may actually end up with invalid output. That's generally the more important issue, just less obvious.

Reading from S3 suffers due to:

  • Bandwidth between S3 and your VM. The more you pay for an EC2 VM, the more network bandwidth you get, and the better.
  • Latency of HEAD/GET/LIST requests, especially all those used to make the object store look like a filesystem with directories. This can particularly hurt the partitioning phase of a query, when all the source files are listed and those to actually read are identified.
  • The cost of seek() being awful if the HTTP connection for a read is aborted and a new one renegotiated. Without a connector that has optimized seek() for this, ORC and Parquet input suffers badly. The S3A connector in Hadoop 2.8+ does precisely this if you set fs.s3a.experimental.fadvise to random; see the sketch after this list.
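
As an illustration, here is a minimal Scala sketch of applying that setting in Spark. The SparkSession variable spark and the bucket path are assumptions for the example, not part of the original answer:

    // Sketch: switch S3A to its random-IO input policy before reading
    // columnar data. Assumes a Hadoop 2.8+ S3A connector on the classpath
    // and an existing SparkSession named `spark`.
    spark.sparkContext.hadoopConfiguration
      .set("fs.s3a.experimental.fadvise", "random")

    // Hypothetical Parquet dataset; seeks inside it now use ranged GETs
    // instead of aborting and renegotiating the HTTP connection.
    val df = spark.read.parquet("s3a://my-bucket/tables/events")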

Spark will split up work on a file if the format is splittable and whatever compression format is used is also splittable (gz isn't, snappy is). It will do it on the block size, which is something you can configure/tune for a specific job (fs.s3a.block.size). If more than one client reads the same file, then yes, you get some overload of the disk IO to that file, but it's generally minor compared to the rest. One little secret: for multipart-uploaded files, reading the separate parts seems to avoid this, so upload and download with the same configured block size.
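
A hedged sketch of tuning that split size follows; the 128 MB figure is illustrative, not a recommendation, and spark is again an assumed SparkSession:

    // Sketch: set the block size S3A reports to Spark, which determines
    // how the file is split into tasks. 128 MB is an arbitrary example.
    spark.sparkContext.hadoopConfiguration
      .set("fs.s3a.block.size", (128L * 1024 * 1024).toString)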

Writing suffers due to:

  • Caching of some/many MB of data in blocks before upload, with the upload not starting until the write is completed. S3A on Hadoop 2.8+: set fs.s3a.fast.upload = true; see the sketch after this list.
  • Network upload bandwidth, again a function of the VM type you pay for.
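
A minimal sketch of enabling that, assuming Hadoop 2.8+ and a SparkSession named spark; the buffer choice is an assumption added for the example:

    // Sketch: enable incremental block upload so data streams to S3
    // while the output is still being written, instead of waiting for close().
    val conf = spark.sparkContext.hadoopConfiguration
    conf.set("fs.s3a.fast.upload", "true")
    // Assumption: buffer pending blocks on local disk rather than on the heap.
    conf.set("fs.s3a.fast.upload.buffer", "disk")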

When output is committed by rename() of the files written to a temporary location, the time to copy each object to its final path is 6-10 MB/s.
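
To put that in perspective with illustrative numbers based on the rate quoted above: committing 100 GB of output at, say, 8 MB/s means roughly 100,000 MB ÷ 8 MB/s ≈ 12,500 seconds, or about three and a half hours, spent just copying data to its final path.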

A bigger issue is that S3 is very bad at handling inconsistent directory listings or failures of tasks during the commit process. You cannot safely use S3 as a direct destination of work with the normal rename-by-commit algorithm without something to give you a consistent view of the store (consistent EMRFS, s3mper, S3Guard).
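
As one hedged example of such a consistency layer, S3Guard (Hadoop 2.9+/3.0+) is switched on through the S3A configuration; the deployment details (DynamoDB table, region, credentials) are omitted here:

    // Sketch: point S3A at the S3Guard DynamoDB metadata store so that
    // directory listings are consistent during commits. Assumes the
    // DynamoDB table is already created and configured.
    spark.sparkContext.hadoopConfiguration.set(
      "fs.s3a.metadatastore.impl",
      "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")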

For maximum performance and safe committing of work, you need an output committer optimized for S3. Databricks have their own thing there, and Apache Hadoop 3.1 adds the "S3A output committer". EMR now apparently has something here too.
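
A sketch of wiring one of those up, assuming Hadoop 3.1+ with Spark's spark-hadoop-cloud module on the classpath; the "directory" staging committer is just one of the available choices:

    // Sketch: build a session that commits work through the S3A committers
    // instead of rename(). Class names come from the spark-hadoop-cloud module.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-committer-example")
      .config("spark.hadoop.fs.s3a.committer.name", "directory")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()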

See "A zero rename committer" for the details on that problem. After that, hopefully, you'll either move to a safe commit mechanism or use HDFS as your destination of work.

