Does Spark support true column scans over parquet files in S3?

Question

One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of those, then it's possible to read only the data that stores those few columns and skip the rest.
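For illustration, here is a minimal sketch of how that column pruning surfaces in Spark; the bucket path and column names are hypothetical, not part of the original question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical session and S3 path, just to show the shape of a pruned read.
val spark = SparkSession.builder().appName("column-pruning-sketch").getOrCreate()
val df = spark.read.parquet("s3a://my-bucket/wide-table/")

// Only the columns referenced below need to be fetched from the Parquet files;
// the physical plan printed by explain() shows the pruned ReadSchema.
df.select("user_id", "event_time")
  .filter(col("event_time") > "2018-01-01")
  .explain()
```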

Presumably this feature works by reading a bit of metadata from the footer of a parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.
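As a rough illustration of that metadata, the sketch below opens a Parquet footer with the parquet-hadoop API and prints where each column chunk starts. The path is hypothetical, and the classes used (HadoopInputFile, ParquetFileReader.open) assume a reasonably recent parquet-hadoop on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

val conf = new Configuration()
val input = HadoopInputFile.fromPath(
  new Path("s3a://my-bucket/wide-table/part-00000.parquet"), conf)

val reader = ParquetFileReader.open(input)
try {
  // Each row group (block) lists its column chunks with byte offsets and sizes,
  // which is what lets a reader seek to just the columns it needs.
  for (block <- reader.getFooter.getBlocks.asScala;
       chunk <- block.getColumns.asScala) {
    println(s"${chunk.getPath} starts at byte ${chunk.getStartingPos}, ${chunk.getTotalSize} bytes")
  }
} finally {
  reader.close()
}
```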

Does anyone know whether spark's default parquet reader correctly implements this kind of selective seeking on S3? I think it's supported by S3, but there's a big difference between theoretical support and an implementation that properly exploits that support.

Answer

This needs to be broken down:

  1. Does the Parquet code get the predicates from Spark? Yes.
  2. Does Parquet then attempt to selectively read only those columns, using the Hadoop FileSystem seek() + read() or readFully(position, buffer, length) calls? Yes.
  3. Does the S3 connector translate these file operations into efficient HTTP GET requests? In Amazon EMR: yes. In Apache Hadoop, you need Hadoop 2.8 on the classpath and need to set the property spark.hadoop.fs.s3a.experimental.fadvise=random to trigger random access (a config sketch follows this list).
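A minimal sketch of that setup, assuming Hadoop 2.8+ S3A on the classpath and using the property name quoted in point 3; the path and column are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Enable the S3A random-access policy at session construction time.
val spark = SparkSession.builder()
  .appName("parquet-s3-random-io")
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()

// The physical plan from explain() should show PushedFilters and a pruned
// ReadSchema, confirming points 1 and 2; point 3 depends on which S3A
// connector version is actually on the classpath.
val df = spark.read.parquet("s3a://my-bucket/wide-table/")
df.select("user_id").filter(col("user_id") > 100).explain()
```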

Hadoop 2.7 and earlier handle aggressive seek() around the file badly, because they always initiate a GET from the current offset to the end of the file, get surprised by the next seek, have to abort that connection and reopen a new TCP/HTTPS 1.1 connection (slow, CPU heavy), and then do it again, repeatedly. Random IO hurts bulk loading of things like .csv.gz, but is critical to getting ORC/Parquet performance.
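Illustrative only: the IO policy is therefore a per-job choice. The property name follows the answer above, and it should be set before the S3A filesystem is first used in the job, since Hadoop caches FileSystem instances:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical switch: "sequential" suits bulk streaming reads of .csv.gz,
// "random" suits columnar ORC/Parquet scans.
val ioPolicy =
  if (sys.env.getOrElse("WORKLOAD", "parquet") == "bulk") "sequential" else "random"

val spark = SparkSession.builder()
  .appName("s3a-io-policy-sketch")
  .config("spark.hadoop.fs.s3a.experimental.fadvise", ioPolicy)
  .getOrCreate()
```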

You don't get the speedup with Hadoop 2.7's hadoop-aws JAR. If you need it, you need to update hadoop*.jar and its dependencies, or build Spark from scratch against Hadoop 2.8.

Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections, etc. That helps you work out what's going on.
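A quick sketch of that trick, assuming an existing SparkSession named spark and a hypothetical bucket:

```scala
import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Grab the (cached) S3A client for the bucket and print its toString(),
// which on Hadoop 2.8+ includes the filesystem's IO statistics.
val fs = FileSystem.get(new URI("s3a://my-bucket/"),
  spark.sparkContext.hadoopConfiguration)
println(s"S3A filesystem stats: $fs")
```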

2018-04-13 warning: Do not try to drop the Hadoop 2.8+ hadoop-aws JAR onto the classpath along with the rest of the Hadoop 2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the hadoop JARs and their transitive dependencies.
