Does Spark support true column scans over parquet files in S3?


Problem description

One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of those, then it's possible to read only the data that stores those few columns, and skip the rest.
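
To make that column-pruning behaviour concrete, here is a minimal Scala sketch; the SparkSession setup, the s3a://my-bucket/wide-table/ path and the column names are hypothetical placeholders, not something from the question.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-column-pruning")   // placeholder app name
  .getOrCreate()

// Read a wide Parquet dataset but project only two columns.
val narrow = spark.read
  .parquet("s3a://my-bucket/wide-table/")   // hypothetical S3 path
  .select("user_id", "event_time")          // hypothetical column names

// In the physical plan, ReadSchema should list only the projected columns,
// which is how you can check that the scan skips the rest of the dataset.
narrow.explain()
```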

Presumably this feature works by reading a bit of metadata at the head of a parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.

Does anyone know whether spark's default parquet reader correctly implements this kind of selective seeking on S3? I think it's supported by S3, but there's a big difference between theoretical support and an implementation that properly exploits that support.

Recommended answer

This needs to be broken down:

  1. Does the Parquet code get the predicates from Spark? Yes.
  2. Does Parquet then attempt to selectively read only those columns, using the Hadoop FileSystem seek() + read() or readFully(position, buffer, length) calls? Yes.
  3. Does the S3 connector translate these file operations into efficient HTTP GET requests? In Amazon EMR: yes. In Apache Hadoop, you need hadoop 2.8 on the classpath and to properly set spark.hadoop.fs.s3a.experimental.fadvise=random to trigger random access (a configuration sketch follows this list).
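
As a sketch of point 3, this is one way that setting could be applied when building the session; the fs.s3a.experimental.fadvise key comes from the answer above, while the app name is a placeholder, and it assumes a Hadoop 2.8+ s3a connector on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes the Hadoop 2.8+ hadoop-aws / s3a classes are on the classpath.
val spark = SparkSession.builder()
  .appName("s3a-random-io")   // placeholder app name
  // Hint the S3A connector to use random (positioned-read friendly) IO
  // instead of opening a GET to the end of the file on every seek.
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()
```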

Hadoop 2.7 and earlier handle aggressive seek() calls around the file badly, because they always initiate a GET from the current offset to the end of the file, get surprised by the next seek, have to abort that connection, and reopen a new TCP/HTTPS 1.1 connection (slow, CPU heavy), again and again. That random IO pattern hurts bulk loading of things like .csv.gz, but is critical to getting ORC/Parquet performance.

You don't get the speedup with Hadoop 2.7's hadoop-aws JAR. If you need it, you need to update hadoop*.jar and its dependencies, or build Spark from scratch against Hadoop 2.8.

Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections &c. Helps you work out what's going on.
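
A minimal sketch of that trick, assuming Hadoop 2.8+ and a hypothetical s3a://my-bucket/ bucket; the toString() behaviour is what the answer describes, everything else here is a placeholder.

```scala
import java.net.URI
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-io-stats").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Grab the S3A filesystem client for the (hypothetical) bucket...
val fs = FileSystem.get(new URI("s3a://my-bucket/"), hadoopConf)

// ...run some work against it...
spark.read.parquet("s3a://my-bucket/wide-table/").select("user_id").count()

// ...then log the client: on Hadoop 2.8+ its toString() includes the filesystem
// IO statistics (bytes discarded in seeks, aborted connections, etc.).
println(s"S3A stats: $fs")
```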

2018-04-13 warning: Do not try to drop the Hadoop 2.8+ hadoop-aws JAR onto the classpath alongside the rest of the hadoop-2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the hadoop JARs and their transitive dependencies.
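
One way to keep those versions aligned, sketched for an sbt build; the 2.8.5 patch release is an assumption rather than something the answer specifies, so match it to whatever Hadoop 2.8+ version your cluster actually runs.

```scala
// build.sbt sketch: pin hadoop-client and hadoop-aws to the same release
// so the whole Hadoop JAR set (and its transitive dependencies) moves together.
val hadoopVersion = "2.8.5"   // assumed version; align with your cluster

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-aws"    % hadoopVersion
)
```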
