Would S3 Select speed up Spark analyses on Parquet files?


Question

You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.

Let's say we have a data lake of people with first_name, last_name and country columns.

If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the EC2 cluster to run the computation. This is really inefficient because we don't need all the last_name and country data to run this query.

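For comparison, a plain CSV read of that query might look like the sketch below (the bucket path and the header option are made up for the illustration, and spark is the active SparkSession, as in the snippet further down). The projection to first_name happens only after S3 has already shipped the full rows:

spark
  .read
  .option("header", "true")     // assumption: the CSV files carry a header row
  .csv("s3://bucket/people/")   // hypothetical path to the people data lake
  .select("first_name")         // pruning happens on the cluster, after the transfer
  .distinct()
  .count()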

If the data is stored as CSV files and you run the query with S3 select, then S3 will only transfer the data in the first_name column to run the query.

spark
  .read
  .format("s3select")
  .schema(...)
  .options(...)
  .load("s3://bucket/filename")
  .select("first_name")
  .distinct()
  .count()

If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 will only transfer the data in the first_name column to the EC2 cluster. Parquet is a columnar file format, and this is one of its main advantages.

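A minimal sketch of the Parquet case, assuming the same data has been rewritten as Parquet under a hypothetical path. Spark pushes the column projection down into the Parquet reader, so only the first_name column chunks are fetched:

spark
  .read
  .parquet("s3://bucket/people_parquet/")   // hypothetical path
  .select("first_name")                     // column pruning: only first_name pages are read from S3
  .distinct()
  .count()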

So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake because columnar file formats offer the S3 Select optimization out of the box.

I am not sure because a coworker is certain I am wrong and because S3 Select supports the Parquet file format. Can you please confirm that columnar file formats provide the main optimization offered by S3 Select?

Answer

This is an interesting question. I don't have any real numbers, though I have done the S3 Select binding code in the hadoop-aws module. Amazon EMR has some numbers, as does Databricks.

For CSV IO, yes: S3 Select will speed things up given aggressive filtering of the source data, e.g. many GB of data scanned but not much returned. Why? Although the read itself is slower, you save on the limited bandwidth to your VM.

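As a sketch of that aggressive-filtering case, reusing the s3select format string from the question (the exact format identifier, options and pushdown behaviour depend on the platform; the schema, path and predicate here are illustrative). Most rows are discarded inside S3, so only a small fraction of the bytes cross the network:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val peopleSchema = StructType(Seq(
  StructField("first_name", StringType),
  StructField("last_name", StringType),
  StructField("country", StringType)
))

spark
  .read
  .format("s3select")              // format name as used in the question
  .schema(peopleSchema)
  .load("s3://bucket/people/")     // hypothetical path: many GB of CSV
  .filter("country = 'IS'")        // highly selective predicate, assumed to be pushed to S3 Select
  .count()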

For Parquet though, the workers split a large file into parts and schedule the work across them (assuming a splittable compression format like snappy is used), so more than one worker can work on the same file. They also only read a fraction of the data (== the bandwidth saving is smaller), but they do seek around within that file (== the seek policy needs optimising, or you pay the cost of aborting and reopening HTTP connections).

I'm not convinced that S3-side Parquet reads can beat a Spark cluster if there's enough capacity in the cluster and you've tuned your S3 client settings (for s3a this means: seek policy, thread pool size, HTTP pool size) for performance too.

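For reference, those s3a knobs correspond to Hadoop configuration keys that can be passed through the Spark configuration. A sketch, with illustrative values rather than recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Seek policy: "random" suits columnar formats and avoids aborting/reopening HTTP connections.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  // Size of the s3a thread pool.
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  // Size of the HTTP connection pool to S3.
  .config("spark.hadoop.fs.s3a.connection.maximum", "128")
  .getOrCreate()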

Like I said though: I'm not sure. Numbers are welcome.
