Would S3 Select speed up Spark analyses on Parquet files?


Question

You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.

Let's say we have a data lake of people with first_name, last_name and country columns.

If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the EC2 cluster to run the computation. This is really inefficient because we don't need all the last_name and country data to run this query.

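For comparison, a plain CSV read of that query might look like the sketch below (the bucket path and the header option are made up for the illustration, and spark is the active SparkSession, as in the snippet further down). The projection to first_name happens only after S3 has already shipped the full rows:

spark
  .read
  .option("header", "true")     // assumption: the CSV files carry a header row
  .csv("s3://bucket/people/")   // hypothetical path to the people data lake
  .select("first_name")         // pruning happens on the cluster, after the transfer
  .distinct()
  .count()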

If the data is stored as CSV files and you run the query with S3 select, then S3 will only transfer the data in the first_name column to run the query.

spark
  .read
  .format("s3select")
  .schema(...)
  .options(...)
  .load("s3://bucket/filename")
  .select("first_name")
  .distinct()
  .count()

If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 will only transfer the data in the first_name column to the EC2 cluster. Parquet is a columnar file format, and this is one of its main advantages.

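A minimal sketch of the Parquet case, assuming the same data has been rewritten as Parquet under a hypothetical path. Spark pushes the column projection down into the Parquet reader, so only the first_name column chunks are fetched:

spark
  .read
  .parquet("s3://bucket/people_parquet/")   // hypothetical path
  .select("first_name")                     // column pruning: only first_name pages are read from S3
  .distinct()
  .count()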

So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake because columnar file formats offer the S3 Select optimization out of the box.

I am not sure because a coworker is certain I am wrong and because S3 Select supports the Parquet file format. Can you please confirm that columnar file formats provide the main optimization offered by S3 Select?

Answer

This is an interesting question. I don't have any real numbers, though I have done the S3 Select binding code in the hadoop-aws module. Amazon EMR has some numbers, as does Databricks.

For CSV IO, yes: S3 Select will speed things up given aggressive filtering of the source data, e.g. many GB of data scanned but not much returned. Why? Although the read itself is slower, you save on the limited bandwidth to your VM.

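As a sketch of that aggressive-filtering case, reusing the s3select format string from the question (the exact format identifier, options and pushdown behaviour depend on the platform; the schema, path and predicate here are illustrative). Most rows are discarded inside S3, so only a small fraction of the bytes cross the network:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val peopleSchema = StructType(Seq(
  StructField("first_name", StringType),
  StructField("last_name", StringType),
  StructField("country", StringType)
))

spark
  .read
  .format("s3select")              // format name as used in the question
  .schema(peopleSchema)
  .load("s3://bucket/people/")     // hypothetical path: many GB of CSV
  .filter("country = 'IS'")        // highly selective predicate, assumed to be pushed to S3 Select
  .count()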

For Parquet though, the workers split a large file into parts and schedule the work across them (assuming a splittable compression format like snappy is used), so more than one worker can work on the same file. They also only read a fraction of the data (== the bandwidth saving is smaller), but they do seek around within that file (== the seek policy needs optimising, or you pay the cost of aborting and reopening HTTP connections).

I'm not convinced that S3-side Parquet reads can beat a Spark cluster if there's enough capacity in the cluster and you've tuned your S3 client settings (for s3a this means: seek policy, thread pool size, HTTP pool size) for performance too.

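For reference, those s3a knobs correspond to Hadoop configuration keys that can be passed through the Spark configuration. A sketch, with illustrative values rather than recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Seek policy: "random" suits columnar formats and avoids aborting/reopening HTTP connections.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  // Size of the s3a thread pool.
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  // Size of the HTTP connection pool to S3.
  .config("spark.hadoop.fs.s3a.connection.maximum", "128")
  .getOrCreate()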

Like I said though: I'm not sure. Numbers are welcome.
