Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?

Problem Description

I have to load a CSV file from HDFS into a Spark DataFrame. I was wondering whether there is a "performance" improvement (query speed) from a DataFrame backed by a CSV file vs one backed by a Parquet file?

Typically, I load a CSV file like the following into a data frame.

val df1 = sqlContext.read
  .format("com.databricks.spark.csv") // spark-csv data source (Spark 1.x API)
  .option("header", "true")           // first line holds the column names
  .option("inferSchema", "true")      // costs an extra full pass over the file to infer types
  .load("hdfs://box/path/to/file.csv")

On the other hand, loading a Parquet file (assuming I've parsed the CSV file, created a schema, and saved it to HDFS) looks like the following.

val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
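
For completeness, the "saved it to HDFS" step assumed above could look like this minimal sketch, reusing df1 from the first snippet (the output path simply mirrors the one used in this question).

// Write the already-parsed DataFrame out as Parquet; the schema travels with the file
df1.write.parquet("hdfs://box/path/to/file.parquet")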

Now I'm wondering whether the execution times of queries like the following would be impacted and/or different.

  • df1.where("col1='some1'").count()
  • df1.where("col1='some1' and col2='some2'").count()

I'm wondering if anyone knows whether there is predicate pushdown for Parquet?
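
One way to check this yourself (a quick sketch, reusing df2 from above; the exact plan text varies across Spark versions) is to print the physical plan and look for the filters that were pushed down into the Parquet scan:

// Recent Spark versions print a "PushedFilters: [...]" entry for the Parquet
// scan, e.g. EqualTo(col1,some1), when the predicate can be pushed down
df2.where("col1 = 'some1'").explain()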

To me, it seems Parquet is somewhat like an inverted index, and it would be expected that simple count filters would be faster for a data frame based on Parquet than for one based on CSV. As for the CSV-backed data frame, I would imagine that a full data-set scan would have to occur each time we filter for items.

Any clarification on the query performance of CSV- versus Parquet-backed data frames is appreciated. Also, any file format that helps speed up query counts in data frames is welcome.

Recommended Answer

CSV is a row-oriented format, while Parquet is a column-oriented format.

Typically, row-oriented formats are more efficient for queries that must access most of the columns, or that read only a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only have to access a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries more often fall into the former.
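
As a rough illustration of the second case, using df2 and the column names from the question: with Parquet, a query that touches only a couple of columns reads only those columns from disk, while the same query over CSV must parse every row in full.

// Column pruning: only col1 and col2 are read from the Parquet file;
// every other column in the schema is skipped entirely
df2.select("col1", "col2").where("col1 = 'some1'").show()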

Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format. This makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more efficient compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.

Since the Hadoop ecosystem is geared toward analytical queries, Parquet is generally a better choice for performance than CSV for Hadoop applications.
