Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?
Question
I have to load a CSV file from HDFS into a DataFrame using Spark. I was wondering if there is a "performance" improvement (query speed) for a DataFrame backed by a CSV file vs. one backed by a Parquet file.
Typically, I load a CSV file like the following into a data frame.
val df1 = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("hdfs://box/path/to/file.csv")
On the other hand, loading a parquet file (assuming I've parsed the CSV file, created a schema, and saved it to HDFS) looks like the following.
val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
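For the one-time CSV-to-Parquet conversion the question assumes, a sketch could look like the following (same sqlContext and hypothetical HDFS paths as above; this needs a running Spark cluster, so take it as illustrative rather than a definitive recipe):

```scala
// Read the CSV once (schema inferred), then persist it as Parquet.
val csvDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://box/path/to/file.csv")

// DataFrameWriter.parquet writes a columnar, compressed copy;
// later reads then use sqlContext.read.parquet as shown above.
csvDf.write.parquet("hdfs://box/path/to/file.parquet")
```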
Now I'm wondering whether the execution times of queries like the following would be affected and/or differ.
- df1.where("col1 = 'some1'").count()
- df1.where("col1 = 'some1' and col2 = 'some2'").count()
I'm wondering if anyone knows whether there is predicate pushdown for Parquet?
To me, Parquet seems somewhat like an inverted index, and one would expect simple filter-then-count queries to be faster on a Parquet-backed data frame than on a CSV-backed one. As for the CSV-backed data frame, I would imagine a full data set scan has to occur every time we filter for items.
Any clarification on query performance for CSV-backed vs. Parquet-backed data frames is appreciated. Pointers to any file format that helps speed up query counts in data frames are also welcome.
Answer
CSV is a row-oriented format, while Parquet is a column-oriented format.
Typically, row-oriented formats are more efficient for queries that must access most of the columns or read only a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only access a fraction of the columns. Analytical queries typically fall in the latter category, while transactional queries more often fall in the former.
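A toy illustration of the difference (plain Scala, no Spark; the table contents are made up): with a row layout, a filter on one column still walks every full record, while a column layout lets the same scan touch only that one column.

```scala
// Toy model of the same 3-row table in both layouts (hypothetical data).
// Row-oriented: each record stores all of its fields together.
val rows = Seq(
  Map("col1" -> "some1", "col2" -> "a",     "col3" -> "x"),
  Map("col1" -> "other", "col2" -> "b",     "col3" -> "y"),
  Map("col1" -> "some1", "col2" -> "some2", "col3" -> "z")
)

// Column-oriented: each column is stored contiguously on its own.
val columns = Map(
  "col1" -> Seq("some1", "other", "some1"),
  "col2" -> Seq("a", "b", "some2"),
  "col3" -> Seq("x", "y", "z")
)

// Row layout: counting col1 = 'some1' touches every field of every row.
val rowCount = rows.count(r => r("col1") == "some1")

// Column layout: the same count reads only the col1 column.
val colCount = columns("col1").count(_ == "some1")

// Both counts are 2, but the columnar scan read a third of the data.
```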
Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format; this makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more effective compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.
Since the Hadoop ecosystem is geared toward analytical queries, Parquet is generally a better choice for performance than CSV for Hadoop applications.
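On the asker's pushdown question: Spark does support predicate pushdown for Parquet, via per-row-group min/max statistics rather than an inverted index. It is governed by a Spark SQL setting; a hedged sketch, assuming the df2 from the question (and a running Spark cluster):

```scala
// Parquet filter pushdown is controlled by this Spark SQL setting
// (enabled by default in modern Spark; some early 1.x releases
// required turning it on explicitly).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// With pushdown enabled, a filter like the one below can skip whole
// Parquet row groups whose min/max statistics rule out col1 = 'some1'.
df2.where("col1 = 'some1'").count()
```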