Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?

Problem Description

I have to load a CSV file from HDFS into a Spark DataFrame. I was wondering whether there is a "performance" improvement (query speed) from a DataFrame backed by a CSV file vs one backed by a Parquet file?

Typically, I load a CSV file like the following into a data frame.

val df1 = sqlContext.read
  .format("com.databricks.spark.csv") // spark-csv data source (Spark 1.x API)
  .option("header", "true")           // first line holds the column names
  .option("inferSchema", "true")      // costs an extra full pass over the file to infer types
  .load("hdfs://box/path/to/file.csv")

On the other hand, loading a Parquet file (assuming I've parsed the CSV file, created a schema, and saved it to HDFS) looks like the following.

val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
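
For completeness, the "saved it to HDFS" step assumed above could look like this minimal sketch, reusing df1 from the first snippet (the output path simply mirrors the one used in this question).

// Write the already-parsed DataFrame out as Parquet; the schema travels with the file
df1.write.parquet("hdfs://box/path/to/file.parquet")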

Now I'm wondering whether the execution times of queries like the following would be impacted and/or different.

  • df1.where("col1='some1'").count()
  • df1.where("col1='some1' and col2='some2'").count()

I'm wondering if anyone knows whether there is predicate pushdown for Parquet?
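
One way to check this yourself (a quick sketch, reusing df2 from above; the exact plan text varies across Spark versions) is to print the physical plan and look for the filters that were pushed down into the Parquet scan:

// Recent Spark versions print a "PushedFilters: [...]" entry for the Parquet
// scan, e.g. EqualTo(col1,some1), when the predicate can be pushed down
df2.where("col1 = 'some1'").explain()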

To me, it seems Parquet is somewhat like an inverted index, and it would be expected that simple count filters would be faster for a data frame based on Parquet than for one based on CSV. As for the CSV-backed data frame, I would imagine that a full data-set scan would have to occur each time we filter for items.

Any clarification on the query performance of CSV- versus Parquet-backed data frames is appreciated. Also, any file format that helps speed up query counts in data frames is welcome.

Recommended Answer

CSV is a row-oriented format, while Parquet is a column-oriented format.

Typically, row-oriented formats are more efficient for queries that must access most of the columns, or that read only a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only have to access a fraction of the columns. Analytical queries typically fall into the latter category, while transactional queries more often fall into the former.
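
As a rough illustration of the second case, using df2 and the column names from the question: with Parquet, a query that touches only a couple of columns reads only those columns from disk, while the same query over CSV must parse every row in full.

// Column pruning: only col1 and col2 are read from the Parquet file;
// every other column in the schema is skipped entirely
df2.select("col1", "col2").where("col1 = 'some1'").show()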

Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format. This makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more efficient compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.

Since the Hadoop ecosystem is geared toward analytical queries, Parquet is generally a better choice for performance than CSV for Hadoop applications.
