Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?
Question
I have to load a CSV file from HDFS into a DataFrame using Spark. I was wondering if there is a "performance" improvement (query speed) for a DataFrame backed by a CSV file vs. one backed by a Parquet file.
Typically, I load a CSV file like the following into a data frame.
val df1 = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("hdfs://box/path/to/file.csv")
On the other hand, loading a parquet file (assuming I've parsed the CSV file, created a schema, and saved it to HDFS) looks like the following.
val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
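For the one-time CSV-to-Parquet conversion the question assumes, a sketch could look like the following (same sqlContext and hypothetical HDFS paths as above; this needs a running Spark cluster, so take it as illustrative rather than a definitive recipe):

```scala
// Read the CSV once (schema inferred), then persist it as Parquet.
val csvDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://box/path/to/file.csv")

// DataFrameWriter.parquet writes a columnar, compressed copy;
// later reads then use sqlContext.read.parquet as shown above.
csvDf.write.parquet("hdfs://box/path/to/file.parquet")
```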
Now I'm wondering whether the execution times of queries like the following would be affected and/or differ.
- df1.where("col1 = 'some1'").count()
- df1.where("col1 = 'some1' and col2 = 'some2'").count()
I'm wondering if anyone knows whether there is predicate pushdown for Parquet?
To me, Parquet seems somewhat like an inverted index, and one would expect simple filter-then-count queries to be faster on a Parquet-backed data frame than on a CSV-backed one. As for the CSV-backed data frame, I would imagine a full data set scan has to occur every time we filter for items.
Any clarification on query performance for CSV-backed vs. Parquet-backed data frames is appreciated. Pointers to any file format that helps speed up query counts in data frames are also welcome.
Answer
CSV is a row-oriented format, while Parquet is a column-oriented format.
Typically, row-oriented formats are more efficient for queries that must access most of the columns or read only a fraction of the rows. Column-oriented formats, on the other hand, are usually more efficient for queries that need to read most of the rows but only access a fraction of the columns. Analytical queries typically fall in the latter category, while transactional queries more often fall in the former.
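A toy illustration of the difference (plain Scala, no Spark; the table contents are made up): with a row layout, a filter on one column still walks every full record, while a column layout lets the same scan touch only that one column.

```scala
// Toy model of the same 3-row table in both layouts (hypothetical data).
// Row-oriented: each record stores all of its fields together.
val rows = Seq(
  Map("col1" -> "some1", "col2" -> "a",     "col3" -> "x"),
  Map("col1" -> "other", "col2" -> "b",     "col3" -> "y"),
  Map("col1" -> "some1", "col2" -> "some2", "col3" -> "z")
)

// Column-oriented: each column is stored contiguously on its own.
val columns = Map(
  "col1" -> Seq("some1", "other", "some1"),
  "col2" -> Seq("a", "b", "some2"),
  "col3" -> Seq("x", "y", "z")
)

// Row layout: counting col1 = 'some1' touches every field of every row.
val rowCount = rows.count(r => r("col1") == "some1")

// Column layout: the same count reads only the col1 column.
val colCount = columns("col1").count(_ == "some1")

// Both counts are 2, but the columnar scan read a third of the data.
```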
Additionally, CSV is a text-based format, which cannot be parsed as efficiently as a binary format; this makes CSV even slower. A typical column-oriented format, on the other hand, is not only binary but also allows more effective compression, which leads to smaller disk usage and faster access. I recommend reading the Introduction section of The Design and Implementation of Modern Column-Oriented Database Systems.
Since the Hadoop ecosystem is geared toward analytical queries, Parquet is generally a better choice for performance than CSV for Hadoop applications.
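On the asker's pushdown question: Spark does support predicate pushdown for Parquet, via per-row-group min/max statistics rather than an inverted index. It is governed by a Spark SQL setting; a hedged sketch, assuming the df2 from the question (and a running Spark cluster):

```scala
// Parquet filter pushdown is controlled by this Spark SQL setting
// (enabled by default in modern Spark; some early 1.x releases
// required turning it on explicitly).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// With pushdown enabled, a filter like the one below can skip whole
// Parquet row groups whose min/max statistics rule out col1 = 'some1'.
df2.where("col1 = 'some1'").count()
```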