Spark reading Partitioned avro significantly slower than pointing to exact location


Problem description

I am trying to read partitioned Avro data which is partitioned based on Year, Month and Day and that seems to be significantly slower than pointing it directly to the path. In the Physical plan I can see that the partition filters are getting passed on, so it is not scanning the entire set of directories but still it is significantly slower.

E.g. reading the partitioned data like this

profitLossPath="abfss://raw@"+datalakename+".dfs.core.windows.net/datawarehouse/CommercialDM.ProfitLoss/"
 
profitLoss = spark.read.\
    format("com.databricks.spark.avro").\
    option("header", "false").\
    option("inferSchema", "false").load(profitLossPath)
 
profitLoss.createOrReplaceTempView("ProfitLosstt")

df=sqlContext.sql("SELECT * \
                             FROM ProfitLosstt \
                             where Year= " + year + " and Month=" + month_nz + " and Day=" + date_nz )

This takes around 3 minutes to complete,

whereas this where I point to the exact location using a string builder, gets done in 2 secs

profitLossPath="abfss://raw@"+datalakename+".dfs.core.windows.net/datawarehouse/CommercialDM.ProfitLoss/Year=" +year +"/Month=" + month_nz + "/Day=" + date_nz
 
profitLoss = spark.read.\
    format("com.databricks.spark.avro").\
    option("header", "false").\
    option("inferSchema", "false").load(profitLossPath)

 
profitLoss.createOrReplaceTempView("ProfitLosstt")

df=sqlContext.sql("SELECT * \
                             FROM ProfitLosstt "
                              )
                  
display(df)
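The exact-location read above builds the path by plain string concatenation. A small helper (hypothetical, not from the post; it assumes Hive-style partition directories with non-zero-padded values, e.g. `Month=6` rather than `Month=06`) makes that construction less error-prone:

```python
def partition_path(base, year, month, day):
    """Build the Year=/Month=/Day= partition path under the table root.

    Hypothetical helper: assumes Hive-style partition directories with
    non-zero-padded values, e.g. .../Year=2020/Month=6/Day=26.
    """
    base = base.rstrip("/")  # avoid a double slash when base ends with "/"
    return f"{base}/Year={int(year)}/Month={int(month)}/Day={int(day)}"

root = "abfss://raw@datalakename.dfs.core.windows.net/datawarehouse/CommercialDM.ProfitLoss/"
profitLossPath = partition_path(root, 2020, 6, 26)
# profitLossPath can then be passed to spark.read...load(profitLossPath)
```

Casting through `int()` also normalizes string inputs like `"06"` to the unpadded form the directory names use.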

Looking at the physical plan for the 1st one (slower) does show that the partition filter is passed on

What could explain the discovery phase taking this long?

Any questions and I can elaborate.

Answer

OK, the reason for the slowness was the building of the InMemoryFileIndex.

Though partition pruning takes place, Spark still needs to discover the partition and file information for the whole table, and that is the step that takes the time. This SO post elaborates on it: here
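A rough sense of why that discovery step dominates, using assumed numbers (not from the post): with Year/Month/Day partitioning, building the InMemoryFileIndex means listing every leaf directory under the table root once, even though pruning later discards almost all of them, whereas the exact-path read lists a single directory.

```python
# Illustrative arithmetic with assumed figures: how many remote directory
# listings the full-table file index needs versus the exact-path read.
years, months_per_year, days_per_month = 3, 12, 30
full_index_listings = years * months_per_year * days_per_month  # every leaf dir
exact_path_listings = 1                                         # one leaf dir
print(full_index_listings, "vs", exact_path_listings)
```

On a remote object store like ADLS, each listing is a network call, so the gap between ~1000 calls and 1 call is exactly the minutes-versus-seconds difference observed above.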

So the idea was to create an external table, so that this partition information is built once and persisted. I did that using a script like this (I used an inline schema; you can use a schema file if you have one):

create external table ProfitLossAvro
partitioned by (Year int, Month int, Day int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'abfss://raw@datalakename.dfs.core.windows.net/datawarehouse/CommercialDM.ProfitLoss/'

TBLPROPERTIES (
    'avro.schema.literal'='{
      "name": "Microsoft.Hadoop.Avro.Specifications.ProfitLoss",
      "type": "record",
      "fields": [{ "name":"MK_DatesID_TradeDate", "type":["int", "null"]},{ "name":"MK_UCRAccountsID_AccountID", "type":["int", "null"]},{ "name":"MK_ProductCategoriesID_ProductCategoryID", "type":["int", "null"]},{ "name":"CurrencyCode", "type":["string", "null"]},{ "name":"ProfitLoss", "type":["double", "null"]},{ "name":"MK_PnLAmountTypesID_PLBookingTypeID", "type":["int", "null"]}]
    }');

But if you then query this table, you will get 0 rows, because the existing partitions are not added automatically. You can add them using

msck repair table ProfitLossAvro

and every time you add data to your data lake, you can register the new partition. Something like this:
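This registration step can be scripted after each load. A sketch that builds the statement string (hypothetical helper, not from the post; in a notebook you would pass the result to `spark.sql(...)`):

```python
def add_partition_sql(table, year, month, day):
    """Build an ALTER TABLE ... ADD PARTITION statement for a table
    partitioned by Year/Month/Day. IF NOT EXISTS makes the statement
    safe to re-run after a reload of the same day."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (Year={int(year)}, Month={int(month)}, Day={int(day)})"
    )

stmt = add_partition_sql("ProfitLossAvro", 2020, 6, 26)
# spark.sql(stmt)  # run in the Spark session after the new data lands
```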

ALTER TABLE ProfitLossAvro ADD PARTITION (Year=2020, Month=6, Day=26)

If you then query your data using a command like the below, it will work much faster

df=sqlContext.sql("select * \
               from ProfitLossAvro \
               where Year=" + year + " and Month=" + month_nz + " and Day=" + date_nz)

display(df)

