AWS Glue pushdown predicate not working properly

Question

I'm trying to optimize my Glue/PySpark job by using pushdown predicates.

from datetime import date
from pyspark.sql import Row

# Note: generate_date_series is a UDF registered elsewhere in the job;
# today and arr (a list of destination place ids) are also defined elsewhere.
start = date(2019, 2, 13)
end = date(2019, 2, 27) 
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
    Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")

relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()

flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
    querydatetime BETWEEN '%s' AND '%s'
    AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))

However, it appears that Glue still attempts to read data outside the specified date range:

INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.

Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior". Is something wrong?

I wonder if the problem is that it does not support BETWEEN or IN inside the predicate?
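
One way to test that guess, as a minimal sketch: spell the same condition out without BETWEEN or IN, using only plain comparisons and OR, and see whether the logs still show partitions outside the range. The destination place ids below are just illustrative placeholders:

# Sketch only: same call, but the predicate avoids BETWEEN and IN.
predicate = (
    "querydatetime >= '2019-02-13' AND querydatetime <= '2019-02-27' "
    "AND (querydestinationplace = '12191' OR querydestinationplace = '12345')"  # placeholder ids
)
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database="xxx",
    table_name="flights",
    transformation_ctx="flights",
    push_down_predicate=predicate)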

Table creation DDL

CREATE EXTERNAL TABLE `flights`(
  `id` string, 
  `querytaskid` string, 
  `queryoriginplace` string, 
  `queryoutbounddate` string, 
  `queryinbounddate` string, 
  `querycabinclass` string, 
  `querycurrency` string, 
  `agent` string, 
  `quoteageinminutes` string, 
  `price` string, 
  `outboundlegid` string, 
  `inboundlegid` string, 
  `outdeparture` string, 
  `outarrival` string, 
  `outduration` string, 
  `outjourneymode` string, 
  `outstops` string, 
  `outcarriers` string, 
  `outoperatingcarriers` string, 
  `numberoutstops` string, 
  `numberoutcarriers` string, 
  `numberoutoperatingcarriers` string, 
  `indeparture` string, 
  `inarrival` string, 
  `induration` string, 
  `injourneymode` string, 
  `instops` string, 
  `incarriers` string, 
  `inoperatingcarriers` string, 
  `numberinstops` string, 
  `numberincarriers` string, 
  `numberinoperatingcarriers` string)
PARTITIONED BY ( 
  `querydestinationplace` string, 
  `querydatetime` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://pinfare-glue/flights/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='pinfare-parquet', 
  'averageRecordSize'='19', 
  'classification'='parquet', 
  'compressionType'='none', 
  'objectCount'='623609', 
  'recordCount'='4368434222', 
  'sizeKey'='86509997099', 
  'typeOfData'='file')

Answer

In order to push down your condition, you need to change the order of the columns in the PARTITIONED BY clause of the table definition.

A condition with an "in" predicate on the first partition column cannot be pushed down the way you are expecting.
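
For illustration, a minimal sketch of what the read could look like after re-creating the table (and re-laying-out the S3 data) with PARTITIONED BY (querydatetime, querydestinationplace), so that the range filter sits on the leading partition column; the database and table names follow the question, and the destination ids are taken from the logs purely as placeholders:

# Sketch only: assumes the table has been recreated with querydatetime as the
# first partition column and querydestinationplace as the second.
predicate = (
    "querydatetime BETWEEN '2019-02-13' AND '2019-02-27' "
    "AND querydestinationplace IN (12191, 17965)"  # example ids from the logs
)
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database="xxx",
    table_name="flights",
    transformation_ctx="flights",
    push_down_predicate=predicate)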

