EMR-Presto和Athena之间的查询结果差异 [英] Query results difference between EMR-Presto and Athena
问题描述
我已将Glue目录连接到Athena和一个EMR实例(已预先安装)。我尝试在两者上运行相同的查询,但结果不同。 EMR提供0行,但Athena提供43行。使用左联接
, group by
和个不同的计数。查询看起来像这样:
I have connected Glue catalog to Athena and an EMR instance (with presto installed). I tried running the same query on both but am getting different results. EMR is giving 0 rows but Athena is giving 43 rows. The query is pretty simple with a left join
, group by
and a count distinct
. The query looks like this:
select
t1.customer_id as id,
t2.purchase_date as purchase_date,
count(distinct t1.purchase_id) as item_count
from
table1 t1
left join
table2 as t2
on t2.purchase_id=t1.purchase_id
where
t1.item_type='ABC'
and t1.purchase_status='CONFIRMED'
and t1.region_id in ('A','B','C')
and t2.status='Dispatched'
and t2.purchase_date between date_add('day',-50,date('2018-09-13')) and date('2018-09-13')
and t1.created_at between date_add('day',-60,date('2018-09-13')) and date('2018-09-13')
and t1.updated_at between date_add('day',-60,date('2018-09-13')) and date('2018-09-13')
group by
t1.customer_id,t2.purchase_date;
我尝试了其他一些查询,但结果完全匹配。不确定此查询出了什么问题。
I tried some other queries but the results completely match. Not sure what is wrong with this query.
EMR Version: 5.17.0
Presto Version: 0.206
编辑:我意识到问题出在第一个表本身中。由于某种原因,Presto-EMR无法在 table1
中找到任何行。由于Presto-EMR和Athena都使用相同的Glue目录,因此不确定为什么会发生这种情况。我也在相同的EMR实例中尝试了Hive,它能够在 table1
中找到行。
I realised that the issue is in the first table itself. Presto-EMR is not able to find any rows in table1
for some reason. Not sure why this would happen since both Presto-EMR and Athena are using the same Glue catalog. I also tried Hive in the same EMR instance and it is able to find rows in table1
.
select * from table1 limit 10;
上面的语句使用hive-sql给出10行,而使用presto-sql给出0行。我在调试模式下看到以下异常:
The above statement gives 10 rows with hive-sql but zero rows with presto-sql. I see the following exception in debug mode:
Query 20180917_075536_00023_4988g failed: com.facebook.presto.spi.type.TimestampType
java.lang.UnsupportedOperationException: com.facebook.presto.spi.type.TimestampType
at com.facebook.presto.spi.type.AbstractType.writeSlice(AbstractType.java:135)
at com.facebook.presto.hive.parquet.reader.ParquetBinaryColumnReader.readValue(ParquetBinaryColumnReader.java:55)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.lambda$readValues$1(ParquetPrimitiveColumnReader.java:184)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.processValues(ParquetPrimitiveColumnReader.java:204)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readValues(ParquetPrimitiveColumnReader.java:183)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readPrimitive(ParquetPrimitiveColumnReader.java:171)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:208)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:258)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readBlock(ParquetReader.java:241)
at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:244)
at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:222)
at com.facebook.presto.spi.block.LazyBlock.assureLoaded(LazyBlock.java:262)
at com.facebook.presto.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:253)
at com.facebook.presto.spi.Page.getLoadedPage(Page.java:247)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:245)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
推荐答案
Presto默认情况下将Parquet中的字段与基于位置。如果您的字段顺序不同(例如,随着时间的推移写的不同),则需要启用按名称匹配。您可以使用 hive.properties
:
Presto by default matches fields in Parquet with table schema based on position. If order of your fields varies (eg. has been written differently over the time), you need to enable matching by name. You can do this with hive.properties
:
hive.parquet.use-column-names = true
或在会话级别:
设置会话hive.parquet_use_column_names = true;
set session hive.parquet_use_column_names = true;
这是一个相关问题:> https://github.com/prestodb/presto/issues/8911
这篇关于EMR-Presto和Athena之间的查询结果差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!