EMR-Presto和Athena之间的查询结果差异 [英] Query results difference between EMR-Presto and Athena

查看:447
本文介绍了EMR-Presto和Athena之间的查询结果差异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已将Glue目录连接到Athena和一个EMR实例(已预先安装)。我尝试在两者上运行相同的查询,但结果不同。 EMR提供0行,但Athena提供43行。使用左联接 group by 个不同的计数。查询看起来像这样:

I have connected Glue catalog to Athena and an EMR instance (with presto installed). I tried running the same query on both but am getting different results. EMR is giving 0 rows but Athena is giving 43 rows. The query is pretty simple with a left join, group by and a count distinct. The query looks like this:

select
  t1.customer_id as id,
  t2.purchase_date as purchase_date,
  count(distinct t1.purchase_id) as item_count
from 
  table1 t1
left join
  table2 as t2
  on t2.purchase_id=t1.purchase_id
where 
  t1.item_type='ABC' 
  and t1.purchase_status='CONFIRMED' 
  and t1.region_id in ('A','B','C')
  and t2.status='Dispatched'
  and t2.purchase_date between date_add('day',-50,date('2018-09-13')) and date('2018-09-13')
  and t1.created_at between date_add('day',-60,date('2018-09-13')) and date('2018-09-13')
  and t1.updated_at between date_add('day',-60,date('2018-09-13')) and date('2018-09-13')
group by
  t1.customer_id,t2.purchase_date;

我尝试了其他一些查询,但结果完全匹配。不确定此查询出了什么问题。

I tried some other queries but the results completely match. Not sure what is wrong with this query.

EMR Version: 5.17.0
Presto Version: 0.206

编辑:我意识到问题出在第一个表本身中。由于某种原因,Presto-EMR无法在 table1 中找到任何行。由于Presto-EMR和Athena都使用相同的Glue目录,因此不确定为什么会发生这种情况。我也在相同的EMR实例中尝试了Hive,它能够在 table1 中找到行。

I realised that the issue is in the first table itself. Presto-EMR is not able to find any rows in table1 for some reason. Not sure why this would happen since both Presto-EMR and Athena are using the same Glue catalog. I also tried Hive in the same EMR instance and it is able to find rows in table1.

select * from table1 limit 10;

上面的语句使用hive-sql给出10行,而使用presto-sql给出0行。我在调试模式下看到以下异常:

The above statement gives 10 rows with hive-sql but zero rows with presto-sql. I see the following exception in debug mode:

Query 20180917_075536_00023_4988g failed: com.facebook.presto.spi.type.TimestampType
java.lang.UnsupportedOperationException: com.facebook.presto.spi.type.TimestampType
    at com.facebook.presto.spi.type.AbstractType.writeSlice(AbstractType.java:135)
    at com.facebook.presto.hive.parquet.reader.ParquetBinaryColumnReader.readValue(ParquetBinaryColumnReader.java:55)
    at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.lambda$readValues$1(ParquetPrimitiveColumnReader.java:184)
    at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.processValues(ParquetPrimitiveColumnReader.java:204)
    at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readValues(ParquetPrimitiveColumnReader.java:183)
    at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readPrimitive(ParquetPrimitiveColumnReader.java:171)
    at com.facebook.presto.hive.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:208)
    at com.facebook.presto.hive.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:258)
    at com.facebook.presto.hive.parquet.reader.ParquetReader.readBlock(ParquetReader.java:241)
    at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:244)
    at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:222)
    at com.facebook.presto.spi.block.LazyBlock.assureLoaded(LazyBlock.java:262)
    at com.facebook.presto.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:253)
    at com.facebook.presto.spi.Page.getLoadedPage(Page.java:247)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:245)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
    at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


推荐答案

Presto默认情况下将Parquet中的字段与基于位置。如果您的字段顺序不同(例如,随着时间的推移写的不同),则需要启用按名称匹配。您可以使用 hive.properties

Presto by default matches fields in Parquet with table schema based on position. If order of your fields varies (eg. has been written differently over the time), you need to enable matching by name. You can do this with hive.properties:

hive.parquet.use-column-names = true

或在会话级别:

设置会话hive.parquet_use_column_names = true;

set session hive.parquet_use_column_names = true;

这是一个相关问题:> https://github.com/prestodb/presto/issues/8911

这篇关于EMR-Presto和Athena之间的查询结果差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆