AWS DMS:如何在Presto/Athena中处理TIMESTAMP_MICROS拼花地板字段 [英] AWS DMS: How to handle TIMESTAMP_MICROS parquet fields in Presto/Athena

查看:86
本文介绍了AWS DMS:如何在Presto/Athena中处理TIMESTAMP_MICROS拼花地板字段的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当前,我们有一个DMS任务,该任务将获取MySQL DB的内容并将文件以镶木地板格式转储到S3.

Presently, we have a DMS task that will take the contents of a MySQL DB and dump files to S3 in parquet format.

实木复合地板中时间戳的格式最终为TIMESTAMP_MICROS.

The format for the timestamps in parquet ends up being TIMESTAMP_MICROS.

这是Presto(雅典娜的基本实现)的问题不支持以毫秒为单位的时间戳,并假设所有时间戳都以毫秒为单位.

This is a problem as Presto (the underlying implementation of Athena) does not support timestamps in microsecond precision and makes the assumption that all timestamps are in millisecond precision.

这不会直接导致任何错误,但会因为将微秒数解释为毫秒数而使时间显示为某个极端的未来日期.

This does not cause any errors directly but it makes the times display as some extreme future date as it is interpreting the number of microseconds as number of milliseconds.

我们目前正在解决此问题,方法是在Athena表之上创建Athena视图,并使用这些时间戳,将它们转换为unix时间戳(即数字),除以1000转换为毫秒分辨率,然后转换回a时间戳.

We are currently working around this by creating Athena views on top of the Athena tables that take these these timestamps, convert them to a unix timestamp (i.e. number), divide by 1000 to convert to millisecond resolution and then convert back to a timestamp.

这对于概念验证和确定问题很好,但是这对我们不起作用,因为我们有很多带有很多时间戳列的表,并且不想在顶部引入这个脆弱的视图层.

This was fine for a proof of concept and identifying the problem but this won't work for us as we have lots of tables with lots of timestamp columns and don't want to introduce this fragile view layer on top.

为Athena和DMS团队均创建了支持票证,它们均确认了问题并在其端创建了功能请求,以针对此不兼容的问题创建本机解决方案,但任何支持工程师都无法提供内置的解决方法.DMS似乎不支持将列类型更改为毫秒精度或varchar字符串.对于雅典娜来说,这只是Presto记录在案的限制.

Support tickets were created for both the Athena and DMS teams and both confirmed the problem and created feature requests on their end to create a native solution for this incompatible but neither support engineer could offer a built-in workaround. DMS does not seem to support changing the column type to be millisecond precision or a varchar string. For Athena, this is simply a documented Presto limitation.

有人以前遇到过这种情况并找到了解决方案吗?

Has anyone run into this before and found a solution?

更改MySQL数据库的精度并不是一个真正的选择,因为许多客户端都在使用它,并且我们不想冒险破坏接口.

Changing the precision in the MySQL database is not really an option as it is used by many clients and we don't want to risk breaking the interface.

我正在考虑创建一个作业,该作业将在创建新镶木地板时运行,以查找TIMESTAMP_MICRO字段并将其列为TIMESTAMP_MILLIS.不确定是否要执行此操作的最佳方法(胶水/火花作业?)

I am considering creating a job that will run anytime a new parquet is created to look for TIMESTAMP_MICRO fields and column them to TIMESTAMP_MILLIS. Not exactly sure the best way to go about this (Glue/Spark job?)

DMS非常棒,并且使转储到镶木地板非常简单,可能还有其他方法可以从MySQL提取数据.还有其他无需大量定制开发就可以做到这一点的工具吗?

DMS was great and has made the dump to parquet super simple, there may be other ways to extract the data from MySQL. Any other tools that can do this without much custom development?

推荐答案

使用目标端点,可以设置额外的连接属性".如果将 parquetTimestampInMillisecond 设置为 true ,则时间戳将以毫秒为单位.

With the target endpoint, you can set "Extra Connection Attributes". If you set parquetTimestampInMillisecond to true, the timestamps will be in milliseconds.

来源

这篇关于AWS DMS:如何在Presto/Athena中处理TIMESTAMP_MICROS拼花地板字段的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆