Athena returns wrong values for timestamp fields in parquet files


Question

I am mostly reproducing here an issue that I have seen raised on forum.aws, in the hope that the answers and explanations from the Stack Overflow community will be more thorough and illuminating than the discussion on that forum.

Here is my experience of the issue: I write a parquet file from a dataframe in Python using pandas, and cast a column, say birthday, to datetime64[ns] using pandas.to_datetime. This part of the process seems flawless: I can read the parquet file back with pandas.read_parquet and get what I expect, namely the dates I entered as datetimes. However, when I upload that parquet file to AWS and put an Athena table on it, reading the same birthday column yields junk dates that in no way match the ones in the parquet file. For example:

import pandas

t = pandas.DataFrame([['Haiti',pandas.to_datetime('1804-01-01')]],columns=['Country','Independence'])
t.to_parquet("s3://<mybucket>/tmp/t.parquet")

|Country | Independence|
|--------|-------------|
|Haiti   | 1804-01-01  |

CREATE EXTERNAL TABLE IF NOT EXISTS default.mytable (
  `Country` string,
  `Independence` timestamp 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://<mybucket>/tmp/'
TBLPROPERTIES ('has_encrypted_data'='false');

SELECT * FROM "default"."mytable" limit 10;

|Country | Independence             |
|--------|--------------------------|
|Haiti   |-164033-12-18 00:00:00.000|

Recommended answer

You can force to_parquet to write timestamps in a format Athena understands by passing "coerce_timestamps" (pandas forwards this keyword to the pyarrow engine):

import pandas

t = pandas.DataFrame([['Haiti',pandas.to_datetime('1804-01-01')]],columns=['Country','Independence'])
t.to_parquet("s3://<mybucket>/tmp/t.parquet", coerce_timestamps='ms')

|Country | Independence|
|--------|-------------|
|Haiti   | 1804-01-01  |

CREATE EXTERNAL TABLE IF NOT EXISTS default.mytable (
  `Country` string,
  `Independence` timestamp 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://<mybucket>/tmp/'
TBLPROPERTIES ('has_encrypted_data'='false');

SELECT * FROM "default"."mytable" limit 10;

|Country | Independence          |
|--------|-----------------------|
|Haiti   |1804-01-01 00:00:00.000|
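
Why the original query returned year -164033 is not explained above, but the number is consistent with a nanosecond count being read as if it were microseconds. The sketch below is only a back-of-the-envelope check of that assumption, not a description of Athena's internals. The helper name year_if_ns_read_as_us and the mean-year constant are illustrative choices, not part of any library:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
# Mean length of a Gregorian year in seconds (approximation for the estimate).
SECONDS_PER_AVG_YEAR = 365.2425 * 86400


def year_if_ns_read_as_us(dt: datetime) -> float:
    """Estimate the calendar year a reader would report if it took the
    nanoseconds-since-epoch integer to be microseconds (assumed failure mode)."""
    # Exact integer nanoseconds since the Unix epoch.
    ns_since_epoch = ((dt - EPOCH) // timedelta(microseconds=1)) * 1000
    # Misreading: the same integer treated as a count of microseconds.
    seconds = ns_since_epoch / 1_000_000
    return 1970 + seconds / SECONDS_PER_AVG_YEAR


approx_year = year_if_ns_read_as_us(datetime(1804, 1, 1))
print(approx_year)
```

The computed value falls within year -164033, the same year Athena displayed for the uncoerced file, which fits the idea that coercing to milliseconds sidesteps a unit mismatch.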
