雅典娜无法使用OpenCSVSerde解析日期 [英] Athena unable to parse date using OpenCSVSerde

查看:170
本文介绍了雅典娜无法使用OpenCSVSerde解析日期的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在S3上有一个非常简单的csv文件

I have a very simple csv file on S3

"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"

我正在尝试使用以下命令在此表上创建

I'm trying to create a table across this using the following command

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

查询表时(从测试中选择* )我遇到这样的错误:

When I query the table (select * from test) I'm getting an error like this:


HIVE_BAD_DATA:

解析字段值'2018- 01-01'for field 1:对于输入字符串: 2018-01-01

HIVE_BAD_DATA:
Error parsing field value '2018-01-01' for field 1: For input string: "2018-01-01"

更多信息:


  • 如果我将 d 列更改为字符串,查询将成功

  • 我以前使用Athena解析了文本文件中的日期;我相信使用 LazySimpleSerDe

  • 肯定似乎就像OpenCSVSerde的问题

  • If I change the d column to a string the query will succeed
  • I've previously parsed dates in text files using Athena; I believe using LazySimpleSerDe
  • Definitely seems like a problem with the OpenCSVSerde

文档明确表示支持该功能。寻找任何遇到此问题或任何建议的人。

The documentation definitely implies that this is supported. Looking for anyone who has encountered this, or any suggestions.

推荐答案

实际上,文档。您可能正在引用此摘录:

In fact, it is a problem with the documentation that you mentioned. You were probably referring to this excerpt:


[OpenCSVSerDe]可以识别DATE类型,如果它以UNIX
格式指定,像YYYY-MM-DD这样的LONG类型。

[OpenCSVSerDe] recognizes the DATE type if it is specified in the UNIX format, such as YYYY-MM-DD, as the type LONG.

很明显,您将日期格式化为YYYY-MM-DD。但是,该文档在该句子中极具误导性。当指的是UNIX格式时,实际上要记住 UNIX时代

Understandably, you were formatting your date as YYYY-MM-DD. However, the documentation is deeply misleading in that sentence. When it refers to UNIX format, it actually has UNIX Epoch Time in mind.

根据UNIX Epoch的定义,您的日期应为整数(因此在文档中引用了LONG类型)。您的日期应该是自1970年1月1日起经过的天数。

Based on the definition of UNIX Epoch, your dates should be integers (hence the reference to the type LONG in the documentation). Your dates should be the number of days that have elapsed since January 1, 1970.

例如,您的示例CSV应该如下所示:

For instance, your sample CSV should look like this:

"i","d","f","s"
"1","17532","1.001","something great!"
"2","17533","2.002","something terrible!"
"3","17534","3.003","I'm an oil man"

然后,您可以运行相同的命令:

Then you can run that exact same command:

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

如果用查询Athena表,请选择* from test ,您将得到:

  i       d          f              s           
 --- ------------ ------- --------------------- 
  1   2018-01-01   1.001   something great!     
  2   2018-01-02   2.002   something terrible!  
  3   2018-01-03   3.003   I'm an oil man    

一个类似的问题损害了上述文档中有关TIMESTAMP的解释:

An analogous problem also compromises the explanation on TIMESTAMP in the aforementioned documentation:


[OpenCSVSerDe]可以识别TIMESTAMP类型(如果在
UNIX中指定)格式,例如 yyyy-mm-dd hh:mm:ss [.f ...] ,类型为LONG。

似乎表明我们应该将时间戳格式为 yyyy-mm-dd hh:mm:ss [.f ...] 。并不是的。实际上,我们需要再次使用UNIX Epoch Time,但是这次要使用自1970年1月1日午夜以来经过的毫秒数。

It seems to indicate that we should format TIMESTAMPs as yyyy-mm-dd hh:mm:ss[.f...]. Not really. In fact, we need to use UNIX Epoch Time again, but this time with the number of milliseconds that have elapsed since Midnight 1 January 1970.

例如,请考虑以下内容样本CSV:

For instance, consider the following sample CSV:

"i","d","f","s","t"
"1","17532","1.001","something great!","1564286638027"
"2","17533","2.002","something terrible!","1564486638027"
"3","17534","3.003","I'm an oil man","1563486638012"

以及以下CREATE TABLE语句:

And the following CREATE TABLE statement:

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string, t timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

这将是 select * from test

  i       d          f              s                       t             
 --- ------------ ------- --------------------- ------------------------- 
  1   2018-01-01   1.001   something great!      2019-07-28 04:03:58.027  
  2   2018-01-02   2.002   something terrible!   2019-07-30 11:37:18.027  
  3   2018-01-03   3.003   I'm an oil man        2019-07-18 21:50:38.012  

这篇关于雅典娜无法使用OpenCSVSerde解析日期的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆