Datetime parsing in Apache Pig
Problem description

I'm trying to parse a date in a Pig script and I get the following error: "Hadoop does not return any error message".

Here is an example of the date format: 3/9/16 2:50 PM

And here is how I parse it:

data = LOAD 'cleaned.txt'
AS (Date, Block, Primary_Type, Description, Location_Description, Arrest, Domestic, District, Year);
times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') AS Time;

You can see the data file here. Do you have any idea? Thanks.

EDIT: It looks like the error is caused by the STORE command on "times". If I do a DUMP instead, I get:

ERROR 1066: Unable to open iterator for alias times

It happens only when I use the ToDate function; I have other scripts that work perfectly.

Answer

First of all, you need to specify the loader in the LOAD statement:

USING PigStorage('\t')

I assumed that you're using a tab separator.

Then, if you have no schema, specify one with types. For now I just use chararray for every field, but you should pick the type that correctly represents each column. So your load statement will be something like this:

data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);

After this, the date conversion works just as you wrote it:

(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)

My test script:

data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);
times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') AS Time;
DUMP times;

UPDATE: some explanation

By the way, PigStorage is the default load function for the LOAD operator, but it's nicer to specify it explicitly.

The original issue was caused by the missing data types. If you don't assign types, fields default to type bytearray, so ToDate failed on the input type.
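
The point about untyped fields defaulting to bytearray can be checked directly. The sketch below is an illustration under stated assumptions, not part of the original answer. It runs DESCRIBE on the untyped relation from the question to show the field types Pig assigns, then tries an explicit (chararray) cast inside the FOREACH as a possible alternative to declaring types in the LOAD schema. The aliases raw and casted are made up for this example, the tab delimiter is the same assumption the answer makes, and the cast variant has not been run against the original data file.

```pig
-- Load without types, as in the question; untyped fields default to bytearray.
raw = LOAD 'cleaned.txt' USING PigStorage('\t')
      AS (Date, Block, Primary_Type, Description, Location_Description,
          Arrest, Domestic, District, Year);

-- Print the schema; each field should be reported as bytearray.
DESCRIBE raw;

-- Possible alternative to a typed schema (assumption: the Date column holds
-- text such as '3/9/16 2:50 PM'): cast the field to chararray before ToDate.
casted = FOREACH raw GENERATE ToDate((chararray)Date, 'M/d/yy h:mm a') AS Time;

-- Time should now be reported as a datetime field.
DESCRIBE casted;
DUMP casted;
```

Declaring the types in the AS clause, as the answer does, keeps the schema in one place; a cast like this is mainly useful when only one or two fields need a type.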