Datetime parsing in Apache Pig


Problem description


I'm trying to parse a date in a Pig script and I get the following error: "Hadoop does not return any error message".

Here is an example of the date format: 3/9/16 2:50 PM

And here is how I parse it:

data = LOAD 'cleaned.txt'
AS (Date, Block, Primary_Type, Description, Location_Description, Arrest, Domestic, District, Year);

times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') As Time;

You can see the data file here

Do you have any idea? Thanks.


EDIT:

It looks like the error is caused by the STORE command on "times".

If I do a DUMP instead, I get:

ERROR 1066: Unable to open iterator for alias times

It happens only when I use the ToDate function; I have other scripts that work perfectly.

Solution

First of all, you need to specify the loader in the LOAD statement:

USING PigStorage('\t')

I assumed that you're using a tab separator. Then, if the data has no schema, specify one in the AS clause, with a type for every field.

So your load statement will look something like this:
data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);

For now I just use the chararray type for everything, but you should give each field the type that represents it best.

After this, the date conversion works just as you wrote it:
(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)

My test script:

data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);
times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') As Time;
DUMP times;
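
To sanity-check the pattern string itself outside of Pig, a small Java sketch can help, since Pig's documentation refers to Java-style date pattern letters for ToDate. This is only an illustration under stated assumptions: it uses the standard java.time API rather than Pig's own parser, and the class name PatternCheck is hypothetical. It parses the sample value from the question with the same 'M/d/yy h:mm a' pattern.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Pig: checks that the pattern matches the sample date.
public class PatternCheck {
    // Same pattern string that the Pig script passes to ToDate.
    // Locale.ENGLISH is set explicitly so that "AM"/"PM" is recognised on any machine.
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("M/d/yy h:mm a", Locale.ENGLISH);

    // Parses a string such as "3/9/16 2:50 PM"; throws DateTimeParseException on mismatch.
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, FORMAT);
    }

    public static void main(String[] args) {
        // Sample value taken from the question.
        System.out.println(parse("3/9/16 2:50 PM")); // prints 2016-03-09T14:50
    }
}
```

If the sample parses here, the pattern is probably not the problem, which points back at the input type of the Date field in the Pig script.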

UPDATE: Some explanation

By the way, PigStorage is already the default loader. From the documentation:

"PigStorage is the default load function for the LOAD operator."

It is still nicer to specify it explicitly. The original issue was caused by the missing data types. Again from the documentation:

"If you don't assign types, fields default to type bytearray"

So ToDate received a bytearray instead of a chararray and failed on the input type.
