AWS Glue Custom Classifiers Json Path

Problem description

I have a set of JSON data files that look like this:

[
  {"client":"toys",
   "filename":"toy1.csv",
   "file_row_number":1,
   "secondary_db_index":"4050",
   "processed_timestamp":1535004075,
   "processed_datetime":"2018-08-23T06:01:15+0000",
   "entity_id":"4050",
   "entity_name":"4050",
   "is_emailable":false,
   "is_txtable":false,
   "is_loadable":false}
]

I have created a Glue Crawler with a custom classifier that uses the following JSON path:

$[*]

Glue returns the correct schema with the columns correctly identified.

However, when I query the data in Athena, all the data lands in the first column and the remaining columns are empty.

How can I get the data to spread according to their columns?

[Image of Athena query]

Thank you!

Solution

This is an issue related to Hive. I suggest two approaches. First, you can create a new table in Athena with a struct data type, like this:

CREATE EXTERNAL TABLE `example`(
`row` struct<client:string,filename:string,file_row_number:int,secondary_db_index:string,processed_timestamp:int,processed_datetime:string,entity_id:string,entity_name:string,is_emailable:boolean,is_txtable:boolean,is_loadable:boolean> COMMENT 'from deserializer')
ROW FORMAT SERDE 
'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0', 
'CrawlerSchemaSerializerVersion'='1.0', 
'UPDATED_BY_CRAWLER'='example', 
'averageRecordSize'='271', 
'classification'='json', 
'compressionType'='none', 
'jsonPath'='$[*]', 
'objectCount'='1', 
'recordCount'='1', 
'sizeKey'='271', 
'transient_lastDdlTime'='1535533583', 
'typeOfData'='file')

And then you can run the query as follows:

SELECT row.client, row.filename, row.file_row_number FROM "example"

Second, you can restructure your JSON files as shown below and then run the Crawler again. In this example I used the single-JSON-record-per-line format.

{"client":"toys","filename":"toy1.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
{"client":"toys2","filename":"toy2.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
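
If the files are already written as JSON arrays, the restructuring for the second approach could be scripted. The following is only a minimal sketch in Python, assuming each input file holds a single top-level JSON array of objects; the function name array_to_json_lines and the shortened sample records are illustrative and not part of the original answer. It writes every object on its own line, separated by newlines only.

```python
import json


def array_to_json_lines(text: str) -> str:
    """Convert a JSON array of objects into one-JSON-record-per-line text.

    Hypothetical helper for illustration; assumes `text` is a top-level array.
    """
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep each record on a single line without extra spaces.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


# Shortened sample data (assumption: real files carry the full set of fields).
sample = (
    '[{"client":"toys","filename":"toy1.csv","file_row_number":1,"is_loadable":false},'
    '{"client":"toys2","filename":"toy2.csv","file_row_number":1,"is_loadable":false}]'
)

converted = array_to_json_lines(sample)
print(converted, end="")
```

After converting the files and uploading them back to the S3 location, the Crawler would need to be run again, as described above, so that the table definition matches the new layout.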
