Athena/Glue-解析简单的JSON(但将其视为CSV) [英] Athena/Glue - Parsing simple JSON (but treats it like a CSV)

查看:99
本文介绍了Athena/Glue-解析简单的JSON(但将其视为CSV)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

基于我之前的

下面是它生成的架构.从中可以看出,GLUE显然认为这是CSV而不是JSON.设置Glue搜寻器询问文件是什么类型时,我没有看到任何选项,我是否在某个隐藏选项中的某个位置错过了该选项?

对于像这样的简单示例,我可能可以手动修复架构.但是GLUE真的是一个糟糕的解析器吗?在我的实际应用程序中,我大约有150个字段,因此理想情况下它将为我生成所有列.

  CREATE EXTERNAL TABLE`flattb_testflatjson`(`col0`字符串,`col1`字符串,`col2`字符串,`col3`字符串,`col4`字符串)行格式已定界以','结尾的字段存储为INPUTFORMAT'org.apache.hadoop.mapred.TextInputFormat'输出格式'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'地点's3://relatix/polygonData/history/testflatjson/'TBLPROPERTIES('CrawlerSchemaDeserializerVersion'='1.0','CrawlerSchemaSerializerVersion'='1.0','UPDATED_BY_CRAWLER'='FlatJsonTestForAthena','areColumnsQuoted'='false','averageRecordSize'='83','分类'='csv','columnsOrdered'='true','compressionType'='none','定界符'=',','objectCount'='1','recordCount'='3','sizeKey'='255','typeOfData'='file') 

解决方案

胶水通常很糟糕,但这使我感到惊讶,直到我看到Achyut的评论:您的JSON格式不正确.

JSON是一种数据格式,而不是文件格式.格式正确的JSON文件不存在,因为规范未涵盖该内容.诸如Spark,Hadoop和Athena之类的工具要求将JSON数据存储在文件中,每行只有一个文档,因为这样可以轻松有效地处理数据.有时,这被称为"JSON流".(因为我们在谈论文件,所以这不是一个好名字)或以行分隔的JSON".

我认为您最好手动创建表格.您可以在文档中找到一个示例作为开始: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html

您还应该使用适当的JSON序列化库来编写JSON,以免出现语法错误(例如冒号而不是冒号).

Based on my previous question, I built a simple JSON file with one "row" per line. I'm still in shock, because this is not valid JSON, as it doesn't have square brackets around it.

One data file:

{"firstName": "Neal",    "lastName": "Walters", "city": "Irving", "state", "TX"  }
{"firstName": "Fred",    "lastName": "Flintstone",   "city": "Bedrock",  "state", "TX"}
{"firstName": "Barney",  "lastName": "Rubble",   "city": "Stillwater",   "state", "OK"}

After running through GLUE, this was my first query, which was quite disappointing.

Below is the schema that it generated. From it, we can see that GLUE apparently thought this was a CSV instead of JSON. I didn't see any option when setting up the Glue crawler that asked what type of a file it was, did I miss that somewhere on some hidden option?

For a simple example like this, I can probably manually fix the schema. But is GLUE really such a poor parser? In my real application, I have about 150 fields, so ideally it will generate all the columns for me.

CREATE EXTERNAL TABLE `flattb_testflatjson`(
  `col0` string, 
  `col1` string, 
  `col2` string, 
  `col3` string, 
  `col4` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://relatix/polygonData/history/testflatjson/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='FlatJsonTestForAthena', 
  'areColumnsQuoted'='false', 
  'averageRecordSize'='83', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'objectCount'='1', 
  'recordCount'='3', 
  'sizeKey'='255', 
  'typeOfData'='file')

解决方案

Glue is terrible in general, but this actually surprised me until I saw the comment by Achyut: your JSON is malformed.

JSON is a data format, not a file format. There is no such thing as a correctly formatted JSON file because the specification doesn't cover that. Tools like Spark, Hadoop, and Athena require JSON data to be in files with one document per line, because that makes it easy to efficiently process the data. Sometimes this is referred to as "JSON stream" (which isn't a great name since we're talking about files), or "line-delimited JSON".

I think you will be better off just creating the table manually. You can find an example to start off from in the documentation: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html

You should also use a proper JSON serialisation library for writing your JSON so that you don't end up with syntax errors like that accidental comma instead of colon.

这篇关于Athena/Glue-解析简单的JSON(但将其视为CSV)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆