How do I identify problematic documents in S3 when querying data in Athena?

Problem description

I have a basic Athena query like this:

SELECT *
FROM my.dataset LIMIT 10

When I try to run it I get an error message like this:


Your query has the following error(s):

HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"

How do I identify the S3 document that has the invalid field?

My documents are JSON.

My table looks like this:

CREATE EXTERNAL TABLE my.data (
  `id` string,
  `timestamp` string,
  `profile` struct<
    `name`: string,
    `score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');


Recommended answer

Inconsistent schema

An inconsistent schema means that values in some rows have a different data type than the column expects. Let's assume that we have two JSON files:

// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos",  "age":"eleven"}
{"name":"1Fabiana", "age":22}

// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos",  "age":11}
{"name":"2Fabiana", "age":22}

Then a simple query SELECT * FROM some_table will fail with

HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
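The error comes from a value whose JSON type does not fit the column type. If a copy of a suspect file can be inspected outside Athena, a check along the following lines may help locate such values. This is only a minimal Python sketch: the function name find_non_integer_values is made up for illustration, it assumes one JSON object per line, and the SerDe's real coercion rules may be more lenient than this strict test, so treat the hits as candidates rather than a definitive list.

```python
import json

def find_non_integer_values(lines, field):
    """Return (line_number, value) pairs where `field` is present but not a JSON integer."""
    mismatches = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        value = record.get(field)
        if value is None:
            continue  # missing or null fields are not type mismatches here
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            mismatches.append((line_number, value))
    return mismatches

# The contents of bad.json from the example above.
bad_json = [
    '{"name":"1Patrick", "age":35}',
    '{"name":"1Carlos",  "age":"eleven"}',
    '{"name":"1Fabiana", "age":22}',
]
print(find_non_integer_values(bad_json, "age"))  # [(2, 'eleven')]
```

The same check would also flag a floating-point value such as 32700.000000000004 in a column declared as int.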

We can, however, exclude the bad file in the WHERE clause:

SELECT 
    "$PATH" AS "source_s3_file", 
    * 
FROM some_table 
WHERE "$PATH" != 's3://path/to/bad.json'

Result:

        source_s3_file | name     | age
---------------------------------------
s3://path/to/good.json | 2Patrick | 35
s3://path/to/good.json | 2Carlos  | 11
s3://path/to/good.json | 2Fabiana | 22

Of course, this is the best-case scenario, where we already know which files are bad. However, you can use the same approach to infer, somewhat manually, which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.

SELECT 
    COUNT(*)
FROM some_table 
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*\.json')
-- If this query doesn't fail, then those files are good.

The obvious drawback of this approach is the cost of executing the queries and the time spent, especially if it is done file by file.
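One way to reduce the number of queries is to test files in batches and split a batch only when it fails. The Python sketch below illustrates that bisection idea. It is not an Athena API: query_succeeds is a hypothetical callable that you would have to implement yourself, for example by running a COUNT(*) query restricted to the given paths with a "$PATH" filter and returning True if it finishes without error. Here a stand-in function is used so the logic can be seen in isolation.

```python
def find_bad_files(paths, query_succeeds):
    """Return the paths that make the query fail, found by bisection.

    `query_succeeds(paths)` is a hypothetical callable supplied by the
    caller; it should return True if a query limited to `paths` succeeds.
    """
    if not paths or query_succeeds(paths):
        return []  # the whole batch is good, no need to split it
    if len(paths) == 1:
        return list(paths)  # a failing batch of one file is a bad file
    middle = len(paths) // 2
    return (find_bad_files(paths[:middle], query_succeeds)
            + find_bad_files(paths[middle:], query_succeeds))

# Stand-in for a real query runner, for illustration only.
known_bad = {"s3://path/to/bad.json"}

def fake_query_succeeds(paths):
    return not any(path in known_bad for path in paths)

files = ["s3://path/to/good.json", "s3://path/to/bad.json"]
print(find_bad_files(files, fake_query_succeeds))  # ['s3://path/to/bad.json']
```

When only a few files are bad, this needs far fewer queries than checking every file individually, although each query still scans data and is billed accordingly.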

In the eyes of AWS Athena, good records are those formatted as a single JSON object per line:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
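To see which physical lines of a file break this rule, each line can be parsed on its own. The following is a minimal Python sketch under the assumption that the file contents are available locally as text; the function name find_malformed_lines is hypothetical, and the check only approximates what the SerDe does.

```python
import json

def find_malformed_lines(lines):
    """Return 1-based numbers of lines that are not a complete JSON object."""
    bad_lines = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            bad_lines.append(line_number)
            continue
        if not isinstance(parsed, dict):
            bad_lines.append(line_number)  # valid JSON, but not an object
    return bad_lines

# The second record is pretty-printed across three lines.
sample = """{ "id" : 50, "name":"John" }
{
    "id" : 51, "name":"Jane"
}
{ "id" : 53, "name":"Jill" }"""
print(find_malformed_lines(sample.splitlines()))  # [2, 3, 4]
```

A record spread over several lines is reported once per physical line, because none of its lines is a complete JSON object by itself.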

AWS Athena supports the OpenX JSON SerDe library, which can be set to evaluate malformed records as NULL by specifying the following when you create the table:

-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')

Thus, the following query will reveal files with malformed records:

SELECT 
    DISTINCT("$PATH")
FROM "some_database"."some_table" 
WHERE(
    col_1 IS NULL AND 
    col_2 IS NULL AND 
    col_3 IS NULL
    -- etc
)

Note: you can use just a single col_1 IS NULL condition if you are 100% sure that the column contains no empty fields other than in corrupted rows.

In general, malformed records are not that big of a deal provided that 'ignore.malformed.json' = 'true'. For example, if a file contains:

{"name": "2Patrick","age": 35,"address": "North Street"}
{
    "name": "2Carlos",
    "age": 11,
    "address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}

the following query will still succeed:

SELECT 
    "$PATH" AS "source_s3_file",
    *
FROM some_table

Result:

  | source_s3_file              | name     | age | address
--|-----------------------------|----------|-----|-------------
1 | s3://path/to/malformed.json | 2Patrick | 35  | North Street
2 | s3://path/to/malformed.json |          |     |
3 | s3://path/to/malformed.json |          |     |
4 | s3://path/to/malformed.json |          |     |
5 | s3://path/to/malformed.json |          |     |
6 | s3://path/to/malformed.json |          |     |
7 | s3://path/to/malformed.json | 2Fabiana | 22  | Main Street

Whereas with 'ignore.malformed.json' = 'false' (which is the default behaviour), exactly the same query will throw an error:


HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
