How to skip documents that do not match schema in Athena?
Question
Suppose I have an external table like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
A few of my documents have an invalid profile.score
(a string rather than an integer).
This causes queries in Athena to fail:
"Status": { "State": "FAILED", "StateChangeReason": "HIVE_BAD_DATA: Error parsing field value for field 0: For input string: \"4099999.9999999995\"",
How can I configure Athena to skip the documents that do not fit the external table schema?
The question here is about finding the problematic documents; this question is about skipping them.
Recommended answer
Here is an example of how to exclude a particular file:
SELECT
*
FROM
"some_database"."some_table"
WHERE(
"$PATH" != 's3://path/to/a/file'
)
I just tested this approach with the following queries:
SELECT
COUNT(*)
FROM
"some_database"."some_table"
-- Result: 68491573
SELECT
COUNT(*)
FROM
"some_database"."some_table"
WHERE(
"$PATH" != 's3://path/to/a/file'
)
-- Result: 68041452
SELECT
COUNT(*)
FROM
"some_database"."some_table"
WHERE(
"$PATH" = 's3://path/to/a/file'
)
-- Result: 450121
Total: 450121 + 68041452 = 68491573
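If more than one file holds bad records, the same filter can presumably be extended to a list of paths. The following is only a sketch: the second path is a hypothetical placeholder, and it assumes, as the answer above suggests, that filtering on "$PATH" keeps Athena from parsing the excluded files.

```sql
-- Hypothetical paths: replace them with the S3 objects that hold the bad records.
SELECT
    *
FROM
    "some_database"."some_table"
WHERE
    "$PATH" NOT IN (
        's3://path/to/a/file',
        's3://path/to/another/file'
    )
```

Note that this skips whole files rather than individual documents, so any valid records stored in the listed files are dropped as well.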