AWS Glue: Crawler does not recognize Timestamp columns in CSV format


Problem description

When running the AWS Glue crawler, it does not recognize timestamp columns.

I have correctly formatted ISO 8601 timestamps in my CSV file. At first I expected Glue to automatically classify these as timestamps, which it does not.

I have also tried creating a custom timestamp classifier, following https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html

This is what my classifier looks like:
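The original screenshot is not reproduced here; roughly, it is a grok classifier built around the ISO 8601 timestamp pattern tested below. A minimal boto3 sketch of creating such a classifier (the name, classification value, and pattern are assumptions, not the exact settings from the screenshot):

import boto3

glue = boto3.client("glue")

# Hypothetical grok classifier; Name and Classification are placeholders.
glue.create_classifier(
    GrokClassifier={
        "Name": "iso8601-timestamp-classifier",
        "Classification": "csv",
        "GrokPattern": "%{TIMESTAMP_ISO8601:timestamp}",
    }
)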

This also does not correctly classify my timestamps.

I have put my data into the grok debugger (https://grokdebug.herokuapp.com/), for example:

id,iso_8601_now,iso_8601_yesterday
0,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
1,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056

and both of the following patterns match:

%{TIMESTAMP_ISO8601:timestamp}

%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?

The CSV test file was generated with the following script:

# Write a test CSV with an id column and two ISO 8601 timestamp columns.
import csv
from datetime import datetime, timedelta

with open("timestamp_test.csv", 'w', newline='') as f:
    w = csv.writer(f, delimiter=',')

    w.writerow(["id", "iso_8601_now", "iso_8601_yesterday"])

    for i in range(1000):
        w.writerow([i, datetime.utcnow().isoformat(), (datetime.utcnow() - timedelta(days=1)).isoformat()])

I expect AWS Glue to automatically classify the iso_8601 columns as timestamps. Even when adding the custom grok classifier, it still does not classify either of the columns as a timestamp.

Both columns are classified as strings.

The classifier is active on the crawler.

Output of the timestamp_test table by the crawler:

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "id",
                    "type": "bigint",
                    "comment": ""
                },
                {
                    "name": "iso_8601_now",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "iso_8601_yesterday",
                    "type": "string",
                    "comment": ""
                }
            ]
        },
        "location": "s3://REDACTED/_csv_timestamp_test/",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "-1",
        "SerDeInfo": {
            "name": "",
            "serializationLib": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "parameters": {
                "field.delim": ","
            }
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {
            "skip.header.line.count": "1",
            "sizeKey": "58926",
            "objectCount": "1",
            "UPDATED_BY_CRAWLER": "REDACTED",
            "CrawlerSchemaSerializerVersion": "1.0",
            "recordCount": "1227",
            "averageRecordSize": "48",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "compressionType": "none",
            "classification": "csv",
            "columnsOrdered": "true",
            "areColumnsQuoted": "false",
            "delimiter": ",",
            "typeOfData": "file"
        },
        "SkewedInfo": {},
        "storedAsSubDirectories": "false"
    },
    "parameters": {
        "skip.header.line.count": "1",
        "sizeKey": "58926",
        "objectCount": "1",
        "UPDATED_BY_CRAWLER": "REDACTED",
        "CrawlerSchemaSerializerVersion": "1.0",
        "recordCount": "1227",
        "averageRecordSize": "48",
        "CrawlerSchemaDeserializerVersion": "1.0",
        "compressionType": "none",
        "classification": "csv",
        "columnsOrdered": "true",
        "areColumnsQuoted": "false",
        "delimiter": ",",
        "typeOfData": "file"
    }
}

Answer

According to the CREATE TABLE documentation, the timestamp format is yyyy-mm-dd hh:mm:ss[.f...]
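The data in the question uses the ISO 8601 'T' separator rather than that layout. A quick illustration of the difference, in Python since the test data was generated with Python:

from datetime import datetime

now = datetime.utcnow()
# ISO 8601, what the CSV in the question contains, e.g. 2019-05-16T22:47:33.409056
print(now.isoformat())
# Default Hive/Athena timestamp layout, e.g. 2019-05-16 22:47:33.409056
print(now.isoformat(sep=" "))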

If you must use the ISO 8601 format, add this SerDe parameter: 'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS'

You can alter the table from Glue (1) or recreate it from Athena (2):

  1. Glue console > Tables > Edit table > add the parameter above to the Serde parameters. You will also need to click "Edit schema" and change the data types from string to timestamp (see the boto3 sketch after the CREATE TABLE statement below for a scripted version of this step).
  2. Delete the table from Athena and run:

CREATE EXTERNAL TABLE `table1`(
  `id` bigint, 
  `iso_8601_now` timestamp, 
  `iso_8601_yesterday` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 
  'field.delim' = ',',
  'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS') 
LOCATION
  's3://REDACTED/_csv_timestamp_test/'
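Option 1 can also be done programmatically. A minimal boto3 sketch, assuming placeholder database and table names (adjust to your own):

import boto3

glue = boto3.client("glue")

DATABASE = "my_database"   # placeholder
TABLE = "timestamp_test"   # placeholder

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# update_table only accepts a subset of the fields returned by get_table,
# so copy just the keys that are valid in TableInput.
table_input = {
    key: table[key]
    for key in ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
                "PartitionKeys", "TableType", "Parameters")
    if key in table
}

sd = table_input["StorageDescriptor"]

# Add the SerDe parameter; the backslash-escaped quotes are only needed inside the SQL statement.
sd["SerdeInfo"]["Parameters"]["timestamp.formats"] = "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"

# Change the two ISO 8601 columns from string to timestamp.
for col in sd["Columns"]:
    if col["Name"] in ("iso_8601_now", "iso_8601_yesterday"):
        col["Type"] = "timestamp"

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)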
