AWS Glue Crawler Cannot Extract CSV Headers

Question

At my wits' end here...

I have 15 csv files that I am generating from a beeline query like:

beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv

I chose dsv because some string fields include commas and they are not quoted, which breaks Glue even more. Besides, according to the docs, the built-in csv classifier can handle pipes (and for the most part, it does).

Anyway, I upload these 15 csv files to an S3 bucket and run my crawler.

Everything works great. For 14 of them.

Glue is able to extract the header line for every single file except one; for that one it names the columns col_0, col_1, etc., and includes the header line in the results of my SELECT queries.

Can anyone provide any insight into what could possibly be different about this one file that is causing this?

If it helps, I have a feeling that some of the fields in this csv file may, at some point, have been encoded in UTF-16 or something. When I originally opened it, there were some weird "?" characters floating around.

I've run tr -d '\000' on it in an effort to clean it up, but that may not have been enough.
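
A minimal sketch for checking whether any UTF-16 leftovers survived the cleanup (assuming the file is named data.csv as in the beeline command above; data_clean.csv is a name chosen here for illustration):

# Look for a UTF-16 byte-order mark and stray NUL bytes left over
# from a UTF-16 encoding.
with open('data.csv', 'rb') as f:
    raw = f.read()

print('UTF-16 BOM present:', raw[:2] in (b'\xff\xfe', b'\xfe\xff'))
print('NUL bytes remaining:', raw.count(b'\x00'))

# Rewrite as plain UTF-8; errors='replace' swaps any undecodable
# bytes for the U+FFFD replacement character so they stay visible.
with open('data_clean.csv', 'w', encoding='utf-8') as f:
    f.write(raw.decode('utf-8', errors='replace'))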

Again, any leads, suggestions, or experiments I can run would be great. Btw, I would prefer it if the crawler were able to do everything (i.e., not needing to manually change the schema and turn off updates).

Thanks for reading.

Edit:

I have a feeling it has to do with these requirements, from the source:

Every column in a potential header parses as a STRING data type.

Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.

Every column in a potential header must meet the AWS Glue regex requirements for a column name.

The header row must be sufficiently different from the data rows. To determine this, one or more rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
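
For illustration (a made-up sample, not from the original post), a pipe-delimited file like the one below trips that last rule: every value in every row parses as STRING, so Glue cannot tell the header row apart from the data rows:

contact_id|person_id|type|value
A100|P200|email|jane@example.com
A101|P201|phone|555-0100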


Answer

Adding a Custom Classifier fixed a similar issue of mine.

You can avoid header detection (which doesn't work when all columns are string type) by setting ContainsHeader to PRESENT when creating the custom classifier, and then providing the column names through Header. Once the custom classifier has been created, you can assign it to the crawler. Since this is added to the crawler, you won't need to make changes to the schema after the fact, and you don't risk those changes being overwritten in the next crawler run. Using boto3, it would look something like:

import boto3

glue = boto3.client('glue')

# Declare the header up front so Glue skips detection entirely.
glue.create_classifier(CsvClassifier={
    'Name': 'contacts_csv',
    'Delimiter': ',',  # use '|' for pipe-delimited files like those in the question
    'QuoteSymbol': '"',
    'ContainsHeader': 'PRESENT',
    'Header': ['contact_id', 'person_id', 'type', 'value']
})

# GLUE_CRAWLER, GLUE_DATABASE, role, and s3_path are placeholders for
# your crawler name, Glue database, IAM role, and S3 location.
glue.create_crawler(Name=GLUE_CRAWLER,
                    Role=role.arn,
                    DatabaseName=GLUE_DATABASE,
                    Targets={'S3Targets': [{'Path': s3_path}]},
                    Classifiers=['contacts_csv'])
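
To apply the classifier to files that were already crawled, you can then re-run the crawler; a minimal follow-up (reusing the GLUE_CRAWLER placeholder above) might be:

# Kick off a crawl so the new classifier takes effect.
glue.start_crawler(Name=GLUE_CRAWLER)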
