Athena can't resolve CSV files from AWS DMS


Question

I have DMS configured to continuously replicate data from MySQL RDS to S3. This creates two types of CSV files: a full load and change data capture (CDC). According to my tests, I have the following files:

testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz

Once DMS is running properly, I trigger an AWS Glue Crawler to build the Data Catalog for the S3 bucket that contains the MySQL replication files, so that Athena users can build queries against our S3-based data lake.

Unfortunately, the crawlers are not building the correct table schema for the tables stored in S3. For the example above, the crawler creates two tables for Athena:

addresses
20180405_205807186_csv_gz

The file 20180405_205807186_csv.gz contains a one-line update, but the crawler is not capable of merging the two pieces of information (taking the first load from LOAD001.csv.gz and applying the update described in 20180405_205807186_csv.gz).

I also tried to create the table in the Athena console, as described in this blog post: https://aws.amazon.com/pt/blogs/database/using-aws-database-migration-service-and-amazon-athena-to-replicate-and-run-ad-hoc-queries-on-a-sql-server-database/. That does not yield the desired output either.

From the blog post: "When you query data using Amazon Athena (later in this post), you simply point the folder location to Athena, and the query results include existing and new data inserts by combining data from both files."

Am I missing something?

Recommended answer

Athena will combine the files in S3 if they have the same structure. The blog post only covers inserts of new data in the CDC files. Because your CDC file contains an update, you'll have to build a process to merge the CDC files with the full load. Not what you wanted to hear, I'm sure.

From the blog post: "When you query data using Amazon Athena (later in this post), due to the way AWS DMS adds a column indicating inserts, deletes and updates to the new file created as part of CDC replication, we will not be able to run the Athena query by combining data from both files (initial load and CDC files)."
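
The merge process mentioned in the answer could look roughly like the following. This is only a minimal sketch, not the blog's or AWS's method. It assumes that each CDC row starts with an operation flag (`I`, `U` or `D`) followed by the same columns as the full load, that the first data column is a single-column primary key, and that CDC rows are replayed in the order they were written. The function name `apply_cdc`, the helper `read_csv_text` and the sample address rows are hypothetical.

```python
import csv
import io


def apply_cdc(full_load_rows, cdc_rows, key_index=0):
    """Replay CDC rows on top of full-load rows and return the merged rows."""
    # Index the full-load snapshot by primary key (assumed: a single column).
    current = {row[key_index]: list(row) for row in full_load_rows}

    for op, *data in cdc_rows:
        key = data[key_index]
        if op in ("I", "U"):
            # Inserts and updates both leave the latest version of the row.
            current[key] = data
        elif op == "D":
            # Deletes drop the row; ignore keys that are already gone.
            current.pop(key, None)
        else:
            raise ValueError(f"unexpected operation flag: {op!r}")

    return list(current.values())


def read_csv_text(text):
    """Parse CSV text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


# Hypothetical sample data standing in for LOAD001.csv.gz and one CDC file.
full_load = read_csv_text("1,Main St\n2,Oak Ave\n")
cdc = read_csv_text("U,1,Elm St\nI,3,Pine Rd\nD,2,Oak Ave\n")

merged = apply_cdc(full_load, cdc)
print(merged)
```

For the real files, the rows could be read with `gzip.open(path, "rt", newline="")` passed to `csv.reader`. The merged result would then be written to a separate S3 prefix that contains only merged files, so that Athena sees one consistent table instead of a mix of full-load and CDC layouts.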
