How to read AWS Glue Data Catalog table schemas programmatically

Problem description

I have a set of daily CSV files of uniform structure which I will upload to S3. There is a downstream job which loads the CSV data into a Redshift database table. The number of columns in the CSV may increase and from that point onwards the new files will come with the new columns in them. When this happens, I would like to detect the change and add the column to the target Redshift table automatically.

My plan is to run a Glue Crawler on the source CSV files. Any change in schema would generate a new version of the table in the Glue Data Catalog. I would then like to programmatically read the table structure (columns and their datatypes) of the latest version of the Table in the Glue Data Catalog using Java, .NET or other languages and compare it with the schema of the Redshift table. In case new columns are found, I will generate a DDL statement to alter the Redshift table to add the columns.

Can someone point me to any examples of reading Glue Data Catalog tables using Java, .NET or other languages? Are there any better ideas to automatically add new columns to Redshift tables?

Recommended answer

If you want to use Java, add this Maven dependency (the Glue module of the AWS SDK for Java v1):

<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-glue</artifactId>
  <version>{VERSION}</version>
</dependency>

And here's a code snippet to get your table versions and the list of columns:

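import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.*;
import java.util.List;

// Uses the default credential and region provider chains
AWSGlue client = AWSGlueClientBuilder.defaultClient();
GetTableVersionsRequest tableVersionsRequest = new GetTableVersionsRequest()
    .withDatabaseName("glue_catalog_database_name")
    // The table name belongs in withTableName, not withCatalogId
    .withTableName("table_name_generated_by_crawler");
GetTableVersionsResult results = client.getTableVersions(tableVersionsRequest);
// Here you have all the table versions, at this point you can check for new ones
List<TableVersion> versions = results.getTableVersions();
// Here's how to get to the table columns; index 0 is simply the first entry returned,
// so check getVersionId() if you need a specific version
List<Column> tableColumns = versions.get(0).getTable().getStorageDescriptor().getColumns();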
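
The comment about checking for new versions can be made concrete. The sketch below is not part of the original answer: it assumes the version IDs returned by getVersionId() are numeric strings such as "0", "1", "2", and picks the highest one. The class and method names (LatestVersionPicker, latestVersionId) are hypothetical, and the AWS SDK is deliberately left out so the logic can run on its own.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LatestVersionPicker {

    // Returns the numerically largest version ID, or empty if the list is empty.
    // Assumption: every ID parses as a non-negative integer.
    static Optional<String> latestVersionId(List<String> versionIds) {
        return versionIds.stream()
                .max(Comparator.comparing((String id) -> new BigInteger(id)));
    }

    public static void main(String[] args) {
        // In real code these would come from TableVersion.getVersionId()
        List<String> ids = Arrays.asList("2", "10", "9");
        System.out.println(latestVersionId(ids).orElse("none"));
    }
}
```

Comparing the IDs as numbers rather than as strings matters once the table passes version 9, because "10" sorts before "9" lexicographically.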

See the AWS documentation for the TableVersion and StorageDescriptor objects.
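
Once the Glue columns are in hand, the comparison step from the question could look roughly like the sketch below. It is an illustration rather than a tested integration: the class RedshiftSchemaDiff, the method names, and the Glue-to-Redshift type mapping are all assumptions, and the inputs are plain collections so the example runs without the AWS SDK or a database connection. In practice the map would be filled from Column.getName() and Column.getType(), and the existing column names would come from a query against Redshift.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class RedshiftSchemaDiff {

    // Hypothetical mapping from Glue type names to Redshift types; extend it for your data.
    static String toRedshiftType(String glueType) {
        switch (glueType.toLowerCase(Locale.ROOT)) {
            case "bigint":    return "BIGINT";
            case "int":       return "INTEGER";
            case "double":    return "DOUBLE PRECISION";
            case "boolean":   return "BOOLEAN";
            case "timestamp": return "TIMESTAMP";
            case "date":      return "DATE";
            default:          return "VARCHAR(256)"; // fallback, also used for "string"
        }
    }

    // Builds one ALTER TABLE statement per column that Glue has and Redshift lacks.
    // Names are compared case-insensitively and are not quoted or escaped here.
    static List<String> buildAlterStatements(String table,
                                             Map<String, String> glueColumns,
                                             Set<String> redshiftColumns) {
        Set<String> existing = new HashSet<>();
        for (String c : redshiftColumns) {
            existing.add(c.toLowerCase(Locale.ROOT));
        }
        List<String> ddl = new ArrayList<>();
        for (Map.Entry<String, String> e : glueColumns.entrySet()) {
            String name = e.getKey().toLowerCase(Locale.ROOT);
            if (!existing.contains(name)) {
                ddl.add("ALTER TABLE " + table + " ADD COLUMN " + name
                        + " " + toRedshiftType(e.getValue()) + ";");
            }
        }
        return ddl;
    }

    public static void main(String[] args) {
        Map<String, String> glue = new LinkedHashMap<>(); // keeps the CSV column order
        glue.put("id", "bigint");
        glue.put("name", "string");
        glue.put("signup_date", "date");
        Set<String> redshift = new HashSet<>(Arrays.asList("id", "name"));
        for (String stmt : buildAlterStatements("public.daily_events", glue, redshift)) {
            System.out.println(stmt);
        }
    }
}
```

Each new column gets its own statement because Redshift accepts only one ADD COLUMN clause per ALTER TABLE. The VARCHAR(256) fallback is a placeholder choice; pick a width that suits the actual data.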

You can also use the boto3 library for Python; its Glue client offers get_table and get_table_versions calls.

Hope this helps.