AWS Glue: How to add a column with the source filename in the output?


Question

Does anyone know of a way to add the source filename as a column in a Glue job?

We created a flow where we crawled some files in S3 to create a schema. We then wrote a job that transforms the files to a new format and then writes them back to another S3 bucket as CSV, to be used by the rest of our pipeline. What we would like to do is get access to some sort of job meta properties so we can add a new column to the output file that contains the original filename.

I looked through the AWS documentation and the aws-glue-libs source, but nothing jumped out. Ideally there would be some way to get this metadata from the awsglue.job package (we're using the Python flavor).

I'm still learning Glue, so apologies if I'm using the wrong terminology. I also tagged this with the spark tag, because I believe that's what Glue uses under the covers.

Recommended answer

You can do it with Spark's input_file_name() function in your ETL job:

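import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.functions.input_file_name

// Read the catalog table and tag each row with the path of the file it came from.
// glueContext, database, table and args are assumed to be defined earlier in the job.
val df = glueContext.getCatalogSource(
  database = database,
  tableName = table,
  transformationContext = s"source-$database.$table"
).getDynamicFrame()
 .toDF()
 .withColumn("input_file_name", input_file_name())

// Write the result, including the new column, to the destination S3 path.
glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map(
    "path" -> args("DST_S3_PATH")
  )),
  transformationContext = "",
  format = "parquet" // use "csv" to match the pipeline described in the question
).writeDynamicFrame(DynamicFrame(df, glueContext))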

Keep in mind that this works only with the getCatalogSource() API, not with create_dynamic_frame_from_options().
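
Since the question uses the Python flavor of Glue, the Scala answer above might translate roughly as follows. This is a minimal sketch and has not been run as a Glue job: the function names file_name_from_path and add_source_file_columns, the column name source_file, and the frame name "with_source_file" are made up for illustration, and glue_context is assumed to be an existing awsglue.context.GlueContext. Because input_file_name() yields the full path rather than the bare filename, the sketch also adds a second column holding only the last path segment.

```python
def file_name_from_path(path):
    """Return the last segment of a path such as s3://bucket/dir/a.csv."""
    return path.rsplit("/", 1)[-1]


def add_source_file_columns(glue_context, database, table):
    """Read a catalog table and add the source path and file name as columns.

    Assumptions: glue_context is an awsglue.context.GlueContext, and
    database/table are placeholder Data Catalog names.
    """
    # Imported lazily so file_name_from_path can be used without Spark installed.
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql.functions import input_file_name, udf
    from pyspark.sql.types import StringType

    # Catalog-based read, mirroring getCatalogSource() in the Scala answer.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table
    )

    # Wrap the plain Python helper so Spark can apply it per row.
    base_name = udf(file_name_from_path, StringType())

    df = (
        dyf.toDF()
        .withColumn("input_file_name", input_file_name())
        .withColumn("source_file", base_name("input_file_name"))
    )

    # Convert back so the result can be passed to a Glue sink.
    return DynamicFrame.fromDF(df, glue_context, "with_source_file")
```

The returned DynamicFrame can then be written out with whichever Glue sink the job already uses. The path-splitting helper is kept as a plain function so it can be checked on its own, for example file_name_from_path("s3://bucket/dir/a.csv") gives "a.csv".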
