How is data retrieved from the actual source via metadata tables created in a Glue script?


Problem description

In AWS Glue, although I have read the documentation, one thing is still not clear to me. Below is what I understood.

Regarding crawlers: a crawler creates a metadata table for either an S3 path or a DynamoDB table. What I don't understand is: how is a Scala/Python script able to retrieve data from the actual source (say DynamoDB or S3) using the metadata-created tables?

val input = glueContext
      .getCatalogSource(database = "my_data_base", tableName = "my_table")
      .getDynamicFrame()

Does the line above retrieve data from the actual source via the metadata tables?

I would be glad if someone could explain what happens behind the scenes when a Glue script retrieves data via metadata tables.

Recommended answer

When you run a Glue crawler, it fetches metadata from S3 or JDBC (depending on your source) and creates tables in the AWS Glue Data Catalog.

Now, if you want to connect to this data/these tables from a Glue ETL job, you can do it in several ways depending on your requirement:

  1. [from_options][1]: if you want to load directly from S3/JDBC without connecting to the Glue catalog.

  2. [from_catalog][1]: if you want to load data from the Glue catalog, you need to link to it using the getCatalogSource method, as shown in your code. As the name implies, it uses the Glue Data Catalog as the source and loads the particular table that you pass to this method.

Once Glue looks at your table definition, which points to a location, it makes a connection and loads the data present at that source.
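Behind the scenes, the catalog entry is only metadata; the actual read happens against the location the table definition points to. Here is a minimal runnable sketch of that two-step flow (all names are hypothetical stand-ins for Glue internals, not the real API), using a local CSV file in place of an S3 object:

```python
import csv
import tempfile

# Write some "source data" to stand in for an S3 object.
src = tempfile.NamedTemporaryFile(mode="w", suffix=".csv",
                                  delete=False, newline="")
src.write("id,name\n1,alpha\n2,beta\n")
src.close()

# Hypothetical stand-in for the Glue Data Catalog: it stores only
# metadata (location, format, schema), never the data itself.
catalog = {
    ("my_data_base", "my_table"): {"location": src.name, "format": "csv"},
}

def get_catalog_source(database, table_name):
    # Step 1 (like getCatalogSource): resolve the table to its metadata.
    return catalog[(database, table_name)]

def get_dynamic_frame(table_def):
    # Step 2 (like getDynamicFrame): only now is the actual source
    # opened, at the location recorded in the table definition.
    with open(table_def["location"], newline="") as f:
        return list(csv.DictReader(f))

rows = get_dynamic_frame(get_catalog_source("my_data_base", "my_table"))
print(rows)  # [{'id': '1', 'name': 'alpha'}, {'id': '2', 'name': 'beta'}]
```

The real getCatalogSource call performs the same kind of lookup against the Glue Data Catalog service; the data itself is then loaded from the S3/JDBC source recorded in the table definition.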

Yes, you need to use getCatalogSource if you want to load tables from the Glue catalog.

  1. Does the catalog consult the crawler and reference the actual source to load the data?

Check out the diagram in this [link][2]. It will give you an idea of the flow.

  2. If the crawler is deleted before getCatalogSource runs, will I still be able to load the data in that case?

The crawler and the table are two different components. It all depends on when the table is deleted. If you delete the table after your job starts to execute, there will not be any problem. If you delete it before execution starts, you will encounter an error.
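The timing point above can be sketched with a toy catalog (hypothetical names; in real Glue the failure surfaces as an EntityNotFoundException when the job tries to resolve the table). Once the job has resolved the table to a location, deleting the catalog entry no longer affects that run:

```python
# Hypothetical in-memory catalog; the value is metadata only.
catalog = {("db", "t"): {"location": "s3://bucket/path/"}}

def resolve(db, table):
    # Stand-in for the job resolving a table at the start of execution.
    try:
        return catalog[(db, table)]["location"]
    except KeyError:
        raise RuntimeError("EntityNotFoundException: table not found")

# Table deleted BEFORE execution starts -> error at resolution time.
saved = catalog.pop(("db", "t"))
try:
    resolve("db", "t")
except RuntimeError as e:
    print(e)

# Table deleted AFTER the job resolved it -> the read still proceeds,
# because the job already holds the source location.
catalog[("db", "t")] = saved
location = resolve("db", "t")   # job starts, resolves the table
del catalog[("db", "t")]        # table deleted mid-run
print(location)                 # s3://bucket/path/ -- still usable
```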

  3. What if my source has a huge number of records? Will it load all of them, and how does loading work in that case?

It is good to have large files present in the source, as this avoids most of the small-files problem. Glue is based on Spark: it reads files that can fit in memory and then does the computations. Check this [answer][3] and [this post][4] for best practices when reading larger files in AWS Glue.

[1]: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html
[2]: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
[3]: https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main
[4]: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=Incremental%20processing:%20Processing%20large%20datasets
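One reason large files help: an engine can group many small inputs into fewer, larger read tasks instead of paying per-file overhead. A toy sketch of that grouping idea (a hypothetical helper, similar in spirit to Glue's file grouping but not Glue's actual implementation):

```python
def group_files(sizes, target):
    """Bin (name, size) pairs into groups of roughly `target` total size,
    so each read task gets one reasonably sized group of small files."""
    groups, current, total = [], [], 0
    for name, size in sizes:
        # Start a new group once adding this file would exceed the target.
        if current and total + size > target:
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        groups.append(current)
    return groups

files = [("a", 40), ("b", 30), ("c", 50), ("d", 10), ("e", 90)]
print(group_files(files, target=100))
# -> [['a', 'b'], ['c', 'd'], ['e']]
```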

