How can I use AWS Glue to start with data pulled from web service endpoints?

Question

More source data comes from a web service endpoint that I need to poll periodically. Once I have the data, I can perform traditional ETL using pyspark and eventually write the data to S3 and Redshift.

I'm not sure how to do that initial extraction, or even what I should be looking for in the AWS Glue docs. Can a "source" web service endpoint be considered a table with regard to the Data Catalog?

Any examples would be even better.

Recommended answer

I don't believe that a "source" web service endpoint can be considered a table in the Glue Data Catalog. But it shouldn't be too difficult to get this to work.

  1. Set up something to poll this web service endpoint periodically and retrieve the data you are after. The polled data should be placed into an S3 "source" bucket/location.
  2. Set up a table in the Glue Data Catalog that describes the data polled in step 1. Depending on what this data looks like, you may be able to use a Crawler to create the table, but I have had better experiences creating my tables manually (by hand at first, and eventually with CloudFormation).
  3. Use the Job Creation Wizard (via the Add Job button in the Jobs view) to create the job, following the prompts. The important part here is to make sure you set your "source" to the table set up in step 2.
  4. After creating the job, you will be able to modify the script (either Python or Scala) to apply the ETL of your choosing.
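
Step 1 is the only part that lives outside Glue, so here is a minimal sketch of what the poller might look like. It is an illustration under assumptions, not a prescribed approach: it assumes the endpoint returns JSON, that boto3 is available wherever the poller runs, and the names `poll_to_s3`, `build_s3_key`, `to_json_lines`, and `fetch_records`, as well as the `dt=` key layout, are all hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_s3_key(prefix, fetched_at):
    """Build a date-partitioned object key (the dt= layout is an assumption)."""
    return f"{prefix.strip('/')}/dt={fetched_at:%Y-%m-%d}/poll-{fetched_at:%H%M%S}.json"


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")


def fetch_records(url, timeout=30.0):
    """GET the endpoint and return its JSON payload as a list of records."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # Wrap a single JSON object so callers always get a list.
    return payload if isinstance(payload, list) else [payload]


def poll_to_s3(url, bucket, prefix, s3_client=None, now=None, fetch=fetch_records):
    """Fetch one batch from the endpoint and write it to the S3 'source' location."""
    if s3_client is None:
        import boto3  # imported lazily so the helpers above work without it

        s3_client = boto3.client("s3")
    fetched_at = now or datetime.now(timezone.utc)
    key = build_s3_key(prefix, fetched_at)
    s3_client.put_object(Bucket=bucket, Key=key, Body=to_json_lines(fetch(url)))
    return key
```

Something like this could be run on a schedule (cron, or a scheduled Lambda function, for example). Each run writes a new object under the same prefix, and that prefix is the location the table in step 2 would describe.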

This page from the AWS documentation does a pretty good job of describing the process in a bit more detail.
