How to Convert Many CSV Files to Parquet Using AWS Glue


Question

I'm using AWS S3, Glue, and Athena with the following setup:

S3 --> Glue --> Athena

My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data.

Since I'm using Athena, I'd like to convert the CSV files to Parquet. I'm using AWS Glue to do this right now. This is the current process I'm using:

  1. Run a crawler to read the CSV files and populate the Data Catalog.
  2. Run an ETL job to create Parquet files from the Data Catalog.
  3. Run a crawler to populate the Data Catalog using the Parquet files.

The Glue job only allows me to convert one table at a time. If I have many CSV files, this process quickly becomes unmanageable. Is there a better way, perhaps a "correct" way, of converting many CSV files to Parquet using AWS Glue or some other AWS service?

Recommended answer

I had exactly the same situation: I wanted to efficiently loop through the catalog tables that the crawler had catalogued, which point to CSV files, and convert them to Parquet. Unfortunately, there is not much information available on the web yet. That's why I wrote a blog post on LinkedIn explaining how I did it. Please have a read, especially point #5. I hope that helps. Please let me know your feedback.

Note: As per Antti's feedback, I am pasting the solution excerpt from my blog below:

  1. Iterating through the catalog/database/tables

The Job Wizard comes with an option to run a predefined script on a data source. The problem is that the data source you can select is a single table from the catalog. It does not give you the option to run the job on the whole database or on a set of tables. You can modify the script later anyway, but the way to iterate through the database tables in the Glue catalog is also very difficult to find. There are Catalog APIs, but they lack suitable examples. The GitHub example repo could be enriched with many more scenarios to help developers.

After some mucking around, I came up with the script below, which does the job. I have used the boto3 client to loop through the tables. I am pasting it here in case it helps someone. I would also like to hear from you if you have a better suggestion.

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Standard Glue job setup
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# boto3 Glue client, used only to list the tables in the catalog database
client = boto3.client('glue', region_name='ap-southeast-2')

databaseName = 'tpc-ds-csv'
print('\ndatabaseName: ' + databaseName)

Tables = client.get_tables(DatabaseName=databaseName)

tableList = Tables['TableList']

# Convert every table in the database, one at a time
for table in tableList:
    tableName = table['Name']
    print('\n-- tableName: ' + tableName)

    # Read the CSV-backed table from the Data Catalog.
    # The transformation_ctx is made unique per table so that job
    # bookmarks, if enabled, track each table separately.
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database=databaseName,
        table_name=tableName,
        transformation_ctx="datasource_" + tableName
    )

    # Write the same data as Parquet under a per-table S3 prefix
    datasink4 = glueContext.write_dynamic_frame.from_options(
        frame=datasource0,
        connection_type="s3",
        connection_options={
            "path": "s3://aws-glue-tpcds-parquet/" + tableName + "/"
        },
        format="parquet",
        transformation_ctx="datasink_" + tableName
    )

job.commit()
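
One caveat worth noting: a single `get_tables` call returns one page of results, and the response carries a `NextToken` when more tables remain, so a database with many tables may need several calls. Below is a minimal sketch of collecting the table names across pages. It is not part of the original blog excerpt. The helper name `list_csv_tables` is hypothetical, and the filter assumes the crawler recorded the detected format under each table's `Parameters["classification"]`, which crawler-created tables normally have but hand-made tables may not.

```python
def list_csv_tables(glue_client, database_name):
    """Return the names of all tables in the database classified as CSV.

    glue_client is expected to behave like boto3.client('glue'):
    get_tables(DatabaseName=..., NextToken=...) returns a dict with
    'TableList' and, when more pages remain, 'NextToken'.
    """
    names = []
    kwargs = {"DatabaseName": database_name}
    while True:
        response = glue_client.get_tables(**kwargs)
        for table in response["TableList"]:
            # Assumption: the crawler stored the format it detected here.
            classification = table.get("Parameters", {}).get("classification")
            if classification == "csv":
                names.append(table["Name"])
        token = response.get("NextToken")
        if not token:
            break
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token
    return names
```

Inside the Glue job, the names returned by `list_csv_tables(client, databaseName)` could drive the conversion loop, which would also skip any Parquet tables that a later crawler run has added to the same database.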
