Convert folder structure to partitions on S3 using Spark


Question

I have a lot of data on S3 that sits in plain folders rather than partitions. The structure looks like this:

## s3://bucket/countryname/year/weeknumber/a.csv

s3://Countries/Canada/2019/20/part-1.csv
s3://Countries/Canada/2019/20/part-2.csv
s3://Countries/Canada/2019/20/part-3.csv

s3://Countries/Canada/2019/21/part-1.csv
s3://Countries/Canada/2019/21/part-2.csv

Is there any way to convert that data into partitions? Something like this:

s3://Countries/Country=Canada/Year=2019/Week=20/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-2.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-3.csv

s3://Countries/Country=Canada/Year=2019/Week=21/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=21/part-2.csv

I have no clue how to do this other than writing a for loop that iterates over all the folders and loads the data, which is complex.

Any help would be appreciated.

Answer

Hive-style paths aren't strictly required for partitioning. I got to this question from another question you wrote in the context of Athena, so I'm going to guess that the underlying metastore is in fact Glue, and that you're really targeting Athena (I added the amazon-athena tag to your question).

In Presto or Athena/Glue you can add partitions for any kind of path, as long as the prefixes don't overlap. For example, to add the partitions in your first example you would do this:

ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year_week = '2019-20') LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year_week = '2019-21') LOCATION 's3://Countries/Canada/2019/21/'

This assumes there is a year_week partition column, but you could have year and week as separate columns if you want (and do (country = 'Canada', year = '2019', week = '20')); either works.
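
To make that concrete, here is a minimal sketch of the table such partitions would be added to, using separate year and week columns. The table name countries, the data columns col1/col2, and the CSV delimiter settings are placeholders; adjust them to whatever your files actually contain:

CREATE EXTERNAL TABLE IF NOT EXISTS countries (
  col1 string,  -- placeholder data columns; list your real CSV columns here
  col2 string
)
PARTITIONED BY (country string, year string, week string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://Countries/'

ALTER TABLE countries ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year = '2019', week = '20') LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year = '2019', week = '21') LOCATION 's3://Countries/Canada/2019/21/'

The point is that each PARTITION clause maps partition values to an arbitrary LOCATION, so the existing folder layout can stay exactly as it is.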

Why do almost all Athena examples use Hive-style paths (e.g. country=Canada/year=2019/week=20/part-1.csv)? Partly for historical reasons: IIRC Hive doesn't support any other scheme, so partitioning and paths are tightly coupled there. Another reason is that the Athena/Presto command MSCK REPAIR TABLE works only with that style of paths (but you want to avoid relying on that command anyway). There are also other tools that assume, or only work with, that style. If you aren't using those, it doesn't matter.
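
For reference, that command is just the following; it scans the table's LOCATION for key=value directories and registers any partitions it finds, which is why it cannot discover paths shaped like s3://Countries/Canada/2019/20/:

MSCK REPAIR TABLE table_name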

If you absolutely must use Hive-style partitioning, there is a feature that lets you create "symlinks" to files in a separate path structure. You can find instructions on how to do it here: https://stackoverflow.com/a/55069330/1109 – but keep in mind that it means you'll have to keep that other path structure up to date. If you don't have to use Hive-style paths for your partitions, I would advise not bothering with the added complexity.
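
For a rough idea of what that looks like (this is the generic SymlinkTextInputFormat pattern, not the linked answer verbatim, and all names below are placeholders): you define a table whose partitions live under a separate Hive-style prefix, and each partition directory contains a small manifest text file listing the real CSV locations.

CREATE EXTERNAL TABLE countries_symlinked (
  col1 string,  -- placeholder data columns
  col2 string
)
PARTITIONED BY (country string, year string, week string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://Countries/_symlinks/'

-- Each partition directory under the new prefix then holds a manifest file, e.g.
--   s3://Countries/_symlinks/country=Canada/year=2019/week=20/manifest
-- whose lines are the real file locations:
--   s3://Countries/Canada/2019/20/part-1.csv
--   s3://Countries/Canada/2019/20/part-2.csv

The data itself stays where it is, but every new batch of files means updating both the manifests and the partition list, which is the maintenance cost mentioned above.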
