Convert folder structure to partitions on S3 using Spark
Question
I have a lot of data on S3 that sits in plain folders instead of partitions. The structure looks like this:
## s3://bucket/countryname/year/weeknumber/a.csv
s3://Countries/Canada/2019/20/part-1.csv
s3://Countries/Canada/2019/20/part-2.csv
s3://Countries/Canada/2019/20/part-3.csv
s3://Countries/Canada/2019/21/part-1.csv
s3://Countries/Canada/2019/21/part-2.csv
Is there any way to convert that data into partitions, something like this:
s3://Countries/Country=Canada/Year=2019/Week=20/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-2.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-3.csv
s3://Countries/Country=Canada/Year=2019/Week=21/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=21/part-2.csv
I have no clue how to do this, other than a for loop that iterates over all the folders and loads the data, which is complex.
Any help would be appreciated.
Answer
Hive-style paths aren't always necessary for partitioning. I got to this question from another question you wrote in the context of Athena, so I'm going to guess that the underlying metastore is in fact Glue, and that you're really targeting Athena (I added the amazon-athena tag to your question).
In Presto or Athena/Glue you can add partitions for any kind of path, as long as the prefixes don't overlap. For example, to add the partitions from your first example you would do this:
ALTER TABLE table_name ADD IF NOT EXISTS
PARTITION (country = 'Canada', year_week = '2019-20') LOCATION 's3://Countries/Canada/2019/20/'
PARTITION (country = 'Canada', year_week = '2019-21') LOCATION 's3://Countries/Canada/2019/21/'
This assumes there is a year_week column, but you could have year and week as separate columns if you want (and do PARTITION (country = 'Canada', year = '2019', week = '20')); either works.
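Since you'd need one PARTITION clause per folder, it can be convenient to generate the statement from a list of locations rather than write it by hand. Here's a minimal sketch; the table name table_name and the helper function are illustrative, not part of any library:

```python
def add_partitions_statement(table, partitions):
    """Build an Athena/Presto ALTER TABLE ... ADD PARTITION statement.

    partitions: list of (partition_spec, s3_location) pairs, where
    partition_spec is a dict of partition column -> value.
    """
    lines = [f"ALTER TABLE {table} ADD IF NOT EXISTS"]
    for spec, location in partitions:
        cols = ", ".join(f"{k} = '{v}'" for k, v in spec.items())
        lines.append(f"  PARTITION ({cols}) LOCATION '{location}'")
    return "\n".join(lines)

stmt = add_partitions_statement(
    "table_name",
    [
        ({"country": "Canada", "year": "2019", "week": "20"},
         "s3://Countries/Canada/2019/20/"),
        ({"country": "Canada", "year": "2019", "week": "21"},
         "s3://Countries/Canada/2019/21/"),
    ],
)
print(stmt)
```

You could feed this a folder listing (e.g. from an S3 list-objects call) and run the resulting statement once in Athena, instead of renaming anything on S3.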
Why do almost all Athena examples use Hive-style paths (e.g. country=Canada/year=2019/week=20/part-1.csv)? Partly for historical reasons: IIRC Hive doesn't support any other scheme, so partitioning and paths are tightly coupled. Another reason is that the Athena/Presto command MSCK REPAIR TABLE works only with that style of partitioning (but you want to avoid relying on that command anyway). There are also other tools that assume or work with that style and no other. If you aren't using those, it doesn't matter.
If you absolutely must use Hive-style partitioning, there is a feature that lets you create "symlinks" to files in a separate path structure. You can find instructions on how to do it here: https://stackoverflow.com/a/55069330/1109 – but keep in mind that this means you'll have to keep that other path structure up to date. If you don't have to use Hive-style paths for your partitions, I would advise that you don't bother with the added complexity.
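If you do decide to physically rewrite the layout anyway, the key transformation itself is simple. This is a minimal sketch assuming exactly the country/year/week layout from the question; the actual copying would be done with something like boto3's copy_object or S3DistCp, which is omitted here:

```python
def to_hive_style_key(key):
    """Map a key like 'Canada/2019/20/part-1.csv' to the Hive-style
    'Country=Canada/Year=2019/Week=20/part-1.csv'.

    Assumes keys always have exactly four components; anything else
    would need its own handling.
    """
    country, year, week, filename = key.split("/")
    return f"Country={country}/Year={year}/Week={week}/{filename}"

print(to_hive_style_key("Canada/2019/20/part-1.csv"))
# Country=Canada/Year=2019/Week=20/part-1.csv
```

Listing every object in the bucket and copying each one to its transformed key would produce the Hive-style structure shown in the question, at the cost of duplicating (or moving) all the data.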