Convert folders structure to partitions on S3 using Spark


Question

I have a lot of data on S3 that sits in plain folders rather than partitions. The structure looks like this:

## s3://bucket/countryname/year/weeknumber/a.csv

s3://Countries/Canada/2019/20/part-1.csv
s3://Countries/Canada/2019/20/part-2.csv
s3://Countries/Canada/2019/20/part-3.csv

s3://Countries/Canada/2019/21/part-1.csv
s3://Countries/Canada/2019/21/part-2.csv

Is there any way to convert that data into partitions? Something like this:

s3://Countries/Country=Canada/Year=2019/Week=20/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-2.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-3.csv

s3://Countries/Country=Canada/Year=2019/Week=21/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=21/part-2.csv

I have no clue how to do this, other than writing a for loop that iterates over all the folders and loads the data, which is complex.

Any help would be appreciated.

Answer

Hive style paths aren't always necessary for partitioning. I got to this question from another question you wrote in the context of Athena, so I'm going to guess that the underlying metastore is in fact Glue, and that you're really targeting Athena (I added the amazon-athena tag to your question).

In Presto, or Athena/Glue, you can add partitions for any kind of path, as long as the prefixes don't overlap. For example, to add the partitions in your first example you would do this:

ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year_week = '2019-20') LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year_week = '2019-21') LOCATION 's3://Countries/Canada/2019/21/'
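
The ALTER TABLE statement above assumes the table has already been created with matching partition columns. As a minimal sketch (the data column names and types here are illustrative assumptions, not taken from the original question), such a table definition could look like:

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  -- Data columns are placeholders; replace them with the actual CSV schema.
  col1 string,
  col2 string
)
PARTITIONED BY (country string, year_week string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://Countries/'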

These examples assume a single year_week column, but you could have year and week as separate columns if you want (and do (country = 'Canada', year = '2019', week = '20')); either works.
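
With separate columns, the same statement would look like this (assuming the table was declared with year and week partition columns instead of year_week):

ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year = '2019', week = '20') LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year = '2019', week = '21') LOCATION 's3://Countries/Canada/2019/21/'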

Why do almost all Athena examples use Hive style paths (e.g. country=Canada/year=2019/week=20/part-1.csv)? Partly for historical reasons: IIRC Hive doesn't support any other scheme, so partitioning and paths are tightly coupled there. Another reason is that the Athena/Presto command MSCK REPAIR TABLE only works with that style of partitioning (but you want to avoid relying on that command anyway). There are also other tools that assume, or only work with, that style. If you aren't using those, it doesn't matter.
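
For reference, that command is just a one-liner that scans the table's location and registers partitions it finds; it only helps if the data is already laid out with Hive style key=value prefixes (e.g. s3://Countries/country=Canada/year=2019/week=20/):

-- Only discovers partitions stored under Hive style key=value prefixes.
MSCK REPAIR TABLE table_name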

If you absolutely must use Hive style partitioning, there is a feature that lets you create "symlinks" to files in a separate path structure. You can find instructions on how to do it here: https://stackoverflow.com/a/55069330/1109 – but keep in mind that this means you'll have to keep that other path structure up to date. If you don't have to use Hive style paths for your partitions, I would advise that you don't bother with the added complexity.
