Optimal file storage for Data Lake


Problem description

I've read that a best practice for storing data in Data Lake Gen 1 is to store data in files no smaller than 256 MB. I am currently doing daily incremental loads into our Data Lake, into folders partitioned by date with files split by source/table. I am calling the root directory in the ADL RAW, for example RAW/SourceSystem/Year/Month/Day/SourceTableName.csv.

Sometimes there are only a couple of new records, which means the file is very small. Is it okay to keep the files stored this way even though they are small? I eventually merge each incremental file into a production file in a different directory so that I have one large file per source table, i.e. PRODUCTION/SourceSystem/SourceTableName.csv contains all of the data.

I'm curious whether this makes sense or whether there's a better file structure in ADL. Essentially I have a RAW directory that contains many small incremental files partitioned by date, and a PRODUCTION directory that contains larger files with all of the RAW data merged together.
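Below is a minimal PySpark sketch of the RAW-to-PRODUCTION merge workflow described above, under stated assumptions rather than as a definitive implementation: it assumes the folder layout from the question, a Spark session already configured with credentials for the Data Lake Store account, and placeholder values for the account name, source system, and table name; the merged output is written to a staging folder that would then be swapped into PRODUCTION.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-production-merge").getOrCreate()

account = "adl://<your-account>.azuredatalakestore.net"  # hypothetical account name
source_system = "SourceSystem"
table_name = "SourceTableName"

# Read every daily incremental file for this table across all Year/Month/Day folders.
raw_glob = f"{account}/RAW/{source_system}/*/*/*/{table_name}.csv"
incrementals = spark.read.option("header", "true").csv(raw_glob)

# Union in the current production data if it already exists; on the first run it will not.
prod_path = f"{account}/PRODUCTION/{source_system}/{table_name}"
try:
    production = spark.read.option("header", "true").csv(prod_path)
    merged = production.unionByName(incrementals)
except Exception:
    merged = incrementals

# Drop exact duplicate rows and write into a staging folder; coalesce(1) produces
# a single part file inside that folder, which would then be moved/renamed into
# PRODUCTION to replace the previous file.
(merged.dropDuplicates()
       .coalesce(1)
       .write.mode("overwrite")
       .option("header", "true")
       .csv(prod_path + "_staging"))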


Recommended answer

When storing data in Data Lake Storage Gen1, file size, the number of files, and the folder structure all affect performance. Performance can also depend on how the data will ultimately be used.

You may refer to the documentation Performance and scale considerations and Structure your data set, as well as the detailed explanation in the Stack Overflow thread that addresses a similar query.
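As a rough illustration of that guidance, here is a small PySpark compaction sketch (an assumption, not part of the linked documentation): it rolls one month of small daily files for a single table into a few larger files closer to the recommended 256 MB size. The paths and the hard-coded output file count are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-raw-month").getOrCreate()

account = "adl://<your-account>.azuredatalakestore.net"  # hypothetical account name
month_glob = f"{account}/RAW/SourceSystem/2020/01/*/SourceTableName.csv"

# Read all daily files for one month of a single source table.
monthly = spark.read.option("header", "true").csv(month_glob)

# In practice the output file count would be chosen as roughly
# total_input_bytes / (256 * 1024 * 1024); it is hard-coded here for brevity.
num_output_files = 4
(monthly.repartition(num_output_files)
        .write.mode("overwrite")
        .option("header", "true")
        .csv(f"{account}/RAW_COMPACTED/SourceSystem/2020/01/SourceTableName"))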

