Optimal file storage for Data Lake


Problem description

I've read that a best practice for storing data in Data Lake Gen 1 is to store data in files no smaller than 256 MB. I am currently doing daily incremental loads into our Data Lake, into folders partitioned by date with files split by source/table. I am calling the root directory in the ADL RAW, for example RAW/SourceSystem/Year/Month/Day/SourceTableName.csv.

Sometimes there are only a couple of new records, which means the file is very small. Is it okay to keep the files stored this way even though they are small? I eventually merge each incremental file into a production file in a different directory so that I have one large file per source table, i.e. PRODUCTION/SourceSystem/SourceTableName.csv contains all of the data.

I'm curious whether this makes sense or whether there's a better file structure in ADL. Essentially I have a RAW directory that contains many small incremental files partitioned by date, and a PRODUCTION directory that contains larger files with all of the RAW data merged together.
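Below is a minimal PySpark sketch of the RAW-to-PRODUCTION merge workflow described above, under stated assumptions rather than as a definitive implementation: it assumes the folder layout from the question, a Spark session already configured with credentials for the Data Lake Store account, and placeholder values for the account name, source system, and table name; the merged output is written to a staging folder that would then be swapped into PRODUCTION.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-production-merge").getOrCreate()

account = "adl://<your-account>.azuredatalakestore.net"  # hypothetical account name
source_system = "SourceSystem"
table_name = "SourceTableName"

# Read every daily incremental file for this table across all Year/Month/Day folders.
raw_glob = f"{account}/RAW/{source_system}/*/*/*/{table_name}.csv"
incrementals = spark.read.option("header", "true").csv(raw_glob)

# Union in the current production data if it already exists; on the first run it will not.
prod_path = f"{account}/PRODUCTION/{source_system}/{table_name}"
try:
    production = spark.read.option("header", "true").csv(prod_path)
    merged = production.unionByName(incrementals)
except Exception:
    merged = incrementals

# Drop exact duplicate rows and write into a staging folder; coalesce(1) produces
# a single part file inside that folder, which would then be moved/renamed into
# PRODUCTION to replace the previous file.
(merged.dropDuplicates()
       .coalesce(1)
       .write.mode("overwrite")
       .option("header", "true")
       .csv(prod_path + "_staging"))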


Recommended answer

When storing data in Data Lake Storage Gen1, file size, the number of files, and the folder structure all affect performance. Performance can also depend on how the data will ultimately be used.

You may refer to the documentation Performance and scale considerations and Structure your data set, as well as the detailed explanation in the Stack Overflow thread that addresses a similar query.
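As a rough illustration of that guidance, here is a small PySpark compaction sketch (an assumption, not part of the linked documentation): it rolls one month of small daily files for a single table into a few larger files closer to the recommended 256 MB size. The paths and the hard-coded output file count are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-raw-month").getOrCreate()

account = "adl://<your-account>.azuredatalakestore.net"  # hypothetical account name
month_glob = f"{account}/RAW/SourceSystem/2020/01/*/SourceTableName.csv"

# Read all daily files for one month of a single source table.
monthly = spark.read.option("header", "true").csv(month_glob)

# In practice the output file count would be chosen as roughly
# total_input_bytes / (256 * 1024 * 1024); it is hard-coded here for brevity.
num_output_files = 4
(monthly.repartition(num_output_files)
        .write.mode("overwrite")
        .option("header", "true")
        .csv(f"{account}/RAW_COMPACTED/SourceSystem/2020/01/SourceTableName"))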

