What is the Data size limit of DBFS in Azure Databricks


Problem description


I read here that the storage limit on AWS Databricks is 5 TB for an individual file and that we can store as many files as we want. Does the same limit apply to Azure Databricks, or is there some other limit on Azure Databricks?

Update:

@CHEEKATLAPRADEEP Thanks for the explanation, but can someone please share the reason behind "we recommend that you store data in mounted object storage rather than in the DBFS root"?

I need to use DirectQuery (because of the huge data size) in Power BI, and ADLS doesn't support that as of now.

Solution

From Azure Databricks Best Practices: Do not Store any Production Data in Default DBFS Folders

Important Note: Even though the DBFS root is writeable, we recommend that you store data in mounted object storage rather than in the DBFS root.

The reasons for recommending that you store data in a mounted storage account, rather than in the storage account located in the ADB workspace, are listed below (a mounting sketch follows the list):

Reason 1: You don't have write permission when you use the same storage account externally via Storage Explorer.

Reason 2: You cannot use the same storage account for another ADB workspace, nor use the same storage account linked service for Azure Data Factory or an Azure Synapse workspace.

Reason 3: In the future, you may decide to use Azure Synapse workspaces instead of ADB.

Reason 4: What if you want to delete the existing workspace?
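As an illustration of that recommendation, the sketch below mounts an ADLS Gen2 container into the workspace so the data lives outside the DBFS root. This is a minimal sketch to run in a Databricks notebook (where dbutils is predefined); the container, storage account, secret scope, and tenant values are placeholders, not values from the question:

```python
# Minimal sketch: mount an ADLS Gen2 container via a service principal.
# All <...> values are placeholders you must supply.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    # The client secret is read from a Databricks secret scope,
    # not hard-coded in the notebook.
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)
```

Once mounted, the data is addressable at /mnt/&lt;mount-name&gt; from the workspace, while the underlying container stays usable from other tools and workspaces.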

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. DBFS is an abstraction on top of scalable object storage (i.e., ADLS Gen2).
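As a quick illustration of that abstraction (assuming a Databricks notebook, where dbutils and display are predefined), the same storage tree is reachable both through dbfs: URIs and through the local /dbfs path:

```python
# DBFS behaves like a filesystem even though it is backed by object storage.
# List the DBFS root with Databricks utilities:
display(dbutils.fs.ls("dbfs:/"))

# The same tree is exposed to local tools through the /dbfs FUSE mount,
# e.g. from a shell cell:
# %sh ls /dbfs
```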

There is no restriction on the amount of data you can store in Azure Data Lake Storage Gen2.

Note: Azure Data Lake Storage Gen2 is able to store and serve many exabytes of data.

For Azure Databricks Filesystem (DBFS) – only files smaller than 2 GB are supported.

Note: If you use local file I/O APIs to read or write files larger than 2 GB, you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder.
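For example, a large file can be copied or read through dbutils.fs or Spark rather than local file I/O; a minimal sketch for a Databricks notebook (dbutils and spark are predefined there), with placeholder paths:

```python
# Placeholder paths; the point is the API choice, not the locations.

# Risky for files > 2 GB: local file I/O through the /dbfs mount.
# with open("/dbfs/mnt/data/big.parquet", "rb") as f:
#     ...

# Safe alternatives that go through DBFS properly:
dbutils.fs.cp("dbfs:/mnt/data/big.parquet", "dbfs:/tmp/big.parquet")
df = spark.read.parquet("dbfs:/mnt/data/big.parquet")
```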

For Azure Storage – the maximum storage account capacity is 5 PiB.

Default limits also apply per account type for Azure general-purpose v1, v2, Blob storage, and block blob storage accounts. The ingress limit refers to all data that is sent to a storage account; the egress limit refers to all data that is received from a storage account.

Note: The limit on a single block blob is 4.75 TB.
