Download file with DataLakeFileClient and progress bar
Question
I need to download a large file from Azure with DataLakeFileClient and show a progress bar like tqdm during the download. Below is the code that I was trying with a smaller test file.
from tqdm import tqdm
from azure.storage.filedatalake import DataLakeFileClient

# Download a File
test_file = DataLakeFileClient.from_connection_string(my_conn_str, file_system_name=fs_name, file_path="161263.tmp")
download = test_file.download_file()
blocks = download.chunks()
print(f"File Size = {download.size}, Number of blocks = {len(blocks)}")
with open("./newfile.tmp", "wb") as my_file:
    for block in tqdm(blocks):
        my_file.write(block)
When run in a Jupyter notebook, the reported number of blocks is the same as the file size.
How can I get the correct number of blocks and make the progress bar work?
Recommended answer
When using chunks(), note that the file is split into several blocks only if it is larger than 32MB (33554432 bytes). In that case the first block is 32MB, and the remainder (total file size minus 32MB) is split into blocks of 4MB each.
For example, a 39MB file is split into 3 blocks: the first block is 32MB, the second is 4MB, and the third is 3MB (39MB - 32MB - 4MB).
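The arithmetic in that example can be checked with a small sketch. It only encodes the 32MB / 4MB rule as described in this answer; the helper name expected_block_sizes is hypothetical and is not part of the Azure SDK, and the two sizes are taken from the answer rather than read from the client.

```python
FIRST_BLOCK = 32 * 1024 * 1024  # 33554432 bytes, first block size stated in the answer
NEXT_BLOCK = 4 * 1024 * 1024    # 4MB, size of each following block stated in the answer


def expected_block_sizes(file_size, first=FIRST_BLOCK, rest=NEXT_BLOCK):
    """Hypothetical helper: block sizes implied by the 32MB/4MB rule."""
    if file_size <= 0:
        return []
    if file_size <= first:
        # Small files arrive as a single block
        return [file_size]
    sizes = [first]
    # Split what is left after the first block into 4MB pieces plus a tail
    full_blocks, tail = divmod(file_size - first, rest)
    sizes.extend([rest] * full_blocks)
    if tail:
        sizes.append(tail)
    return sizes


MB = 1024 * 1024
sizes = expected_block_sizes(39 * MB)
print([s // MB for s in sizes])  # [32, 4, 3]
```

The number of blocks is then len(expected_block_sizes(download.size)), which matches the math.ceil formula used in the example that follows.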
Here is an example that works on my side:
from tqdm import tqdm
from azure.storage.filedatalake import DataLakeFileClient
import math

conn_str = "xxxxxxxx"
file_system_name = "xxxx"
file_name = "ccc.txt"

test_file = DataLakeFileClient.from_connection_string(conn_str, file_system_name, file_name)
download = test_file.download_file()
blocks = download.chunks()

# If the file is larger than 32MB, the first block is 32MB and the rest is split into 4MB blocks
if download.size > 33554432:
    number_of_blocks = math.ceil((download.size - 33554432) / 1024 / 1024 / 4) + 1
else:
    number_of_blocks = 1
print(f"File Size = {download.size}, Number of blocks = {number_of_blocks}")

# Initialize a tqdm instance that counts bytes, not blocks
progress_bar = tqdm(total=download.size, unit='iB', unit_scale=True)
with open("D:\\a11\\ccc.txt", "wb") as my_file:
    for block in blocks:
        my_file.write(block)
        # Advance the progress bar by the number of bytes just written
        progress_bar.update(len(block))
progress_bar.close()
print("**completed**")
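The write loop above can also be factored into a small function that can be tried without an Azure account. This is only a sketch: copy_chunks and its on_progress callback are hypothetical names introduced here, and the byte strings stand in for the blocks a real download would yield.

```python
import io


def copy_chunks(chunks, out_file, on_progress=None):
    """Write each chunk to out_file and report its length to on_progress."""
    total = 0
    for chunk in chunks:
        out_file.write(chunk)
        total += len(chunk)
        if on_progress is not None:
            # Report bytes, so a byte-based progress bar stays accurate
            on_progress(len(chunk))
    return total


# Stand-in data instead of a real download
fake_chunks = [b"a" * 8, b"b" * 4, b"c" * 3]
seen = []
buffer = io.BytesIO()
written = copy_chunks(fake_chunks, buffer, on_progress=seen.append)
print(written, seen)  # 15 [8, 4, 3]
```

With tqdm, passing download.chunks() as chunks and progress_bar.update as on_progress should give the same behaviour as the loop in the example.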