How to get the last modification time of each file present in Azure Data Lake storage using Python in a Databricks workspace?


Question

I am trying to get the last modification time of each file present in azure data lake.

files = dbutils.fs.ls('/mnt/blob')

for fi in files: print(fi)

Output:-FileInfo(path='dbfs:/mnt/blob/rule_sheet_recon.xlsx', name='rule_sheet_recon.xlsx', size=10843)

Here I am unable to get the last modification time of the files. Is there any way to get that property?
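As an aside, on newer Databricks runtimes the FileInfo returned by dbutils.fs.ls also carries a modificationTime field (epoch milliseconds). If your runtime exposes it, a small helper can convert it to a datetime; a minimal sketch, assuming that field is present:

```python
from datetime import datetime, timezone

def mtime_to_datetime(modification_time_ms):
    """Convert a FileInfo.modificationTime value (epoch milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(modification_time_ms / 1000, tz=timezone.utc)

# Hypothetical usage on a runtime whose FileInfo exposes modificationTime:
# for fi in dbutils.fs.ls('/mnt/blob'):
#     print(fi.name, mtime_to_datetime(fi.modificationTime))
```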

I tried the shell command below to see the properties, but was unable to store the result in a Python object.

%sh ls -ls /dbfs/mnt/blob/

Output:- total 0

0 -rw-r--r-- 1 root root 13577 Sep 20 10:50 a.txt

0 -rw-r--r-- 1 root root 10843 Sep 20 10:50 b.txt

Answer

There is no direct method to get those details, but they can be obtained with the following simple Python code.

Example: Suppose you want to get all the subdirectories and files in the ADLS path container_name/container-Second. You can use the code below:

from azure.storage.blob import BlockBlobService
from datetime import datetime

block_blob_service = BlockBlobService(account_name='account-name', account_key='account-key')
container_name = 'container-firstname'
second_container_name = 'container-Second'
#block_blob_service.create_container(container_name)
generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')


for blob in generator:
    # Fetch the properties once per blob instead of making three service calls
    props = block_blob_service.get_blob_properties(container_name, blob.name).properties
    file_size = props.content_length
    last_modified = props.last_modified
    line = container_name + '|' + second_container_name + '|' + blob.name + '|' + str(file_size) + '|' + str(last_modified) + '|' + str(report_time)
    print(line)
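Note that BlockBlobService comes from the legacy azure-storage-blob 2.x SDK. In the current v12 SDK, list_blobs already returns size and last_modified on each BlobProperties item, so no per-blob round trip is needed. A sketch under that assumption (the account URL, key, and container names below are placeholders):

```python
from datetime import datetime

def format_blob_line(container_name, second_container_name, blob_name,
                     file_size, last_modified, report_time):
    """Build the same pipe-delimited report line as the loop above."""
    return '|'.join([container_name, second_container_name, blob_name,
                     str(file_size), str(last_modified), str(report_time)])

def report_blobs(container_client, container_name, second_container_name, prefix="Recovery/"):
    """Print one report line per blob using the v12 SDK's list_blobs."""
    report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for blob in container_client.list_blobs(name_starts_with=prefix):
        # In v12, BlobProperties carries size and last_modified directly
        print(format_blob_line(container_name, second_container_name,
                               blob.name, blob.size, blob.last_modified, report_time))

# Hypothetical usage:
# from azure.storage.blob import ContainerClient
# client = ContainerClient(account_url='https://account-name.blob.core.windows.net',
#                          container_name='container-firstname', credential='account-key')
# report_blobs(client, 'container-firstname', 'container-Second')
```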
