Script to get the file last modified date and file name (PySpark)

Question

I have a mount point location which is pointing to a blob storage where we have multiple files. We need to find the last modified date for each file along with the file name. I am using the script below, and the list of files is as follows:

/mnt/schema_id=na/184000-9.jsonl
/mnt/schema_id=na/185000-0.jsonl
/mnt/schema_id=na/185000-22.jsonl
/mnt/schema_id=na/185000-25.jsonl

import os
import time

# Path to the file/directory
path = "/mnt/schema_id=na"

ti_c = os.path.getctime(path)
ti_m = os.path.getmtime(path)

c_ti = time.ctime(ti_c)
m_ti = time.ctime(ti_m)

print(f"The file located at the path {path} was created at {c_ti} and was last modified at {m_ti}")

Answer

If you're using operating system-level commands to get file information, then you can't access that exact location - on Databricks it lives on the Databricks File System (DBFS).

To get at it on the Python level, you need to prepend /dbfs to the path, so it will be:

...
path = "/dbfs/mnt/schema_id=na"
for file_item in os.listdir(path):
    file_path = os.path.join(path, file_item)   # local path with the /dbfs prefix, for os.path calls
    ti_c = os.path.getctime(file_path)
    ...
    dbfs_path = file_path[5:]                   # DBFS-style path again, e.g. /mnt/schema_id=na/184000-9.jsonl

Note the [5:] - it strips the /dbfs prefix from the local path, giving you back the DBFS-compatible path (e.g. /mnt/schema_id=na/184000-9.jsonl).
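
Putting the pieces together, here is a minimal end-to-end sketch that prints each file name together with its last-modified time. It assumes you are running on a Databricks cluster where the mount is reachable through the /dbfs FUSE path; the dbfs_path variable name and the print format are illustrative additions, not part of the original answer:

import os
import time

path = "/dbfs/mnt/schema_id=na"   # mount point from the question, accessed via the /dbfs prefix

for file_item in os.listdir(path):
    file_path = os.path.join(path, file_item)
    if not os.path.isfile(file_path):
        continue                  # skip any sub-directories under the mount
    m_ti = time.ctime(os.path.getmtime(file_path))
    dbfs_path = file_path[5:]     # strip "/dbfs" to recover the DBFS-style path, e.g. /mnt/schema_id=na/184000-9.jsonl
    print(f"{dbfs_path} was last modified at {m_ti}")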
