Find latest file pyspark

Question

So I've figured out how to find the latest file using Python. Now I'm wondering whether I can do the same with PySpark. Currently I specify a path explicitly, but I'd like PySpark to pick up the most recently modified file on its own.
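For reference, a typical pure-Python approach looks something like this (a minimal sketch; the path pattern and CSV extension are stand-ins, assuming the files live on a local filesystem):

import glob
import os

# Collect candidate files and keep the one with the newest
# modification time (max() raises ValueError on an empty list)
candidates = glob.glob("/path/to/files/*.csv")
latest = max(candidates, key=os.path.getmtime)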

The current code looks like this:

df = spark.read.csv("Path://to/file", header=True, inferSchema=True)

Thanks in advance for your help.

Answer

I copied the code for getting the HDFS API to work with PySpark from this answer: Pyspark: get list of files/directories on HDFS path

URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.s3.S3FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

# Create the S3FileSystem object here. One possibility (an assumption;
# adjust the URI, credentials and configuration to your environment):
fs = FileSystem.get(URI("s3://your-bucket"), Configuration())

files = fs.listStatus(Path("Path://to/file"))

# You can also filter out directories here (see the sketch below)
file_status = [(file.getPath().toString(), file.getModificationTime()) for file in files]

# Sort by modification time, newest first
file_status.sort(key=lambda tup: tup[1], reverse=True)

most_recently_updated = file_status[0][0]

# Read the newest file, reusing the options from the question
df = spark.read.csv(most_recently_updated, header=True, inferSchema=True)
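If the listed path can contain sub-directories, the filtering mentioned in the comment above can be done with FileStatus.isFile(); a minimal sketch that replaces the list comprehension before sorting:

# Keep only regular files, skipping sub-directories
file_status = [
    (f.getPath().toString(), f.getModificationTime())
    for f in files
    if f.isFile()
]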
