How to load only first n files in pyspark spark.read.csv from a single directory
Problem Description
- I have a scenario where I am loading and processing 4TB of data, which is about 15000 .csv files in a folder.
- Since I have limited resources, I am planning to process them in two batches and then union them. I am trying to understand if I can load only 50% of the files (or the first n files in batch 1 and the rest in batch 2) using spark.read.csv.
I cannot use a regular expression, as these files are generated from multiple sources and their counts are uneven (some sources produce few files, others produce many). If I process the files in uneven batches using wildcards or regex, I may not get optimized performance.
Is there a way to tell the spark.read.csv reader to pick the first n files, and then on the next run load only the remaining files?
I know this can be done by writing another program, but I would prefer not to, as I have more than 20000 files and I don't want to iterate over them.
Recommended Answer
It's easy if you use the Hadoop API to list the files first and then create DataFrames from chunks of that list. For example:
path = '/path/to/files/'

# Access the Hadoop FileSystem through Spark's JVM gateway
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path))
paths = [file.getPath().toString() for file in list_status]

# Split the file list in two and read each half as its own DataFrame
df1 = spark.read.csv(paths[:7500])
df2 = spark.read.csv(paths[7500:])
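If two halves are too coarse, the listed paths can be split into any number of near-even batches before reading. The chunking helper below is a hypothetical sketch (pure Python, no Spark required); the commented spark.read.csv call assumes the same session and paths list as above:

```python
def split_into_batches(paths, num_batches):
    """Split a list of file paths into num_batches nearly equal chunks."""
    size, rem = divmod(len(paths), num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        # The first `rem` batches get one extra path each
        end = start + size + (1 if i < rem else 0)
        batches.append(paths[start:end])
        start = end
    return batches

# Example: 10 paths split into 2 batches of 5 each
batches = split_into_batches([f"file{i}.csv" for i in range(10)], 2)
print([len(b) for b in batches])  # [5, 5]

# Each batch can then be read and processed separately, e.g.:
# for batch in batches:
#     df = spark.read.csv(batch)
#     ...process and persist, then union the results...
```

Reading each batch from an explicit list of paths this way keeps the batches balanced by file count, which the wildcard/regex approach in the question could not guarantee.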