Can I tell spark.read.json that my files are gzipped?
Question
I have an s3 bucket with nearly 100k gzipped JSON files.
These files are called [timestamp].json instead of the more sensible [timestamp].json.gz.
I have other processes that use them so renaming is not an option and copying them is even less ideal.
I am using spark.read.json([pattern]) to read these files. If I rename the files to include the .gz extension this works fine, but whilst the extension is just .json they cannot be read.
Is there any way I can tell Spark that these files are gzipped?
Recommended answer
SparkSession can read a compressed JSON file directly, like this:
val json = spark.read.json("/user/the_file_path/the_json_file.log.gz")
json.printSchema()
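Note that this works because Spark (via Hadoop's codec factory) selects a decompression codec from the filename extension, which is why the .gz path above reads fine while the asker's .json-named files do not. The gzip format itself is identified by magic bytes in the file, not by its name. A minimal Python sketch (hypothetical file name and content) illustrating that the extension is irrelevant to decompression itself:

```python
import gzip
import json
import os
import tempfile

# Create a gzipped JSON file named like the question's [timestamp].json
# layout -- deliberately with no .gz suffix (hypothetical name/content).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "1620000000.json")
with gzip.open(path, "wt") as f:
    json.dump({"event": "click"}, f)

# gzip detects its format from the file's magic bytes, not the name,
# so decompression succeeds despite the .json extension.
with gzip.open(path, "rt") as f:
    record = json.load(f)

print(record)  # {'event': 'click'}
```

So the limitation is in how Spark chooses a codec, not in gzip: if the files cannot be renamed, some way of telling Spark (or Hadoop) which codec to use is needed rather than relying on the extension.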