Apache Spark Read One Complex JSON File Per Record RDD or DF
Question
I have an HDFS directory full of files in the following JSON format:
https://www.hl7.org/fhir/bundle-transaction.json.html
What I am hoping to do is find an approach that flattens each individual file into a single DataFrame record or RDD tuple. I have tried everything I can think of, including read.json(), wholeTextFiles(), etc.
If anyone has any best-practice advice or pointers, it would be sincerely appreciated.
Answer
Load via wholeTextFiles, something like this:
sc.wholeTextFiles(...)      // RDD[(FileName, JSON)]
  .map(...processJSON...)   // RDD[JsonObject]
Then, you can simply call the .toDF method, and the schema will be inferred from your JsonObject.
As far as the processJSON method goes, you could just use something like the Play JSON parser.
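Putting the pieces together, a minimal sketch might look like the following. It assumes Play JSON is on the classpath; the HDFS path and the `BundleSummary` fields are illustrative stand-ins (a real FHIR Bundle has many more fields you would likely extract):

```scala
import org.apache.spark.sql.SparkSession
import play.api.libs.json._

// Hypothetical case class: keep only the fields we care about per file.
case class BundleSummary(resourceType: String, entryCount: Int)

val spark = SparkSession.builder().appName("fhir-bundles").getOrCreate()
import spark.implicits._   // needed for .toDF() on an RDD of case classes

val bundles = spark.sparkContext
  .wholeTextFiles("hdfs:///path/to/bundles/*.json")   // RDD[(fileName, rawJson)]
  .map { case (_, raw) =>
    val json = Json.parse(raw)
    BundleSummary(
      resourceType = (json \ "resourceType").as[String],
      entryCount   = (json \ "entry").asOpt[JsArray].map(_.value.size).getOrElse(0)
    )
  }

val df = bundles.toDF()   // one row per input file
df.show()
```

Because wholeTextFiles yields one (path, content) pair per file, each file becomes exactly one row, which is the flattening behavior asked about; note that this requires each file to fit comfortably in a single executor's memory.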