Read multiline JSON in Apache Spark


Problem Description

I was trying to use a JSON file as a small DB. After registering a temporary table on the DataFrame, I queried it with SQL and got an exception. Here is my code:

val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")

val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()

Result of df.printSchema():

root
 |-- _corrupt_record: string (nullable = true)

My JSON file:

{
  "id": 1,
  "name": "Morty",
  "age": 21
}

The exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];

How can I fix it?

UPD

The content of _corrupt_record:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|            "id": 1,|
|    "name": "Morty",|
|           "age": 21|
|                   }|
+--------------------+

UPD2

It's weird, but when I rewrite my JSON to make it a one-liner, everything works fine.

{"id": 1, "name": "Morty", "age": 21}

So the problem is in the newlines.
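
If you control the input, one workaround is to pre-process each file into a one-liner before handing it to Spark. Below is a minimal sketch, assuming each file contains exactly one valid JSON document (the output path /path/to/user.jsonl is hypothetical). Since raw newlines are not allowed inside JSON string values, collapsing line breaks cannot corrupt a valid document:

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

val raw = new String(Files.readAllBytes(Paths.get("/path/to/user.json")), StandardCharsets.UTF_8)
// Raw newlines must be escaped inside JSON string values, so it is safe
// to collapse the literal line breaks between tokens.
val oneLiner = raw.replaceAll("\\s*\\R\\s*", " ").trim
Files.write(Paths.get("/path/to/user.jsonl"), oneLiner.getBytes(StandardCharsets.UTF_8))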

UPD3

I found the following sentence in the docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

It isn't convenient to keep the JSON in such a format. Is there any workaround to get rid of the multi-line structure of the JSON, or to convert it to a one-liner?

Recommended Answer

Spark >= 2.2

Spark 2.2 introduced the multiLine option (originally named wholeFile), which can be used to load JSON (not JSONL) files:

spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")

See:

  • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
  • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.

Spark < 2.2

Well, using JSONL-formatted data may be inconvenient, but I will argue that it is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.

It provides no schema, and without making some very specific assumptions about its formatting and shape, it is almost impossible to correctly identify top-level documents. Arguably this is the worst possible format imaginable for use in systems like Apache Spark. It is also quite tricky, and typically impractical, to write valid JSON in distributed systems.

That being said, if individual files are valid JSON documents (either a single document or an array of documents), you can always try wholeTextFiles:

spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)
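
Note that wholeTextFiles loads each file as a single (path, content) record, so every file must fit in the memory of a single executor; that is fine for a small lookup file like the one in the question, but it will not scale to large documents. With the sqlCtx from the question (a plain SQLContext), the equivalent call would be:

val df = sqlCtx.read.json(sc.wholeTextFiles("/path/to/user.json").values)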
