Read multiline JSON in Apache Spark


Question

I was trying to use a JSON file as a small DB. After creating a temp table on the DataFrame, I queried it with SQL and got an exception. Here is my code:

val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")

val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()

The result of df.printSchema():

root
 |-- _corrupt_record: string (nullable = true)

My JSON file:

{
  "id": 1,
  "name": "Morty",
  "age": 21
}

The exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];

How can I fix it?

UPD

The content of _corrupt_record:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|            "id": 1,|
|    "name": "Morty",|
|           "age": 21|
|                   }|
+--------------------+

UPD2

It's weird, but when I rewrite my JSON to make it a one-liner, everything works fine.

{"id": 1, "name": "Morty", "age": 21}

So the problem is with the newlines.

UPD3

I found the following sentence in the docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

It isn't convenient to keep JSON in this format. Is there any workaround to get rid of the multi-line structure of the JSON, or to convert it to a one-liner?

Answer

Spark >= 2.2

Spark 2.2 introduced the multiLine option (originally named wholeFile), which can be used to load JSON (not JSONL) files:

spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")

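As a quick end-to-end check (a minimal sketch, assuming Spark 2.2+ and the user.json from the question), the original query should then work as intended:

val df = spark.read
  .option("multiLine", true)
  .json("/path/to/user.json")

df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- id: long (nullable = true)
//  |-- name: string (nullable = true)

// createOrReplaceTempView is the Spark 2.x replacement for registerTempTable
df.createOrReplaceTempView("user_tt")
spark.sql("SELECT name FROM user_tt").show()
// +-----+
// | name|
// +-----+
// |Morty|
// +-----+
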
See:

  • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
  • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.

Spark < 2.2

Well, using JSONL-formatted data may be inconvenient, but I will argue that this is not an issue with the API but a problem with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
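
For reference, the JSON Lines layout that Spark's plain json reader expects looks like this (a made-up two-record example extending the question's data):

{"id": 1, "name": "Morty", "age": 21}
{"id": 2, "name": "Rick", "age": 60}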

It provides no schema, and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top-level documents. Arguably this is the worst possible format to use in a system like Apache Spark. It is also quite tricky, and typically impractical, to write valid JSON from a distributed system.

That being said, if the individual files are valid JSON documents (either a single document or an array of documents), you can always try wholeTextFiles:

// Read each file as a single string, then parse every whole-file string as JSON
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)

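If you would rather convert the data once instead of paying the whole-file parsing cost on every read, one possible workaround (a sketch, not part of the original answer; the output path is hypothetical) is to rewrite the files in JSONL form, since DataFrameWriter.json emits one self-contained object per line:

// Parse each whole file as a single JSON document...
val raw = sc.wholeTextFiles("/path/to/user.json").values
val df = sqlCtx.read.json(raw)

// ...and rewrite it as JSONL, which the plain read.json can load directly.
df.write.json("/path/to/user_jsonl")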