Reading JSON with Apache Spark - `corrupt_record`
Question
I have a json file, `nodes`, that looks like this:
[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]
I am able to read and manipulate this record with Python.

I am trying to read this file in Scala through the spark-shell.
From this tutorial, I can see that it is possible to read json via sqlContext.read.json:
val vfile = sqlContext.read.json("path/to/file/nodes.json")
However, this results in a corrupt_record error:
vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is not corrupt and is sound json.
Answer
Spark cannot read a JSON array as top-level records, so you have to pass:
{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
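If you cannot change how the file is produced, one way to get it into the line-per-object shape shown above is a small preprocessing step. Here is a minimal sketch using only the Python standard library (the `json_array_to_jsonl` helper name is ours, not part of any library):

```python
import json

def json_array_to_jsonl(text: str) -> str:
    """Parse a top-level JSON array and re-emit it as JSON Lines:
    one self-contained JSON object per line."""
    records = json.loads(text)
    return "\n".join(json.dumps(rec) for rec in records)

# Two of the records from the question, as a top-level JSON array.
nodes = ('[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}'
         ',{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}]')

print(json_array_to_jsonl(nodes))
```

With the file rewritten one object per line, the original `sqlContext.read.json("path/to/file/nodes.json")` call should succeed. Alternatively, if you are on Spark 2.2 or later, the reader can handle multi-line JSON directly via the `multiLine` option, e.g. `spark.read.option("multiLine", true).json(path)`.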
The tutorial you are referring to says:

Let's begin by loading a JSON file, where each line is a JSON object
The reasoning is quite simple: Spark expects you to pass a file with many JSON entities (one entity per line), so that it can distribute their processing (roughly speaking, per entity).
To shed more light on it, here is a quote from the official docs:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
This format is called JSONL (JSON Lines). Basically, it's an alternative to CSV.