Reading JSON with Apache Spark - `corrupt_record`
Question
I have a JSON file, `nodes`, that looks like this:
[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]
I am able to read and manipulate this record with Python.
I am trying to read this file in Scala through the `spark-shell`.
From this tutorial, I can see that it is possible to read JSON via `sqlContext.read.json`:
val vfile = sqlContext.read.json("path/to/file/nodes.json")
However, this results in a `corrupt_record` error:
vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is not corrupt and is sound JSON.
Answer
Spark cannot read a JSON array as a top-level record, so you have to pass:
{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
As stated in the tutorial you are referring to:

Let's begin by loading a JSON file, where each line is a JSON object
The reasoning is quite simple: Spark expects you to pass a file with many JSON entities (one entity per line) so that it can distribute their processing (per entity, roughly speaking).
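As a quick check, here is a sketch of what the spark-shell returns once the file holds one self-contained object per line (assuming the reformatted file is saved back to the same path); the inferred schema is shown in the comments:

```scala
// sqlContext is predefined in the Spark 1.x spark-shell.
val vfile = sqlContext.read.json("path/to/file/nodes.json")

// With valid line-delimited input, Spark infers a real schema instead
// of falling back to a single _corrupt_record column:
vfile.printSchema()
// root
//  |-- index: long (nullable = true)
//  |-- point: array (nullable = true)
//  |    |-- element: double (containsNull = true)
//  |-- toid: string (nullable = true)
```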
To shed more light on it, here is a quote from the official documentation:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
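As an aside, if you are on Spark 2.2 or later, the DataFrame reader has a `multiLine` option that can parse the original array-style file directly; a minimal sketch, assuming the 2.x spark-shell where a `spark` session is predefined:

```scala
// Spark 2.2+ only: with multiLine the reader treats the whole file as a
// single JSON document (here, a top-level array of objects) instead of
// expecting one object per line.
val vfile = spark.read
  .option("multiLine", true)
  .json("path/to/file/nodes.json")
```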
This one-object-per-line format is called JSONL. Basically, it's an alternative to CSV.
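Spark's own JSON writer emits this format, so a read/write round trip is one way to convert the original file (again assuming Spark 2.2+ for the `multiLine` read):

```scala
// Read the array-style file as one document, then write it back out.
// The writer produces a directory (nodes_jsonl/) of part files, each
// containing one JSON object per line, i.e. JSONL.
spark.read.option("multiLine", true)
  .json("path/to/file/nodes.json")
  .write.json("path/to/file/nodes_jsonl")
```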