SPARK read.json throwing java.io.IOException: Too many bytes before newline


Problem Description

I am getting the following error when reading a large 6 GB single-line JSON file:

Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648

Spark does not read JSON files with newlines, hence the entire 6 GB JSON file is on a single line:

jf = sqlContext.read.json("jlrn2.json")

Configuration:

spark.driver.memory 20g

Solution

Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up.

Keep in mind that Spark expects each line to be a valid JSON document, not the file as a whole. Below is the relevant passage from the Spark SQL Programming Guide:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

So if your JSON document is in the form...

[
  { [record] },
  { [record] }
]

You'll want to change it to

{ [record] }
{ [record] }
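Since the file is 6 GB, the conversion itself has to avoid loading the whole document into memory. Below is a minimal sketch of one way to do it, assuming the top level of the file is a single JSON array and using the third-party ijson streaming parser; the input file name is taken from the question, and the output file name is hypothetical.

import json
import ijson  # third-party streaming JSON parser: pip install ijson

# Stream each element of the top-level array and write it out as one
# line of JSON, so memory use stays bounded regardless of file size.
with open("jlrn2.json", "rb") as src, open("jlrn2_lines.json", "w") as dst:
    for record in ijson.items(src, "item"):  # "item" = top-level array elements
        # ijson parses JSON numbers as decimal.Decimal by default;
        # default=float lets json.dumps serialize them.
        dst.write(json.dumps(record, default=float) + "\n")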

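Once the file is line-delimited, the original read call should work unchanged. A quick way to verify (output file name as in the sketch above, hypothetical):

jf = sqlContext.read.json("jlrn2_lines.json")
jf.printSchema()
print(jf.count())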

