java.io.StreamCorruptedException when importing a CSV to a Spark DataFrame


Problem Description

I'm running a Spark cluster in standalone mode. Both Master and Worker nodes are reachable, with logs in the Spark Web UI.

I'm trying to load data into a PySpark session so I can work on Spark DataFrames.

Following several examples (among them, one from the official docs), I tried different methods, all failing with the same error, e.g.:

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext  # SparkSession import was missing

conf = SparkConf().setAppName('NAME').setMaster('spark://HOST:7077')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()  # picks up the existing SparkContext

# a try: the generic load() entry point with an explicit csv format
df = spark.read.load('/path/to/file.csv', format='csv', sep=',', header=True)

# another try: the older SQLContext entry point
sql_ctx = SQLContext(sc)
df = sql_ctx.read.csv('/path/to/file.csv', header=True)

# and a few other tries...

Every time, I get the same error:

Py4JJavaError: An error occurred while calling o81.csv.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.X.X, executor 0):
java.io.StreamCorruptedException: invalid stream header: 0000000B

I'm loading data from both JSON and CSV (tweaking the method calls appropriately, of course), and the error is the same for both, every time.

Does someone understand what the problem is?

Recommended Answer

To whom it may concern: I finally figured out the problem, thanks to this response.

The pyspark version used for the SparkSession did not match the Spark version running on the cluster (2.4 vs 2.3).
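A quick way to catch this kind of mismatch is to compare the pyspark package installed client-side with the Spark version the cluster reports. A minimal sketch, reusing the placeholder master URL from the question:

import pyspark
from pyspark.sql import SparkSession

# Version of the pyspark package installed in the client environment
print('pyspark package:', pyspark.__version__)

# Version of Spark the driver/session is running (comes from the same
# local installation when pyspark was installed via pip)
spark = SparkSession.builder.master('spark://HOST:7077').getOrCreate()
print('Spark session:', spark.version)

The cluster's own version is shown at the top of the Master Web UI (http://HOST:8080 by default); all of these should agree, at least on the minor version.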

Reinstalling pyspark at version 2.3 instantly solved the issue. #facepalm
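For anyone hitting the same thing: pinning the client package to the cluster's minor version is enough, e.g. pip install pyspark==2.3.2 (2.3.2 is just one 2.3.x release; pick whichever matches your cluster).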

