java.io.StreamCorruptedException when importing a CSV to a Spark DataFrame
Problem description
I'm running a Spark cluster in standalone mode. Both Master and Worker nodes are reachable, with logs in the Spark Web UI.
I'm trying to load data into a PySpark session so I can work on Spark DataFrames.
Following several examples (among them, one from the official documentation), I tried different methods, all failing with the same error, e.g.:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
conf = SparkConf().setAppName('NAME').setMaster('spark://HOST:7077')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()
# a try
df = spark.read.load('/path/to/file.csv', format='csv', sep=',', header=True)
# another try
sql_ctx = SQLContext(sc)
df = sql_ctx.read.csv('/path/to/file.csv', header=True)
# and a few other tries...
Every time, I get the same error:
Py4JJavaError: An error occurred while calling o81.csv. :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.X.X, executor 0):
java.io.StreamCorruptedException: invalid stream header: 0000000B
I'm loading data from both JSON and CSV (tweaking the method calls appropriately, of course); the error is the same for both, every time.
Does anyone understand what the problem is?
Recommended answer
To whom it may concern, I finally figured out the problem thanks to this response.
The pyspark version used for the SparkSession did not match the Spark application version (2.4 vs 2.3).
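This kind of mismatch can be spotted before submitting anything, by comparing the client library's version (`pyspark.__version__`) with the version the cluster reports in its Web UI. A minimal sketch; the helper name and the sample version strings are illustrative, not from the original post:

```python
def versions_match(client_version: str, cluster_version: str) -> bool:
    """True when the major.minor components agree, e.g. '2.3.4' vs '2.3.1'."""
    return client_version.split(".")[:2] == cluster_version.split(".")[:2]

# The client version comes from pyspark.__version__; the cluster's version is
# shown at the top of the Master's Web UI (http://HOST:8080 by default).
print(versions_match("2.4.0", "2.3.2"))  # False -> expect serialization errors
print(versions_match("2.3.4", "2.3.2"))  # True  -> client and cluster agree
```

Spark serializes tasks with Java serialization between driver and executors, so a 2.4 client talking to a 2.3 cluster can surface exactly as an `invalid stream header` StreamCorruptedException on the executor side.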
Re-installing pyspark under version 2.3 instantly solved the issue. #facepalm
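Pinning the client library to the cluster's minor version might look like this (a sketch assuming a pip-based install and a 2.3.x cluster; the exact patch version may differ in your environment):

```shell
# Replace the mismatched client with one from the cluster's 2.3 line.
pip uninstall -y pyspark
pip install "pyspark>=2.3,<2.4"

# Confirm the installed client version before reconnecting to the cluster.
python -c "import pyspark; print(pyspark.__version__)"
```

Using a range specifier rather than a hard-coded patch version keeps the client on the cluster's minor line while still picking up patch releases.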