Spark 2.1 PySpark 错误:sc.textFile("test.txt").repartition(2).collect() [英] Spark 2.1 PySpark Bug: sc.textFile("test.txt").repartition(2).collect()

查看：79 发布时间：2021/6/24 20:41:42 apache-spark pyspark

本文介绍了Spark 2.1 PySpark 错误:sc.textFile("test.txt").repartition(2).collect()的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

使用普通文本文件:

echo "a\nb\nc\nd" >> test.txt

使用 vanilla spark-2.1.0-bin-hadoop2.7.tgz，以下失败.相同的测试适用于旧版本的 Spark:

Using the vanilla spark-2.1.0-bin-hadoop2.7.tgz, the following fails. The same test works fine with older versions of Spark:

$ bin/pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
SparkSession available as 'spark'.

>>> sc.textFile("test.txt").collect()
[u'a', u'b', u'c', u'd']

>>> sc.textFile("test.txt").repartition(2).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/admin/opt/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/Users/admin/opt/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/Users/admin/opt/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 529, in load_stream
    yield self.loads(stream)
  File "/Users/admin/opt/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/serializers.py", line 524, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

使用 vanilla Spark 2.1 本地安装，相同的文本文件，但基于 Scala 的 spark-shell，完全相同的命令工作:

Using the vanilla Spark 2.1 local installation, same text file, but the Scala-based spark-shell, the exact same commands work:

$ bin/spark-shell

Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.textFile("test.txt").collect()
res0: Array[String] = Array(a, b, c, d)

scala> sc.textFile("test.txt").repartition(2).collect()
res1: Array[String] = Array(a, c, d, b)

Spark 2.1 PySpark 错误:sc.textFile("test.txt").repartition(2).collect() [英] Spark 2.1 PySpark Bug: sc.textFile("test.txt").repartition(2).collect()

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

Spark 2.1 PySpark 错误:sc.textFile("test.txt").repartition(2).collect() [英] Spark 2.1 PySpark Bug: sc.textFile(&quot;test.txt&quot;).repartition(2).collect()

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

Spark 2.1 PySpark 错误:sc.textFile("test.txt").repartition(2).collect() [英] Spark 2.1 PySpark Bug: sc.textFile("test.txt").repartition(2).collect()

登录关闭