toPandas() error using pyspark: 'int' object is not iterable


Problem description

I have a pyspark dataframe and I am trying to convert it to pandas using toPandas(), however I am running into the error mentioned below.

I tried different options but got the same error:
1) limited the data to just a few records
2) used collect() explicitly (which I believe toPandas() uses internally); a minimal sketch of both attempts is shown below
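Roughly what both attempts looked like (a sketch only; df4 is the DataFrame from the question, and both calls fail with the same traceback in this environment):

>>df4.limit(10).toPandas()        # attempt 1: limit to a handful of rows before converting
>>rows = df4.limit(10).collect()  # attempt 2: call collect() explicitly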

I explored many posts on SO, but AFAIK none of them deals with a toPandas() issue.

A snapshot of my dataframe:

>>sc.version 
2.3.0.2.6.5.0-292

>>print(type(df4),len(df4.columns),df4.count())
(<class 'pyspark.sql.dataframe.DataFrame'>, 13, 296327)

>>df4.printSchema()
 root
  |-- id: string (nullable = true)
  |-- gender: string (nullable = true)
  |-- race: string (nullable = true)
  |-- age: double (nullable = true)
  |-- status: integer (nullable = true)
  |-- height: decimal(6,2) (nullable = true)
  |-- city: string (nullable = true)
  |-- county: string (nullable = true)
  |-- zipcode: string (nullable = true)
  |-- health: double (nullable = true)
  |-- physical_inactivity: double (nullable = true)
  |-- exercise: double (nullable = true)
  |-- weight: double (nullable = true)

  >>df4.limit(2).show()
+------+------+------+----+-------+-------+---------+-------+-------+------+-------------------+--------+------------+
|id    |gender|race  |age |status |height | city    |county |zipcode|health|physical_inactivity|exercise|weight      |
+------+------+------+----+-------+-------+---------+-------+-------+------+-------------------+--------+------------+
| 90001|  MALE| WHITE|61.0|      0|  70.51|DALEADALE|FIELD  |  29671|  null|               29.0|    49.0|       162.0|
| 90005|  MALE| WHITE|82.0|      0|  71.00|DALEBDALE|FIELD  |  36658|  16.0|               null|    49.0|       195.0|
+------+------+------+----+-------+-------+---------+-------+-------+------+-------------------+--------+------------+
*had to mask a few features due to data privacy concerns

Error:

>>df4.limit(10).toPandas()

'int' object is not iterable
Traceback (most recent call last):
  File "/repo/python2libs/pyspark/sql/dataframe.py", line 1968, in toPandas
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
  File "/repo/python2libs/pyspark/sql/dataframe.py", line 467, in collect
return list(_load_from_socket(sock_info,     BatchedSerializer(PickleSerializer())))
  File "/repo/python2libs/pyspark/rdd.py", line 142, in _load_from_socket
port, auth_secret = sock_info
TypeError: 'int' object is not iterable
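The traceback is consistent with the mix-up described in the answer below: the dataframe.py/rdd.py being imported expects collect() to hand back a (port, auth_secret) pair, but it receives a bare int, so the tuple unpacking fails. A minimal illustration of just that failure mode (the value is made up):

>>sock_info = 12345              # a bare port number instead of a (port, auth_secret) pair
>>port, auth_secret = sock_info
TypeError: 'int' object is not iterable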

Answer

Our custom repository of libraries had a package for pyspark which was clashing with the pyspark provided by the Spark cluster; somehow having both worked on the Spark shell but did not work in a notebook.
So, renaming the pyspark library in the custom repository resolved the issue!
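A quick way to check which pyspark an environment actually imports (a hedged sketch; run it in both the Spark shell and the notebook and compare the output, the paths below are only illustrative):

>>import pyspark
>>print(pyspark.__file__)     # e.g. /repo/python2libs/pyspark/__init__.py vs. the cluster's install path
>>print(pyspark.__version__)  # should match sc.version (2.3.0 here)

If the two environments report different paths or versions, the notebook is picking up the conflicting package from the custom repository.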
