Sum operation on PySpark DataFrame giving TypeError when type is fine
Question
I have such a DataFrame in PySpark (this is the result of a take(3); the dataframe is very big):
from pyspark import SparkContext
from pyspark.sql import Row

sc = SparkContext()
df = [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
The same owner will have more rows. What I need to do is sum the values of the field a_d per owner, after grouping, as
b = df.groupBy('owner').agg(sum('a_d').alias('a_d_sum'))
but this raises an error:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
However, the schema contains double values, not strings (this comes from a printSchema()):
root
|-- owner: string (nullable = true)
|-- a_d: double (nullable = true)
So what is going on here?
Answer
You are not using the correct sum function, but Python's built-in function sum (by default).
The reason the built-in function won't work is that it takes an iterable as an argument, whereas what is passed here is the name of the column, a string, and the built-in function can't be applied to a string. Ref. the official Python documentation.
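The clash is easy to reproduce without Spark at all: the built-in sum starts from 0 and adds each element of its argument, so handing it the column name 'a_d' makes it compute 0 + 'a', which is exactly the TypeError above. A minimal sketch:

```python
# Python's built-in sum, not pyspark.sql.functions.sum:
# it iterates over its argument starting from 0, so a column
# name like 'a_d' becomes 0 + 'a' and raises TypeError.
try:
    sum('a_d')
except TypeError as e:
    print(e)  # unsupported operand type(s) for +: 'int' and 'str'
```

This is why the answer below imports the Spark function under an alias (_sum); another common convention is `import pyspark.sql.functions as F` and calling `F.sum(...)`, which avoids shadowing the built-in entirely.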
You'll need to import the proper function from pyspark.sql.functions:
from pyspark.sql import Row
from pyspark.sql.functions import sum as _sum
df = sqlContext.createDataFrame(  # assumes an existing sqlContext, e.g. SQLContext(sc)
[Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
)
df2 = df.groupBy('owner').agg(_sum('a_d').alias('a_d_sum'))
df2.show()
# +-----+-------+
# |owner|a_d_sum|
# +-----+-------+
# | u1| 0.4|
# | u2| 0.0|
# +-----+-------+