Find mean of pyspark array<double>


Problem description

In pyspark, I have a variable length array of doubles for which I would like to find the mean. However, the average function requires a single numeric type.

Is there a way to find the average of an array without exploding it? I have several different arrays and I'd like to be able to do something like the following:

df.select(col("Segment.Points.trajectory_points.longitude"))

DataFrame[longitude: array<double>]

df.select(avg(col("Segment.Points.trajectory_points.longitude"))).show()

org.apache.spark.sql.AnalysisException: cannot resolve
'avg(Segment.Points.trajectory_points.longitude)' due to data type
mismatch: function average requires numeric types, not
ArrayType(DoubleType,true);;

If I have 3 unique records with the following arrays, I'd like the mean of each array as the output: 3 mean longitude values.

Input:

[Row(longitude=[-80.9, -82.9]),
 Row(longitude=[-82.92, -82.93, -82.94, -82.96, -82.92, -82.92]),
 Row(longitude=[-82.93, -82.93])]

Output:

-81.9,
-82.931,
-82.93

I am using Spark version 2.1.3.

Explode solution:

So I've got this working by exploding, but I was hoping to avoid that step. Here's what I did:

from pyspark.sql.functions import col
import pyspark.sql.functions as F

# explode each array element into its own row, keeping the element position
longitude_exp = df.select(
    col("ID"),
    F.posexplode("Segment.Points.trajectory_points.longitude").alias("pos", "longitude")
)

# average the exploded values back per record
longitude_reduced = longitude_exp.groupBy("ID").agg(F.avg("longitude"))

This successfully took the mean. However, since I'll be doing this for several columns, I'd have to explode the same DF several different times. I'll keep working through it to find a cleaner way to do this.
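For illustration, repeating the same pattern for a second array column would look roughly like this (the latitude column name is hypothetical, not from my actual data):

from pyspark.sql.functions import col
import pyspark.sql.functions as F

# hypothetical second array column, handled the same way
latitude_exp = df.select(
    col("ID"),
    F.posexplode("Segment.Points.trajectory_points.latitude").alias("pos", "latitude")
)
latitude_reduced = latitude_exp.groupBy("ID").agg(F.avg("latitude"))

# one explode + groupBy + join per array column
combined = longitude_reduced.join(latitude_reduced, on="ID")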

Recommended answer

In your case, your options are to use explode or a udf. As you've noted, explode is unnecessarily expensive, so a udf is the way to go.

You can write your own function to take the mean of a list of numbers, or piggyback off of numpy.mean. If you use numpy.mean, you'll have to cast the result to a float, because Spark doesn't know how to handle numpy.float64.

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# cast to a Python float because Spark can't handle numpy.float64
array_mean = udf(lambda x: float(np.mean(x)), FloatType())
df.select(array_mean("longitude").alias("avg")).show()
#+---------+
#|      avg|
#+---------+
#|    -81.9|
#|-82.93166|
#|   -82.93|
#+---------+
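
If you'd rather not depend on numpy, a plain-Python udf is a minimal sketch of the "write your own function" option (the empty-array handling here is an assumption, not part of the original answer):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# plain-Python mean; returns None for empty or missing arrays (assumption)
def list_mean(xs):
    return float(sum(xs)) / len(xs) if xs else None

list_mean_udf = udf(list_mean, DoubleType())
df.select(list_mean_udf("longitude").alias("avg")).show()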

