pyspark approxQuantile function
Problem description
I have a dataframe with these columns: id, price, timestamp.
I would like to find the median value grouped by id.
I am using this code to find it, but it raises the error shown below.
from pyspark.sql import DataFrameStatFunctions as statFunc
windowSpec = Window.partitionBy("id")
median = statFunc.approxQuantile("price",
[0.5],
0) \
.over(windowSpec)
return df.withColumn("Median", median)
Is it not possible to use DataFrameStatFunctions to fill values in a new column?
TypeError: unbound method approxQuantile() must be called with DataFrameStatFunctions instance as first argument (got str instance instead)
Recommended answer
Well, it is indeed not possible to use approxQuantile to fill values in a new dataframe column, but that is not why you are getting this error. Unfortunately, the underlying story is a rather frustrating one, as is the case, I have argued, with many Spark (especially PySpark) features and their lack of adequate documentation.
To start with, there is not one but two approxQuantile methods. The first is part of the standard DataFrame class, i.e. you don't need to import DataFrameStatFunctions:
spark.version
# u'2.1.1'
sampleData = [("bob","Developer",125000),("mark","Developer",108000),("carl","Tester",70000),("peter","Developer",185000),("jon","Tester",65000),("roman","Tester",82000),("simon","Developer",98000),("eric","Developer",144000),("carlos","Tester",75000),("henry","Developer",110000)]
df = spark.createDataFrame(sampleData, schema=["Name","Role","Salary"])
df.show()
# +------+---------+------+
# | Name| Role|Salary|
# +------+---------+------+
# | bob|Developer|125000|
# | mark|Developer|108000|
# | carl| Tester| 70000|
# | peter|Developer|185000|
# | jon| Tester| 65000|
# | roman| Tester| 82000|
# | simon|Developer| 98000|
# | eric|Developer|144000|
# |carlos| Tester| 75000|
# | henry|Developer|110000|
# +------+---------+------+
med = df.approxQuantile("Salary", [0.5], 0.25) # no need to import DataFrameStatFunctions
med
# [98000.0]
The second one is part of DataFrameStatFunctions, but if you call it directly on the class, as you do, you get the error you report:
from pyspark.sql import DataFrameStatFunctions as statFunc
med2 = statFunc.approxQuantile( "Salary", [0.5], 0.25)
# TypeError: unbound method approxQuantile() must be called with DataFrameStatFunctions instance as first argument (got str instance instead)
because the correct usage is
med2 = statFunc(df).approxQuantile( "Salary", [0.5], 0.25)
med2
# [82000.0]
although you won't find a simple example of this in the PySpark documentation (it took me some time to figure it out myself)... The best part? The two values are not equal:
med == med2
# False
I suspect this is due to the non-deterministic algorithm used (after all, it is supposed to be an approximate median), and even if you re-run the commands with the same toy data you may get different values (and different from the ones I report here). I suggest experimenting a little to get a feel for it...
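To make the "approximate" part concrete, here is a minimal plain-Python sketch of the error bound that the approxQuantile documentation states: for N rows, probability p and relative error err, the returned value x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The helper name allowed_values is hypothetical, and treating rank as 1-based is an assumption on my part; this does not reproduce how Spark computes the result, it only lists which values the bound would permit for the toy salaries.

```python
import math

def allowed_values(values, probability, relative_error):
    """List the values whose rank falls inside the documented approxQuantile bound.

    Hypothetical helper for illustration only; rank is assumed to be 1-based.
    """
    ordered = sorted(values)
    n = len(ordered)
    lowest_rank = max(math.floor((probability - relative_error) * n), 1)
    highest_rank = min(math.ceil((probability + relative_error) * n), n)
    # Convert the 1-based rank range into a Python slice.
    return ordered[lowest_rank - 1:highest_rank]

# Salary column of the toy dataframe used in this answer.
salaries = [125000, 108000, 70000, 185000, 65000,
            82000, 98000, 144000, 75000, 110000]

candidates = allowed_values(salaries, probability=0.5, relative_error=0.25)
print(candidates)
```

Under these assumptions, both results reported above (98000 and 82000) fall inside the permitted range, so with a relative error as loose as 0.25 neither value contradicts the documented guarantee. Passing a smaller relative error narrows the range.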
But, as I already said, this is not the reason why you cannot use approxQuantile to fill values in a new dataframe column - even if you use the correct syntax, you will get a different error:
df2 = df.withColumn('median_salary', statFunc(df).approxQuantile( "Salary", [0.5], 0.25))
# AssertionError: col should be Column
Here, col refers to the second argument of the withColumn operation, i.e. the approxQuantile one, and the error message says that it is not of type Column - indeed, it is a list:
type(statFunc(df).approxQuantile( "Salary", [0.5], 0.25))
# list
So, when filling column values, Spark expects arguments of type Column, and you cannot use lists; here is an example of creating a new column with mean values per Role instead of median ones:
import pyspark.sql.functions as func
from pyspark.sql import Window
windowSpec = Window.partitionBy(df['Role'])
df2 = df.withColumn('mean_salary', func.mean(df['Salary']).over(windowSpec))
df2.show()
# +------+---------+------+------------------+
# | Name| Role|Salary| mean_salary|
# +------+---------+------+------------------+
# | carl| Tester| 70000| 73000.0|
# | jon| Tester| 65000| 73000.0|
# | roman| Tester| 82000| 73000.0|
# |carlos| Tester| 75000| 73000.0|
# | bob|Developer|125000|128333.33333333333|
# | mark|Developer|108000|128333.33333333333|
# | peter|Developer|185000|128333.33333333333|
# | simon|Developer| 98000|128333.33333333333|
# | eric|Developer|144000|128333.33333333333|
# | henry|Developer|110000|128333.33333333333|
# +------+---------+------+------------------+
which works because, contrary to approxQuantile, mean returns a Column:
type(func.mean(df['Salary']).over(windowSpec))
# pyspark.sql.column.Column