How to bin in PySpark?
Question
For example, I'd like to classify a DataFrame of people into the following 4 bins according to age.
import numpy as np
age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']
I would use pandas.cut() to do this in pandas. How do I do this in PySpark?
Recommended answer
You can use the Bucketizer feature transformer from the ml library in Spark.
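from pyspark.ml.feature import Bucketizer

values = [("a", 23), ("b", 45), ("c", 10), ("d", 60), ("e", 56), ("f", 2), ("g", 25), ("h", 40), ("j", 33)]
# Assumes an active SparkSession is available as `spark`
df = spark.createDataFrame(values, ["name", "ages"])
# The splits define the buckets [0, 6), [6, 18), [18, 60) and [60, inf]
bucketizer = Bucketizer(splits=[0, 6, 18, 60, float("inf")], inputCol="ages", outputCol="buckets")
# handleInvalid="keep" places NaN values in an extra bucket instead of raising an error
df_buck = bucketizer.setHandleInvalid("keep").transform(df)
df_buck.show()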
Output
+----+----+-------+
|name|ages|buckets|
+----+----+-------+
|   a|  23|    2.0|
|   b|  45|    2.0|
|   c|  10|    1.0|
|   d|  60|    3.0|
|   e|  56|    2.0|
|   f|   2|    0.0|
|   g|  25|    2.0|
|   h|  40|    2.0|
|   j|  33|    2.0|
+----+----+-------+
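To make the bucket boundaries concrete, here is a minimal pure-Python sketch of the rule Bucketizer documents: a bucket defined by splits x, y holds values in [x, y), except the last bucket, which also includes y. The function name bucket_index is hypothetical and this is not Spark's implementation, only an illustration of the same rule. It explains why age 60 lands in bucket 3.0 rather than 2.0.

```python
import math
from bisect import bisect_right

splits = [0, 6, 18, 60, float("inf")]


def bucket_index(value, splits):
    """Hypothetical helper: mimic the documented Bucketizer rule in plain Python.

    Buckets are half-open [x, y), except the last one, which also includes y.
    Out-of-range and NaN values raise, similar to handleInvalid="error".
    """
    if math.isnan(value) or value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value!r} is outside the splits")
    if value == splits[-1]:
        # The upper bound of the last bucket is inclusive.
        return float(len(splits) - 2)
    # bisect_right counts the splits that are <= value; subtract 1 for the bucket index.
    return float(bisect_right(splits, value) - 1)


for age in (23, 45, 10, 60, 56, 2, 25, 40, 33):
    print(age, bucket_index(age, splits))
```

The printed indices should agree with the buckets column in the output above.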
If you want a name for each bucket, you can use a udf to create a new column with the bucket names.
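from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Bucketizer outputs bucket indices as doubles, so the lookup keys are floats
t = {0.0: "infant", 1.0: "minor", 2.0: "adult", 3.0: "senior"}
# Wrap the dictionary lookup in a udf that returns a string column
udf_foo = udf(lambda x: t[x], StringType())
df_buck.withColumn("age_bucket", udf_foo("buckets")).show()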
Output
+----+----+-------+----------+
|name|ages|buckets|age_bucket|
+----+----+-------+----------+
|   a|  23|    2.0|     adult|
|   b|  45|    2.0|     adult|
|   c|  10|    1.0|     minor|
|   d|  60|    3.0|    senior|
|   e|  56|    2.0|     adult|
|   f|   2|    0.0|    infant|
|   g|  25|    2.0|     adult|
|   h|  40|    2.0|     adult|
|   j|  33|    2.0|     adult|
+----+----+-------+----------+
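One thing to watch with the lookup: a bucket index that has no entry in the dictionary makes a plain t[x] raise KeyError. That can happen because handleInvalid="keep" sends NaN ages to an extra bucket, which I assume gets the next index after the regular ones (4.0 with these splits). The sketch below builds the lookup from the question's age_labels and falls back to a default label. The names label_by_bucket and bucket_label and the "unknown" default are hypothetical, not part of the original answer.

```python
age_labels = ["infant", "minor", "adult", "senior"]

# Bucketizer emits bucket indices as doubles, so the keys are floats.
label_by_bucket = {float(i): label for i, label in enumerate(age_labels)}


def bucket_label(bucket, default="unknown"):
    """Hypothetical helper: return the label for a bucket index.

    Falls back to `default` when the index has no label, for example None
    or the extra bucket created by handleInvalid="keep".
    """
    return label_by_bucket.get(bucket, default)


print(bucket_label(2.0))   # adult
print(bucket_label(4.0))   # unknown
print(bucket_label(None))  # unknown
```

If this fits your data, the function could be wrapped the same way as the lambda, for example udf(bucket_label, StringType()).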