How to bin in PySpark?

Views: 120 Published: 2021/4/8 19:24:22 Tags: apache-spark pyspark

Question

For example, I'd like to classify a DataFrame of people into the following 4 bins according to age.

import numpy as np

# np.inf is the portable spelling; the np.Inf alias was removed in NumPy 2.0
age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

I would use pandas.cut() to do this in pandas. How do I do this in PySpark?
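
For reference, the pandas version the question alludes to might look like the sketch below. The sample ages are illustrative, and right=False is an assumption made here so that every bin is left-closed ([0, 6), [6, 18), and so on). With the pandas default of right=True the bins are right-closed instead, so an age of 60 would be labelled 'adult' and an age of 0 would fall outside all bins.

```python
import numpy as np
import pandas as pd

age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

# Illustrative sample data, not part of the original question
ages = pd.Series([23, 45, 10, 60, 56, 2, 25, 40, 33])

# right=False makes each bin left-closed: [0, 6), [6, 18), [18, 60), [60, inf)
labels = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(labels.tolist())
```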

Recommended answer

You can use the Bucketizer feature transformer from Spark's ml library.

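from pyspark.ml.feature import Bucketizer

# Assumes an active SparkSession named `spark`
values = [("a", 23), ("b", 45), ("c", 10), ("d", 60), ("e", 56), ("f", 2), ("g", 25), ("h", 40), ("j", 33)]
df = spark.createDataFrame(values, ["name", "ages"])

# Adjacent splits define the buckets: [0, 6), [6, 18), [18, 60), [60, inf]
bucketizer = Bucketizer(splits=[0, 6, 18, 60, float("inf")], inputCol="ages", outputCol="buckets")
# handleInvalid="keep" places NaN values in an extra bucket instead of raising an error
df_buck = bucketizer.setHandleInvalid("keep").transform(df)

df_buck.show()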

Output

+----+----+-------+
|name|ages|buckets|
+----+----+-------+
|   a|  23|    2.0|
|   b|  45|    2.0|
|   c|  10|    1.0|
|   d|  60|    3.0|
|   e|  56|    2.0|
|   f|   2|    0.0|
|   g|  25|    2.0|
|   h|  40|    2.0|
|   j|  33|    2.0|
+----+----+-------+
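
Note that age 60 lands in bucket 3.0 rather than 2.0. Bucketizer treats each pair of adjacent splits x, y as the half-open range [x, y), except for the last bucket, which also includes its upper bound. The plain-Python sketch below reproduces that rule without Spark so the indices in the table can be checked by hand. The helper name bucket_index is hypothetical and is not part of PySpark.

```python
from bisect import bisect_right

splits = [0, 6, 18, 60, float("inf")]

def bucket_index(value, splits):
    """Index of the left-closed bin [splits[i], splits[i+1]) that contains value.

    Hypothetical helper for illustration only; it mimics the bin edges
    Bucketizer uses but is not a PySpark function.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} lies outside the splits")
    if value == splits[-1]:
        # The last bucket is closed on both ends
        return len(splits) - 2
    return bisect_right(splits, value) - 1

ages = [23, 45, 10, 60, 56, 2, 25, 40, 33]
indices = [bucket_index(a, splits) for a in ages]
print(indices)  # [2, 2, 1, 3, 2, 0, 2, 2, 2]
```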

If you want a name for each bucket, you can use a udf to create a new column with the bucket names.

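from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Map each bucket index produced by Bucketizer to its label
t = {0.0: "infant", 1.0: "minor", 2.0: "adult", 3.0: "senior"}
# t.get returns None (null) for an index without a label, such as the extra NaN bucket
udf_foo = udf(lambda x: t.get(x), StringType())
df_buck.withColumn("age_bucket", udf_foo("buckets")).show()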

Output

+----+----+-------+----------+
|name|ages|buckets|age_bucket|
+----+----+-------+----------+
|   a|  23|    2.0|     adult|
|   b|  45|    2.0|     adult|
|   c|  10|    1.0|     minor|
|   d|  60|    3.0|    senior|
|   e|  56|    2.0|     adult|
|   f|   2|    0.0|    infant|
|   g|  25|    2.0|     adult|
|   h|  40|    2.0|     adult|
|   j|  33|    2.0|     adult|
+----+----+-------+----------+
