Aggregating a One-Hot Encoded feature in pyspark

Question

I am experienced in python but totally new to pyspark. I have a dataframe that contains about 50M rows, with several categorical features, and I have One-Hot Encoded each of them. Here's a simplified but representative example of the code.

from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline

df = sc.parallelize([
     (1, 'grocery'),
     (1, 'drinks'),
     (1, 'bakery'),
     (2, 'grocery'),
     (3, 'bakery'),
     (3, 'bakery'),
 ]).toDF(["id", "category"])

indexer = StringIndexer(inputCol='category', outputCol='categoryIndex')
encoder = OneHotEncoder(inputCol='categoryIndex', outputCol='categoryVec')

pipe = Pipeline(stages = [indexer, encoder])

newDF = pipe.fit(df).transform(df)

which gives the output

+---+--------+-------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+-------------+-------------+
|  1| grocery|          1.0|(2,[1],[1.0])|
|  1|  drinks|          2.0|    (2,[],[])|
|  1|  bakery|          0.0|(2,[0],[1.0])|
|  2| grocery|          1.0|(2,[1],[1.0])|
|  3|  bakery|          0.0|(2,[0],[1.0])|
|  3|  bakery|          0.0|(2,[0],[1.0])|
+---+--------+-------------+-------------+

I would now like to groupBy 'id' and aggregate the 'categoryVec' column with a sum, so I can get one row per id, with a vector that indicates which of the (possibly several) categories that customer was shopping in. In pandas this would simply be a case of applying sum/mean to each column produced in the pd.get_dummies() step, but here it doesn't seem to be so simple.
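
For reference, here is a minimal sketch of the pandas approach I have in mind (toy data standing in for my real frame):

import pandas as pd

# Toy illustration of the pandas route: one-hot encode with get_dummies,
# then sum the dummy columns per id to get one indicator row per customer.
pdf = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3],
                    'category': ['grocery', 'drinks', 'bakery',
                                 'grocery', 'bakery', 'bakery']})
dummies = pd.get_dummies(pdf['category'])
per_id = pd.concat([pdf[['id']], dummies], axis=1).groupby('id').sum()
# per_id now has one row per id and one column per category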

I will then pass the output to ML algorithms, so I will need to be able to use VectorAssembler or similar on the output.
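
Roughly something like this, where 'aggregatedDF' and 'some_numeric_col' are placeholders for whatever the aggregation step produces:

from pyspark.ml.feature import VectorAssembler

# Placeholder downstream step: combine the per-id category vector with any
# other numeric columns into the single 'features' column Spark ML expects.
assembler = VectorAssembler(inputCols=['categoryVec', 'some_numeric_col'],
                            outputCol='features')
# featuresDF = assembler.transform(aggregatedDF)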

Oh, and I really need a pyspark solution.

Many thanks for your help!

Solution

You can use CountVectorizer for this. It converts the array of categories collected for each id into an encoded vector.

from pyspark.ml.feature import CountVectorizer
from pyspark.sql import functions as F


df = sc.parallelize([
     (1, 'grocery'),
     (1, 'drinks'),
     (1, 'bakery'),
     (2, 'grocery'),
     (3, 'bakery'),
     (3, 'bakery'),
 ]).toDF(["id", "category"]) \
   .groupBy('id') \
   .agg(F.collect_list('category').alias('categoryIndexes'))

cv = CountVectorizer(inputCol='categoryIndexes', outputCol='categoryVec')

transformed_df = cv.fit(df).transform(df)
transformed_df.show()

Result:

+---+--------------------+--------------------+
| id|     categoryIndexes|         categoryVec|
+---+--------------------+--------------------+
|  1|[grocery, drinks,...|(3,[0,1,2],[1.0,1...|
|  3|    [bakery, bakery]|       (3,[0],[2.0])|
|  2|           [grocery]|       (3,[1],[1.0])|
+---+--------------------+--------------------+
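
Note that this keeps the counts (bakery shows up as 2.0 for id 3). If you only want a 0/1 presence indicator per category, CountVectorizer also accepts a binary parameter; a quick sketch:

# binary=True caps every count at 1.0, giving a presence/absence vector
# instead of purchase counts.
cv_binary = CountVectorizer(inputCol='categoryIndexes',
                            outputCol='categoryVec',
                            binary=True)
cv_binary.fit(df).transform(df).show()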
