Pyspark string array of dynamic length in dataframe column to onehot-encoded
Question
I would like to convert a column which contains strings like:
["ABC","def","ghi"]
["Jkl","ABC","def"]
["Xyz","ABC"]
into an encoded column like this:
[1,1,1,0,0]
[1,1,0,1,0]
[0,1,0,0,1]
Is there a class for that in pyspark.ml.feature?
In the encoded column the first entry always corresponds to the value "ABC" etc. 1 means "ABC" is present while 0 means it is not present in the corresponding row.
Answer
You can probably use CountVectorizer. Below is an example:
Update: removed the step to drop duplicates in arrays; instead, you can set binary=True when setting up the CountVectorizer:
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import udf, col

# `spark` is already defined in a pyspark shell; in a standalone script,
# create it first with SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
(["ABC","def","ghi"],)
, (["Jkl","ABC","def"],)
, (["Xyz","ABC"],)
], ['arr']
)
Create the CountVectorizer model:
cv = CountVectorizer(inputCol='arr', outputCol='c1', binary=True)
model = cv.fit(df)
vocabulary = model.vocabulary
# [u'ABC', u'def', u'Xyz', u'ghi', u'Jkl']
Create a UDF to convert a vector into an array:
udf_to_array = udf(lambda v: v.toArray().tolist(), 'array<double>')
Get the vectors and check the contents:
df1 = model.transform(df)
df1.withColumn('c2', udf_to_array('c1')) \
.select('*', *[ col('c2')[i].astype('int').alias(vocabulary[i]) for i in range(len(vocabulary))]) \
.show(3,0)
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|arr |c1 |c2 |ABC|def|Xyz|ghi|Jkl|
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|[ABC, def, ghi]|(5,[0,1,3],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 1.0, 0.0]|1 |1 |0 |1 |0 |
|[Jkl, ABC, def]|(5,[0,1,4],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 0.0, 1.0]|1 |1 |0 |0 |1 |
|[Xyz, ABC] |(5,[0,2],[1.0,1.0]) |[1.0, 0.0, 1.0, 0.0, 0.0]|1 |0 |1 |0 |0 |
+---------------+-------------------------+-------------------------+---+---+---+---+---+