PySpark: one-hot encoding a dynamic-length string array in a DataFrame column
Question
I would like to convert a column that contains arrays of strings like:
["ABC","def","ghi"]
["Jkl","ABC","def"]
["Xyz","ABC"]
into an encoded column like this:
[1,1,1,0,0]
[1,1,0,1,0]
[1,0,0,0,1]
Is there a class for that in pyspark.ml.feature?
In the encoded column, each position corresponds to a fixed value; for example, the first entry always corresponds to "ABC". A 1 means the value is present in that row's array, and a 0 means it is absent.
Recommended answer
You can probably use CountVectorizer from pyspark.ml.feature. Below is an example.
Update: I removed the step that dropped duplicates within each array; instead, set binary=True when creating the CountVectorizer:
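from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import udf, col

# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (["ABC", "def", "ghi"],),
    (["Jkl", "ABC", "def"],),
    (["Xyz", "ABC"],),
], ['arr'])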
Create the CountVectorizer model:
cv = CountVectorizer(inputCol='arr', outputCol='c1', binary=True)
model = cv.fit(df)
vocabulary = model.vocabulary
# e.g. ['ABC', 'def', 'Xyz', 'ghi', 'Jkl'] (most frequent first; the order of equally frequent terms can vary)
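To make it clearer what the fitted model computes with binary=True, here is a minimal plain-Python sketch of the same idea. It is an illustration only, not Spark's implementation: the helper name `multi_hot` is hypothetical, and the alphabetical tie-break is an assumption added just to make the sketch deterministic (Spark does not guarantee an order among equally frequent terms).

```python
from collections import Counter

rows = [
    ["ABC", "def", "ghi"],
    ["Jkl", "ABC", "def"],
    ["Xyz", "ABC"],
]

# Count every term across all rows and order the vocabulary by descending
# frequency. The alphabetical tie-break is an assumption of this sketch.
term_counts = Counter(term for row in rows for term in row)
vocabulary = sorted(term_counts, key=lambda t: (-term_counts[t], t))


def multi_hot(row, vocabulary):
    """Return 1 where the vocabulary term occurs in the row, else 0.

    Duplicates inside a row still yield 1, which mirrors binary=True.
    """
    present = set(row)
    return [1 if term in present else 0 for term in vocabulary]


encoded = [multi_hot(row, vocabulary) for row in rows]

print(vocabulary)
for row, vec in zip(rows, encoded):
    print(row, vec)
```

Because presence is tested against a set, a row such as ["ABC", "ABC"] still produces a single 1 in the "ABC" position, which is why the separate de-duplication step is no longer needed.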
Create a UDF to convert the vector into an array:
udf_to_array = udf(lambda v: v.toArray().tolist(), 'array<double>')
Get the vector and check the content:
df1 = model.transform(df)
df1.withColumn('c2', udf_to_array('c1')) \
   .select('*', *[col('c2')[i].astype('int').alias(vocabulary[i]) for i in range(len(vocabulary))]) \
   .show(3, False)
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|arr            |c1                       |c2                       |ABC|def|Xyz|ghi|Jkl|
+---------------+-------------------------+-------------------------+---+---+---+---+---+
|[ABC, def, ghi]|(5,[0,1,3],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 1.0, 0.0]|1  |1  |0  |1  |0  |
|[Jkl, ABC, def]|(5,[0,1,4],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 0.0, 1.0]|1  |1  |0  |0  |1  |
|[Xyz, ABC]     |(5,[0,2],[1.0,1.0])      |[1.0, 0.0, 1.0, 0.0, 0.0]|1  |0  |1  |0  |0  |
+---------------+-------------------------+-------------------------+---+---+---+---+---+