pyspark - Convert sparse vector obtained after one hot encoding into columns


Question


I am using Apache Spark MLlib to handle categorical features with one-hot encoding. After running the code below I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I cannot figure out how to convert it into columns so that I get a new transformed dataframe. Take this dataset for example:

>>> from pyspark.ml.feature import StringIndexer
>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()

    +----+---+-----+
    |   x|  c|c_idx|
    +----+---+-----+
    | 1.0|  a|  0.0|
    | 1.5|  a|  0.0|
    |10.0|  b|  1.0|
    | 3.2|  c|  2.0|
    +----+---+-----+

By default, the OneHotEncoder will drop the last category, so the category with the highest index (here c, with c_idx 2.0) becomes the all-zero vector (2,[],[]):

>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
    +----+---+-----+-------------+
    |   x|  c|c_idx|    c_idx_vec|
    +----+---+-----+-------------+
    | 1.0|  a|  0.0|(2,[0],[1.0])|
    | 1.5|  a|  0.0|(2,[0],[1.0])|
    |10.0|  b|  1.0|(2,[1],[1.0])|
    | 3.2|  c|  2.0|    (2,[],[])|
    +----+---+-----+-------------+

Of course, this behavior can be changed:

>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()

    +----+---+-----+-------------+
    |   x|  c|c_idx|    c_idx_vec|
    +----+---+-----+-------------+
    | 1.0|  a|  0.0|(3,[0],[1.0])|
    | 1.5|  a|  0.0|(3,[0],[1.0])|
    |10.0|  b|  1.0|(3,[1],[1.0])|
    | 3.2|  c|  2.0|(3,[2],[1.0])|
    +----+---+-----+-------------+

So, I want to know how to convert my c_idx_vec vector into a new dataframe with one 0/1 indicator column per category, like this:

    +----+---+-----+----+----+----+
    |   x|  c|c_idx|ls_a|ls_b|ls_c|
    +----+---+-----+----+----+----+
    | 1.0|  a|  0.0|   1|   0|   0|
    | 1.5|  a|  0.0|   1|   0|   0|
    |10.0|  b|  1.0|   0|   1|   0|
    | 3.2|  c|  2.0|   0|   0|   1|
    +----+---+-----+----+----+----+

Solution

Here is what you can do:

>>> from pyspark.ml.feature import OneHotEncoder, StringIndexer
>>>
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+

>>>
>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> oe.setDropLast(False)
OneHotEncoder_49e58b281387d8dc0c6b
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(3,[0],[1.0])|
| 1.5|  a|  0.0|(3,[0],[1.0])|
|10.0|  b|  1.0|(3,[1],[1.0])|
| 3.2|  c|  2.0|(3,[2],[1.0])|
+----+---+-----+-------------+

# Get c and its respective index; the one-hot encoder puts each category at the same index in the vector

>>> colIdx = fl.select("c","c_idx").distinct().rdd.collectAsMap()
>>> colIdx
{'c': 2.0, 'b': 1.0, 'a': 0.0}
>>>
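>>> # Sort by index and prefix names with "ls_" so the new columns line up with vector positions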
>>> colIdx =  sorted((value, "ls_" + key) for (key, value) in colIdx.items())
>>> colIdx
[(0.0, 'ls_a'), (1.0, 'ls_b'), (2.0, 'ls_c')]
>>>
>>> newCols = list(map(lambda x: x[1], colIdx))
>>> actualCol = fl.columns
>>> actualCol
['x', 'c', 'c_idx', 'c_idx_vec']
>>> allColNames = actualCol + newCols
>>> allColNames
['x', 'c', 'c_idx', 'c_idx_vec', 'ls_a', 'ls_b', 'ls_c']
>>>
>>> def extract(row):
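...     # keep all original columns, then append the dense values of the one-hot vector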
...     return tuple(map(lambda x: row[x], row.__fields__)) + tuple(row.c_idx_vec.toArray().tolist())
...
>>> result = fl.rdd.map(extract).toDF(allColNames)
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x   |c  |c_idx|c_idx_vec    |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a  |0.0  |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|1.5 |a  |0.0  |(3,[0],[1.0])|1.0 |0.0 |0.0 |
|10.0|b  |1.0  |(3,[1],[1.0])|0.0 |1.0 |0.0 |
|3.2 |c  |2.0  |(3,[2],[1.0])|0.0 |0.0 |1.0 |
+----+---+-----+-------------+----+----+----+

# Cast the new one-hot columns to int

>>> for col in newCols:
...     result = result.withColumn(col, result[col].cast("int"))
...
>>> result.show(20, False)
+----+---+-----+-------------+----+----+----+
|x   |c  |c_idx|c_idx_vec    |ls_a|ls_b|ls_c|
+----+---+-----+-------------+----+----+----+
|1.0 |a  |0.0  |(3,[0],[1.0])|1   |0   |0   |
|1.5 |a  |0.0  |(3,[0],[1.0])|1   |0   |0   |
|10.0|b  |1.0  |(3,[1],[1.0])|0   |1   |0   |
|3.2 |c  |2.0  |(3,[2],[1.0])|0   |0   |1   |
+----+---+-----+-------------+----+----+----+
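
On Spark 3.0 and later, the same expansion can be done without dropping to the RDD API, using pyspark.ml.functions.vector_to_array. A minimal sketch, assuming the fl dataframe and the newCols list built above:

>>> from pyspark.ml.functions import vector_to_array
>>> from pyspark.sql import functions as F
>>>
>>> # Turn the vector column into a plain array, then pull out one element per category
>>> arr = fl.withColumn("arr", vector_to_array("c_idx_vec"))
>>> result = arr.select(
...     "*", *[F.col("arr")[i].cast("int").alias(name) for i, name in enumerate(newCols)]
... ).drop("arr")
>>> result.show()

This stays entirely in the DataFrame API, so the schema is preserved and no toDF inference over the RDD is needed.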

Hope this helps!!
