E-num/在 pyspark 中获取傻瓜 [英] E-num / get Dummies in pyspark
问题描述
我想在 PYSPARK 中创建一个函数来获取数据框和参数列表(代码/分类特征)并返回带有附加虚拟列的数据框,例如列表中特征的类别PFA DF 前后:之前和之后的数据框-示例
I would like to create a function in PYSPARK that get Dataframe and list of parameters (codes/categorical features) and return the data frame with additional dummy columns like the categories of the features in the list PFA the Before and After DF: before and After data frame- Example
python 中的代码如下所示:
The code in python looks like that:
enum = ['column1','column2']
for e in enum:
print e
temp = pd.get_dummies(data[e],drop_first=True,prefix=e)
data = pd.concat([data,temp], axis=1)
data.drop(e,axis=1,inplace=True)
data.to_csv('enum_data.csv')
推荐答案
首先你需要收集 TYPES
和 CODE
的不同值.然后使用 withColumn
选择添加带有每个值名称的列,或者使用 select 来选择每列.以下是使用 select 语句的示例代码:-
First you need to collect distinct values of TYPES
and CODE
. Then either select add column with name of each value using withColumn
or use select fro each column.
Here is sample code using select statement:-
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([
(1, "A", "X1"),
(2, "B", "X2"),
(3, "B", "X3"),
(1, "B", "X3"),
(2, "C", "X2"),
(3, "C", "X2"),
(1, "C", "X1"),
(1, "B", "X1"),
], ["ID", "TYPE", "CODE"])
types = df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect()
codes = df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect()
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]
df = df.select("ID", "TYPE", "CODE", *types_expr+codes_expr)
df.show()
输出
+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
| 1| A| X1| 1| 0| 0| 1| 0| 0|
| 2| B| X2| 0| 1| 0| 0| 1| 0|
| 3| B| X3| 0| 1| 0| 0| 0| 1|
| 1| B| X3| 0| 1| 0| 0| 0| 1|
| 2| C| X2| 0| 0| 1| 0| 1| 0|
| 3| C| X2| 0| 0| 1| 0| 1| 0|
| 1| C| X1| 0| 0| 1| 1| 0| 0|
| 1| B| X1| 0| 1| 0| 1| 0| 0|
+---+----+----+--------+--------+--------+---------+---------+---------+
这篇关于E-num/在 pyspark 中获取傻瓜的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!