E-num / get dummies in PySpark


Problem description

I would like to create a function in PySpark that takes a DataFrame and a list of parameters (codes/categorical features) and returns the DataFrame with additional dummy columns, one for each category of the features in the list. Please find attached the DataFrame before and after: before-and-after DataFrame example.

The equivalent code in Python (pandas) looks like this:

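import pandas as pd

# `data` is an existing pandas DataFrame containing the columns listed in `enum`
enum = ['column1', 'column2']

for e in enum:
    print(e)
    # One 0/1 column per category; drop_first omits the first category
    temp = pd.get_dummies(data[e], drop_first=True, prefix=e)
    data = pd.concat([data, temp], axis=1)
    # Remove the original categorical column
    data.drop(e, axis=1, inplace=True)

data.to_csv('enum_data.csv')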

Recommended answer

First, collect the distinct values of TYPE and CODE. Then either add one column per value using withColumn, or build all of the columns in a single select. Here is sample code using a select statement:

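from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Get or create the SparkSession entry point
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["ID", "TYPE", "CODE"])

# Collect the distinct values of each column on the driver; sort them for a stable column order
types = sorted(df.select("TYPE").distinct().rdd.flatMap(lambda x: x).collect())
codes = sorted(df.select("CODE").distinct().rdd.flatMap(lambda x: x).collect())

# Build one 0/1 indicator expression per distinct value
types_expr = [F.when(F.col("TYPE") == ty, 1).otherwise(0).alias("e_TYPE_" + ty) for ty in types]
codes_expr = [F.when(F.col("CODE") == code, 1).otherwise(0).alias("e_CODE_" + code) for code in codes]

# Keep the original columns and append the indicator columns
df = df.select("ID", "TYPE", "CODE", *(types_expr + codes_expr))
df.show()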

Output

+---+----+----+--------+--------+--------+---------+---------+---------+
| ID|TYPE|CODE|e_TYPE_A|e_TYPE_B|e_TYPE_C|e_CODE_X1|e_CODE_X2|e_CODE_X3|
+---+----+----+--------+--------+--------+---------+---------+---------+
|  1|   A|  X1|       1|       0|       0|        1|        0|        0|
|  2|   B|  X2|       0|       1|       0|        0|        1|        0|
|  3|   B|  X3|       0|       1|       0|        0|        0|        1|
|  1|   B|  X3|       0|       1|       0|        0|        0|        1|
|  2|   C|  X2|       0|       0|       1|        0|        1|        0|
|  3|   C|  X2|       0|       0|       1|        0|        1|        0|
|  1|   C|  X1|       0|       0|       1|        1|        0|        0|
|  1|   B|  X1|       0|       1|       0|        1|        0|        0|
+---+----+----+--------+--------+--------+---------+---------+---------+
