Pyspark: I want to manually map the values of one of the columns in my dataframe
Question
I have a dataframe in Spark and I want to manually map the values of one of the columns:
Col1
Y
N
N
Y
N
Y
I want "Y" to be equal to 1 and "N" to be equal to 0, like this:
Col1
1
0
0
1
0
1
I have tried StringIndexer, but I think it assigns the codes to the categorical values arbitrarily. (I am not sure.)
The Python (pandas) equivalent of this is:
df["Col1"] = df["Col1"].map({"Y": 1, "N": 0})
Can you please help me with how I can achieve this in PySpark?
Answer
Since you want to map the values to 1 and 0, an easy way is to specify a boolean condition and cast the result to int:
from pyspark.sql.functions import col

# 1 where Col1 == "Y", 0 otherwise (null stays null under the cast)
df = df.withColumn("Col1", (col("Col1") == "Y").cast("int"))
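The cast works because the equality comparison yields a boolean column, and casting a boolean to int gives 1 for true and 0 for false. The same rule can be illustrated in plain Python (a sketch of the per-row semantics, not PySpark itself):

```python
# Per-row semantics of (col("Col1") == "Y").cast("int"):
# compare each value to "Y", then convert the boolean to an int.
values = ["Y", "N", "N", "Y", "N", "Y"]
mapped = [int(v == "Y") for v in values]
print(mapped)  # [1, 0, 0, 1, 0, 1]
```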
For a more general case, you can use pyspark.sql.functions.when to implement if-then-else logic:
from pyspark.sql.functions import col, when

# 1 where Col1 is in the given set of values, 0 for everything else
df = df.withColumn("Col1", when(col("Col1").isin(["Y"]), 1).otherwise(0))
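Note that this version differs from the cast approach for null values: a null comparison is not true, so the `otherwise(0)` branch maps nulls to 0 instead of leaving them null. The per-row logic of `when(...).otherwise(...)` can be sketched in plain Python as a lookup with a default (a hypothetical helper, not PySpark API):

```python
# Sketch of when(col("Col1").isin(["Y"]), 1).otherwise(0):
# values in the matched set map to 1, everything else (including None) to 0.
def map_yn(value):
    return 1 if value in {"Y"} else 0

values = ["Y", "N", None, "Y", "N", "Y"]
mapped = [map_yn(v) for v in values]
print(mapped)  # [1, 0, 0, 1, 0, 1]
```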