Pyspark dataframes: Extract a column based on the value of another column
Question
I have a dataframe with the following columns and corresponding values (forgive my formatting, but I don't know how to put it in table format):
+------+------+---+---+---+---+
|Src_ip|dst_ip|V1 |V2 |V3 |top|
+------+------+---+---+---+---+
|A     |B     |xx |yy |zz |V1 |
+------+------+---+---+---+---+
Now I want to add a column, let's say top_value, which takes the value of the column whose name is stored as a string in the top column.
+------+------+---+---+---+---+---------+
|Src_ip|dst_ip|V1 |V2 |V3 |top|top_value|
+------+------+---+---+---+---+---------+
|A     |B     |xx |yy |zz |V1 |xx       |
+------+------+---+---+---+---+---------+
So basically, get the value from the column named by the value in the "top" column, and put it in a new column named "top_value".
I have tried creating UDFs, as well as using the string as an alias, but have been unable to do so. Can anyone please help?
Recommended answer
You can collect the V1, V2 and V3 columns as a struct and pass it, together with the top column, to a udf function that extracts the value. In Scala:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Look up the field of the struct whose name matches the value in `top`
def findValueUdf = udf((strct: Row, top: String) => strct.getAs[String](top))

df.withColumn("top_value", findValueUdf(struct("V1", "V2", "V3"), col("top")))
which should give you
+------+------+---+---+---+---+---------+
|Src_ip|dst_ip|V1 |V2 |V3 |top|top_value|
+------+------+---+---+---+---+---------+
|A |B |xx |yy |zz |V1 |xx |
+------+------+---+---+---+---+---------+
The equivalent code in pyspark would be
from pyspark.sql import functions as f
from pyspark.sql import types as t

# Look up the field of the struct whose name matches the value in `top`
def findValueUdf(strct, top):
    return strct[top]

FVUdf = f.udf(findValueUdf, t.StringType())

df.withColumn("top_value", FVUdf(f.struct("V1", "V2", "V3"), f.col("top")))
Moreover, you can define the column names in a list to be used in the struct function so that you don't have to hard-code them.
I hope the answer is helpful.