PySpark DataFrames: extract a column based on the value of another column


Question

I have a DataFrame with the following columns and corresponding values (forgive the formatting, I don't know how to put it in table format):

Src_ip     dst_ip     V1     V2     V3     top
"A"         "B"       xx     yy     zz     "V1"

Now I want to add a column, say top_value, which takes the value of the column named by the string in top.

Src_ip     dst_ip     V1     V2     V3     top   top_value
"A"         "B"       xx     yy     zz     "V1"     xx

So basically: for each row, look up the column whose name is stored in "top" and put its value in a new column named "top_value".

I have tried creating UDFs, and also using the string as an alias, but could not get either to work. Can anyone please help?

Recommended answer

You can collect the V1, V2 and V3 columns into a struct, pass it to a udf function together with the top column, and extract the value. In Scala:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Return the field of the struct whose name is stored in `top`.
val findValueUdf = udf((strct: Row, top: String) => strct.getAs[String](top))

df.withColumn("top_value", findValueUdf(struct("V1", "V2", "V3"), col("top")))

which should give you

+------+------+---+---+---+---+---------+
|Src_ip|dst_ip|V1 |V2 |V3 |top|top_value|
+------+------+---+---+---+---+---------+
|A     |B     |xx |yy |zz |V1 |xx       |
+------+------+---+---+---+---+---------+

PySpark

The equivalent code in PySpark would be

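from pyspark.sql import functions as f
from pyspark.sql import types as t

def findValueUdf(strct, top):
    # strct is a Row built from V1, V2, V3; index it by the name held in top
    return strct[top]

# The return type is declared as string, matching the sample values
FVUdf = f.udf(findValueUdf, t.StringType())

df.withColumn("top_value", FVUdf(f.struct("V1", "V2", "V3"), f.col("top")))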

Moreover, you can define the column names in a list and pass that list to the struct function, so that you don't have to hard-code them.
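
The list-based variant mentioned above could look roughly like the sketch below. The names VALUE_COLS, find_value and add_top_value are made up for illustration and are not part of the original answer. It assumes that every value in top is one of the listed column names and that those columns hold strings.

```python
# Hypothetical list of candidate columns; adjust it to the actual schema.
VALUE_COLS = ["V1", "V2", "V3"]


def find_value(strct, top):
    """Return the field of `strct` whose name is stored in `top`."""
    return strct[top]


def add_top_value(df, value_cols=VALUE_COLS):
    """Return `df` with a `top_value` column looked up through `top`."""
    # Imported here so that find_value can be tried without Spark installed.
    from pyspark.sql import functions as f
    from pyspark.sql import types as t

    fv_udf = f.udf(find_value, t.StringType())
    # Unpack the list into struct() so no column name is hard-coded here.
    return df.withColumn("top_value", fv_udf(f.struct(*value_cols), f.col("top")))
```

With the DataFrame from the question, add_top_value(df).show() should then display the extra top_value column. If top may contain a name that is not in the list, the lookup fails inside the UDF, so you may want to guard against that case.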

I hope this answer is helpful.
