Pyspark dataframes: Extract a column based on the value of another column
Question
I have a dataframe with the following columns and corresponding values (forgive my formatting, but I don't know how to put it in table format):
+------+------+---+---+---+---+
|Src_ip|dst_ip|V1 |V2 |V3 |top|
+------+------+---+---+---+---+
|A     |B     |xx |yy |zz |V1 |
+------+------+---+---+---+---+
Now I want to add a column, let's say top_value, which takes the value of the column whose name is stored as a string in the top column.
+------+------+---+---+---+---+---------+
|Src_ip|dst_ip|V1 |V2 |V3 |top|top_value|
+------+------+---+---+---+---+---------+
|A     |B     |xx |yy |zz |V1 |xx       |
+------+------+---+---+---+---+---------+
So basically, get the value from the column named by the value in the "top" column, and put it in a new column named "top_value".
I have tried creating UDFs, as well as using the string as an alias, but have been unable to do so. Can anyone please help?
Recommended answer
You can collect the V1, V2 and V3 columns as a struct and pass it, together with the top column, to a udf function that extracts the value. In Scala:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Look up the field of the struct whose name matches the value in `top`
def findValueUdf = udf((strct: Row, top: String) => strct.getAs[String](top))

df.withColumn("top_value", findValueUdf(struct("V1", "V2", "V3"), col("top")))
which should give you
+------+------+---+---+---+---+---------+
|Src_ip|dst_ip|V1 |V2 |V3 |top|top_value|
+------+------+---+---+---+---+---------+
|A |B |xx |yy |zz |V1 |xx |
+------+------+---+---+---+---+---------+
The equivalent code in pyspark would be
from pyspark.sql import functions as f
from pyspark.sql import types as t

# Look up the field of the struct whose name matches the value in `top`
def findValueUdf(strct, top):
    return strct[top]

FVUdf = f.udf(findValueUdf, t.StringType())

df.withColumn("top_value", FVUdf(f.struct("V1", "V2", "V3"), f.col("top")))
Moreover, you can define the column names in a list to be used in the struct function so that you don't have to hard-code them.
I hope the answer is helpful.