Adding column to PySpark DataFrame depending on whether column value is in another column
Question
I have a PySpark DataFrame with the structure given by
[('u1', 1, [1 ,2, 3]), ('u1', 4, [1, 2, 3])].toDF('user', 'item', 'fav_items')
I need to add a further column containing 1 or 0, depending on whether 'item' is in 'fav_items' or not.
So I would want
[('u1', 1, [1 ,2, 3], 1), ('u1', 4, [1, 2, 3], 0)]
How would I look up the second column in the third column to decide the value, and how would I then add it as a new column?
Recommended answer
The following code does the requested task. A user-defined function (UDF) is defined that receives two columns of the DataFrame as parameters. For each row, it checks whether the item is in the item list: if the item is found, 1 is returned, otherwise 0.
# Imports
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col, udf
# First create an RDD in order to build the DataFrame
# (sc is the SparkContext, e.g. the one predefined in the pyspark shell):
rdd = sc.parallelize([('u1', 1, [1, 2, 3]), ('u1', 4, [1, 2, 3])])
df = rdd.toDF(['user', 'item', 'fav_items'])
# Print the DataFrame
df.show()
# Define a user-defined function that takes two columns and returns 1 if item is in items, else 0
function = udf(lambda item, items: 1 if item in items else 0, IntegerType())
df.select('user', 'item', 'fav_items', function(col('item'), col('fav_items')).alias('result')).show()
The result is as follows:
+----+----+---------+
|user|item|fav_items|
+----+----+---------+
|  u1|   1|[1, 2, 3]|
|  u1|   4|[1, 2, 3]|
+----+----+---------+
+----+----+---------+------+
|user|item|fav_items|result|
+----+----+---------+------+
|  u1|   1|[1, 2, 3]|     1|
|  u1|   4|[1, 2, 3]|     0|
+----+----+---------+------+