Comparing two arrays and getting the difference in PySpark
Question
I have two array fields in a data frame.
I need to compare these two arrays and get the difference as an array (new column) in the same data frame.
The expected output is:
Column B is a subset of column A. Also, the words will appear in the same order in both arrays.
Can anyone please help me find a solution for this?
Recommended answer
You can use a user-defined function (UDF). My example dataframe differs a bit from yours, but the code should work fine:
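import pandas as pd
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# example df (assumes an existing sqlContext, as in the original snippet)
df = sqlContext.createDataFrame(pd.DataFrame(
    data=[[["hello", "world"], ["world"]],
          [["sample", "overflow", "text"], ["sample", "text"]]],
    columns=["A", "B"]))

# define udf: set difference of A and B, returned as an array of strings
differencer = udf(lambda x, y: list(set(x) - set(y)), ArrayType(StringType()))
df = df.withColumn('difference', differencer('A', 'B'))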
This does not work if there are duplicates, because a set retains only unique elements (and does not preserve their order). So you can amend the UDF as follows:
# keep every element of A that does not appear in B, preserving order and duplicates
differencer = udf(lambda x, y: [elt for elt in x if elt not in y], ArrayType(StringType()))
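To check what the list-comprehension version returns without starting Spark, the same logic can be tried on plain Python lists. The sketch below is only an illustration: the function names `difference_keep_duplicates` and `difference_multiset` and the sample data are made up here and are not part of the original answer. The second function is an assumption about what may be wanted when a word occurs more often in A than in B: it removes one occurrence from A per occurrence in B, instead of removing every match.

```python
from collections import Counter


def difference_keep_duplicates(a, b):
    """Elements of a that do not occur in b at all, in a's order."""
    exclude = set(b)  # set lookup instead of scanning the list b each time
    return [elt for elt in a if elt not in exclude]


def difference_multiset(a, b):
    """Remove one occurrence from a per occurrence in b, keeping a's order."""
    remaining = Counter(b)
    result = []
    for elt in a:
        if remaining[elt] > 0:
            remaining[elt] -= 1  # this occurrence is matched by one in b
        else:
            result.append(elt)
    return result


# hypothetical sample data with repeated words
a = ["sample", "overflow", "text", "overflow", "sample"]
b = ["sample", "text"]
print(difference_keep_duplicates(a, b))  # ['overflow', 'overflow']
print(difference_multiset(a, b))         # ['overflow', 'overflow', 'sample']
```

Either function could be passed to udf(..., ArrayType(StringType())) in place of the lambda, provided neither column contains nulls; null arrays would reach the function as None and would need separate handling.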