Comparing two arrays and getting the difference in PySpark

Views: 30 | Published: 2021/11/14 21:00:01 | Tags: python pyspark apache-spark-sql spark-dataframe apache-spark-mllib

Problem description

I have two array fields in a data frame.

I need to compare these two arrays and get the difference as an array (in a new column) in the same data frame.

The expected output is:

Column B is a subset of column A. Also, the words are in the same order in both arrays.

Can anyone please help me find a solution for this?

Recommended answer

You can use a user-defined function (UDF). My example dataframe differs a bit from yours, but the code should work fine:

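import pandas as pd
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# example df (assumes an existing sqlContext, as in the original answer)
df = sqlContext.createDataFrame(pd.DataFrame(
    data=[[["hello", "world"], ["world"]],
          [["sample", "overflow", "text"], ["sample", "text"]]],
    columns=["A", "B"]))

# define a udf that returns the elements of A that are not in B
differencer = udf(lambda x, y: list(set(x) - set(y)), ArrayType(StringType()))
df = df.withColumn('difference', differencer('A', 'B'))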

This does not work if there are duplicates, because a set retains only unique items (it also does not preserve the original order). So you can amend the udf as follows:

# keep every element of A that does not appear in B, in A's original order
differencer = udf(lambda x, y: [elt for elt in x if elt not in y], ArrayType(StringType()))
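
To make the difference between the two variants concrete, here is a minimal plain-Python sketch of the two lambdas, run outside Spark. The function names `set_difference` and `ordered_difference` and the sample row are hypothetical and chosen only for illustration; they are not part of the original answer.

```python
# Plain-Python versions of the two lambdas passed to udf in the answer.
# The names and the sample data below are illustrative assumptions.

def set_difference(x, y):
    """Elements of x not in y; duplicates collapse and order is not guaranteed."""
    return list(set(x) - set(y))


def ordered_difference(x, y):
    """Elements of x not in y; keeps x's order and any repeated elements."""
    return [elt for elt in x if elt not in y]


a = ["sample", "overflow", "overflow", "text", "stack"]  # hypothetical column A value
b = ["sample", "text"]                                   # hypothetical column B value

# sorted() is used only to make the set-based result deterministic for printing
print(sorted(set_difference(a, b)))   # ['overflow', 'stack']
print(ordered_difference(a, b))       # ['overflow', 'overflow', 'stack']
```

Note that the list-comprehension variant drops every occurrence of a value that appears anywhere in B, so it is not a count-aware (multiset) difference. Whether that matters depends on the data.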
