Compare two datasets in PySpark
Question

I have 2 datasets.

Sample dataset 1:
id   | model | first_name | last_name
-----|-------|------------|------------------
1234 | 32    | 456765     | [456700, 987565]
4539 | 20    | 123211     | [893456, 123456]
Sometimes one of the columns first_name or last_name is empty.
Sample dataset 2:
number | matricule | name   | model
-------|-----------|--------|------
AA     | 0009      | 456765 | 32
AA     | 0009      | 893456 | 32
AA     | 0009      | 456700 | 32
AA     | 0008      | 456700 | 32
AA     | 0008      | 987565 | 32
For one matricule we can find several name and model values, as in the example just above.

What I should do:
For each row of Dataset 1, I take the 3 columns model, first_name and last_name and look for them in Dataset 2, checking whether they exist / match within the rows of each matricule.
I should compare:

- model: if model (dataset 1) exists in model (dataset 2) ==> match
- first_name: if first_name exists in name ==> no match; if first_name does not exist in name ==> match
- last_name: if last_name exists in name ==> match. When I have two values of last_name, both should exist in name of dataset 2 to be a match.
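As a sanity check, the three rules above can be sketched in plain Python for one dataset-1 row against one matricule group (the helper name and dict layout are just for illustration, not part of the question):

```python
def matches(row, group_names, group_models):
    """Apply the three rules for one dataset-1 row against one matricule group."""
    model_ok = row["model"] in group_models                    # rule 1: model must exist
    first_ok = row["first_name"] not in group_names            # rule 2: first_name must NOT exist
    last_ok = all(v in group_names for v in row["last_name"])  # rule 3: every last_name value exists
    return model_ok and first_ok and last_ok

row = {"id": 1234, "model": 32, "first_name": 456765, "last_name": [456700, 987565]}
print(matches(row, {456765, 893456, 456700}, {32}))  # matricule 0009 -> False
print(matches(row, {456700, 987565}, {32}))          # matricule 0008 -> True
```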
Example:

Row 1 of dataset 1 is:

id   | model | first_name | last_name
-----|-------|------------|------------------
1234 | 32    | 456765     | [456700, 987565]
For matricule 0009 in dataset 2, I have:
number | matricule | name   | model
-------|-----------|--------|------
AA     | 0009      | 456765 | 32
AA     | 0009      | 893456 | 32
AA     | 0009      | 456700 | 32
So:

- first_name (456765) exists in name of dataset 2 when matricule = 0009 ==> no match
- last_name: only 456700 exists ==> no match
- model (32) exists in model of dataset 2 ==> match
So I skip matricule 0009 and go on to compare the same row against the elements of matricule 0008.
For matricule 0008 in dataset 2, I have:
number | matricule | name   | model
-------|-----------|--------|------
AA     | 0008      | 456700 | 32
AA     | 0008      | 987565 | 32
We are still on the first row of dataset 1:

- first_name (456765) does not exist in name of dataset 2 when matricule = 0008 ==> match
- last_name: both values exist in name of dataset 2 when matricule = 0008 ==> match
- model (32) exists in model of dataset 2 ==> match
When everything matches, I create a new dataset containing:
number | id   | matricule
-------|------|----------
AA     | 1234 | 0008
I hope that I was clear. Can someone help me, please?
Answer
You can use a join with the matching conditions.
First, group the second DataFrame and collect the name column into a list:
from pyspark.sql.functions import collect_list

df2 = df2.groupBy("number", "model", "matricule").agg(collect_list("name").alias("names"))
df2.show(truncate=False)
#+------+-----+---------+------------------------+
#|number|model|matricule|names |
#+------+-----+---------+------------------------+
#|AA |32 |0009 |[456765, 893456, 456700]|
#|AA |32 |0008 |[456700, 987565] |
#+------+-----+---------+------------------------+
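If you don't have a Spark session handy, the effect of groupBy + collect_list on the sample data can be mimicked in plain Python (the rows below are copied from the question; this is only an illustration of what the grouped DataFrame holds):

```python
from collections import defaultdict

rows = [  # (number, matricule, name, model) as in dataset 2
    ("AA", "0009", 456765, 32),
    ("AA", "0009", 893456, 32),
    ("AA", "0009", 456700, 32),
    ("AA", "0008", 456700, 32),
    ("AA", "0008", 987565, 32),
]

grouped = defaultdict(list)
for number, matricule, name, model in rows:
    # same grouping key as groupBy("number", "model", "matricule")
    grouped[(number, model, matricule)].append(name)

for key, names in sorted(grouped.items()):
    print(key, names)
# ('AA', 32, '0008') [456700, 987565]
# ('AA', 32, '0009') [456765, 893456, 456700]
```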
Now, join df1 and df2. Conditions 1 and 2 are fairly simple to check. For the third one, you can use array_except, available from Spark 2.4+ (there should be no elements of the last_name column that are not in names, and vice versa):
from pyspark.sql.functions import col, expr, lit, size

join_condition = (col("df1.model") == col("df2.model")) \
    & ~expr("array_contains(df2.names, df1.first_name)") \
    & (size(expr("array_except(df2.names, df1.last_name)")) == lit(0)) \
    & (size(expr("array_except(df1.last_name, df2.names)")) == lit(0))

df_result = df1.alias("df1").join(df2.alias("df2"), join_condition)
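For intuition: array_except(a, b) returns the elements of a that are absent from b, so the two size(...) == 0 checks together assert that last_name and names contain the same values. A rough pure-Python model of that semantics (my own helper, not the Spark API):

```python
def array_except(a, b):
    """Rough model of Spark's array_except: elements of a absent from b, deduplicated, order kept."""
    seen, out = set(b), []
    for x in a:
        if x not in seen:
            out.append(x)
            seen.add(x)
    return out

names = [456700, 987565]      # collect_list result for matricule 0008
last_name = [456700, 987565]  # last_name of the first dataset-1 row
print(array_except(names, last_name))  # [] -> size 0, condition holds
print(array_except(last_name, names))  # [] -> size 0, condition holds
```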
Finally, select the desired columns from the join result:
df_result.select("number", "id", "matricule").show(truncate=False)
#+------+----+---------+
#|number|id |matricule|
#+------+----+---------+
#|AA |1234|0008 |
#+------+----+---------+
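Putting both steps together, the whole pipeline can be replayed in plain Python on the sample data to confirm the expected (number, id, matricule) output (variable names are mine; rule 3 uses an empty symmetric difference, which is equivalent to the two array_except checks up to duplicates):

```python
from collections import defaultdict

dataset1 = [(1234, 32, 456765, [456700, 987565]),
            (4539, 20, 123211, [893456, 123456])]  # (id, model, first_name, last_name)
dataset2 = [("AA", "0009", 456765, 32), ("AA", "0009", 893456, 32),
            ("AA", "0009", 456700, 32), ("AA", "0008", 456700, 32),
            ("AA", "0008", 987565, 32)]            # (number, matricule, name, model)

# step 1: group dataset 2 by (number, model, matricule), collecting names
groups = defaultdict(list)
for number, matricule, name, model in dataset2:
    groups[(number, model, matricule)].append(name)

# step 2: apply the join condition row by row
result = []
for id_, model, first_name, last_name in dataset1:
    for (number, g_model, matricule), names in groups.items():
        if (model == g_model                          # rule 1: model matches
                and first_name not in names           # rule 2: first_name absent
                and not set(last_name) ^ set(names)): # rule 3: same set of values
            result.append((number, id_, matricule))

print(result)  # [('AA', 1234, '0008')]
```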