Compare two datasets in pyspark
Question
I have 2 datasets.
Sample Dataset 1:
id | model | first_name | last_name
-----------------------------------------------------------
1234 | 32 | 456765 | [456700,987565]
-----------------------------------------------------------
4539 | 20 | 123211 | [893456,123456]
-----------------------------------------------------------
Sometimes one of the columns first_name or last_name is empty.
Sample Dataset 2:
number | matricule | name | model
----------------------------------------------------------
AA | 0009 | 456765 | 32
----------------------------------------------------------
AA | 0009 | 893456 | 32
----------------------------------------------------------
AA | 0009 | 456700 | 32
----------------------------------------------------------
AA | 0008 | 456700 | 32
----------------------------------------------------------
AA | 0008 | 987565 | 32
For one matricule we can find more than one name and model, as in my example just above. What I should do:
For each row of Dataset 1, I take the 3 columns model, first_name and last_name and look them up in Dataset 2, checking whether they exist / match among the rows of a given matricule.
I should compare:
model against model ==> if model (dataset 1) exists in model (dataset 2) ==> match
if first_name exists in name ==> no match. If first_name does not exist in name ==> match
if last_name exists in name ==> match. When I have two values in last_name, both should exist in name of dataset 2 for it to be a match.
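The three rules above can be sketched as a plain-Python predicate, to make the logic concrete before writing any Spark code (the function and variable names here are hypothetical, for illustration only):

```python
# A plain-Python sketch of the three matching rules for one row of
# dataset 1 against one matricule group of dataset 2.
def is_match(row1, group_names, group_models):
    """row1 is a dict with keys: model, first_name, last_name (a list)."""
    model_ok = row1["model"] in group_models              # rule 1: model matches
    first_ok = row1["first_name"] not in group_names      # rule 2: first_name absent
    last_ok = all(n in group_names for n in row1["last_name"])  # rule 3: all last_name present
    return model_ok and first_ok and last_ok

row1 = {"model": 32, "first_name": 456765, "last_name": [456700, 987565]}
# matricule 0009: first_name is present among the names -> no match
print(is_match(row1, {456765, 893456, 456700}, {32}))  # False
# matricule 0008: all three rules hold -> match
print(is_match(row1, {456700, 987565}, {32}))          # True
```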
Example:
Row 1 from Dataset 1 is:
id | model | first_name | last_name
------------------------------------------------------
1234 | 32 | 456765 | [456700,987565]
For matricule 0009 in dataset 2, I have:
number | matricule | name | model
----------------------------------------------------------
AA | 0009 | 456765 | 32
----------------------------------------------------------
AA | 0009 | 893456 | 32
----------------------------------------------------------
AA | 0009 | 456700 | 32
So:
first_name (456765) exists in name of dataset 2 when matricule = 0009 ==> no match
last_name: only 456700 exists ==> no match
model (32) exists in model of dataset 2 ==> match
So I skip matricule 0009 and move on to compare the same row of dataset 1 with the elements of matricule 0008.
For matricule 0008 in dataset 2, I have:
----------------------------------------------------------
AA | 0008 | 456700 | 32
----------------------------------------------------------
AA | 0008 | 987565 | 32
We are still on the first row of dataset 1:
first_name (456765) does not exist in name of dataset 2 when matricule = 0008 ==> match
last_name: both values exist in name of dataset 2 when matricule = 0008 ==> match
model exists in model of dataset 2 when matricule = 0008 ==> match
When all conditions match, I create a new dataset containing:
number | id | matricule
-----------------------------------
AA | 1234 | 0008
-----------------------------------
I hope that I was clear. Can someone help me, please?
Answer
You can use a join with the matching conditions.
First, group the second DataFrame and collect the name column into a list:
from pyspark.sql.functions import collect_list

df2 = df2.groupBy("number", "model", "matricule").agg(collect_list("name").alias("names"))
df2.show(truncate=False)
#+------+-----+---------+------------------------+
#|number|model|matricule|names |
#+------+-----+---------+------------------------+
#|AA |32 |0009 |[456765, 893456, 456700]|
#|AA |32 |0008 |[456700, 987565] |
#+------+-----+---------+------------------------+
Now, join df1 and df2. Conditions 1 and 2 are simple to check. For the third one, you can use array_except, available from Spark 2.4+ (there should be no elements of the last_name column that are not in names, and vice versa):
from pyspark.sql.functions import col, expr, lit, size

join_condition = (col("df1.model") == col("df2.model")) \
    & ~expr("array_contains(df2.names, df1.first_name)") \
    & (size(expr("array_except(df2.names, df1.last_name)")) == lit(0)) \
    & (size(expr("array_except(df1.last_name, df2.names)")) == lit(0))

df_result = df1.alias("df1").join(df2.alias("df2"), join_condition)
Finally, select the desired columns from the join result:
df_result.select("number", "id", "matricule").show(truncate=False)
#+------+----+---------+
#|number|id |matricule|
#+------+----+---------+
#|AA |1234|0008 |
#+------+----+---------+