比较pyspark中的两个数据集 [英] Compare two datasets in pyspark

查看:38
本文介绍了比较pyspark中的两个数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有 2 个数据集.

示例数据集 1:

id     |   model |   first_name   |      last_name
-----------------------------------------------------------
1234   |   32    |    456765      |   [456700,987565]
-----------------------------------------------------------
4539   |   20    |    123211      |   [893456,123456]
-----------------------------------------------------------

有时列 first_name 和 last_name 之一是空的.

Some times one of the columns first_name and last_name is empty.

示例数据集 2:

number  |  matricule   | name       |    model
----------------------------------------------------------
AA      |  0009        |  456765    |     32
----------------------------------------------------------
AA      |  0009        |  893456    |     32
----------------------------------------------------------
AA      |  0009        |  456700    |     32
----------------------------------------------------------
AA      |  0008        |  456700    |     32
----------------------------------------------------------
AA      |  0008        |  987565    |     32

对于一个 matricule,我们可以找到更多的 namemodel,就像我上面的例子一样.我应该怎么做:

For one matricule we can find more name and model, like in my example just above. What I should do:

对于数据集 1 中的每一行,我取 3 列:model、first_name 和 last_name,如果根据矩阵元素存在/匹配,则在数据集 2 中查找它们.

For each row from the Dataset 1, I take the 3 columns: model, first_name and last_name and look for them in Dataset 2 if exist / match according the matricule elements.

我应该比较:

  • 按模型建模 ==> 如果模型(数据集 1)存在于模型(数据集 2)中 ==> 匹配

  • model by model ==> if model (dataset 1) exist in model (dataset 2) ==> match

如果名字中存在 first_name ==> 不匹配.如果名字中不存在 first_name ==> 匹配

if first_name exist in name ==> no match. If first_name not exist in name ==> match

如果名字中存在last_name ==> 匹配.当我有两个 last_name 值时,两者都应该存在于要匹配的数据集 2 的名称中.

if last_name exist in name ==> match. When I have two values of last_name, the both should exist in name of dataset 2 to be matched.

示例:

数据集 1 的第 1 行是:

Rows 1 from Dataset 1 is:

id     |   model |   first_name   |      last_name
------------------------------------------------------
1234   |   32    |    456765      |   [456700,987565]

对于数据集 2 中的矩阵 0009,我有:

For matricule 0009 in dataset 2, I have:

number  |  matricule   | name       |    model
----------------------------------------------------------
AA      |  0009        |  456765    |     32
----------------------------------------------------------
AA      |  0009        |  893456    |     32
----------------------------------------------------------
AA      |  0009        |  456700    |     32

所以:

first_name (456765) 在matriule =00​​09 ==> 不匹配时存在于数据集2的名称中

first_name (456765) is exist in name of dataset 2 when matriule =0009 ==> no match

last_name,只有 456700 存在 ==> 不匹配

last_name, only 456700 is exist ==> no match

模型 (32) 存在于数据集 2 的模型中 ==> 匹配

model (32) is exist in model of dataset 2 ==> match

所以我跳过了矩阵 0009.并通过比较数据集 1 中的第二行与矩阵 0008 的元素.

So I skip the matricule 0009. And pass to compare second line in dataset 1 with the elements of matricule 0008.

对于数据集 2 中的矩阵 0008,我有:

For matricule 0008 in dataset 2, I have:

----------------------------------------------------------
AA      |  0008        |  456700    |     32
----------------------------------------------------------
AA      |  0008        |  987565    |     32

我们总是在数据集 1 的第一行:

Always we are in the first rows of dataset 1:

当 matricule=0008 ==> 匹配时,数据集 2 的名称中不存在 first_name (456765)

first_name (456765) is not exist in name of dataset 2 when matricule=0008 ==> match

last_name,当matricule = 0008,==>匹配时,这两个值都存在于数据集2的名称中

last_name, the both values are exist in name of dataset 2 when matricule = 0008, ==> match

当矩阵=0008==>匹配时,模型存在于数据集2的模型中

model is exist in model of dataset 2 when matricule =0008==> match

当我找到所有匹配项时,我创建了一个新的数据集,其中包含:

When I find all match, I create a new dataset contain:

number | id     |  matricule
-----------------------------------
AA     | 1234   | 0008
-----------------------------------

我希望我说的很清楚.有人可以帮助我.

I hope that I was clear. Someone can help me please.

推荐答案

您可以在匹配条件上使用 join.

You can use join on the conditions of matching.

首先,您可以按第二个 DataFrame 分组并将 name 列收集到一个列表中:

First, you can group by the second DataFrame and collect name column into a list:

df2 = df2.groupBy("number", "model", "matricule").agg(collect_list("name").alias("names"))
f2.show(truncate=False)

#+------+-----+---------+------------------------+
#|number|model|matricule|names                   |
#+------+-----+---------+------------------------+
#|AA    |32   |0009     |[456765, 893456, 456700]|
#|AA    |32   |0008     |[456700, 987565]        |
#+------+-----+---------+------------------------+

现在,加入 df1df2.对于条件 1 和 2,检查起来很简单.对于第三个,您可以使用 array_except 可从 Spark 2.4+ 获得(last_name 列中不应该有不在 names 中的元素,反之亦然):

Now, join df1 and df2. For conditions 1 and 2, it is somehow simple to check. For the third one, you can use array_except avaliable from Spark 2.4+ (there should be no elements from last_name column that are not in names and vice-versa):

join_condition = (col("df1.model") == col("df2.model")) \
                 & ~expr("array_contains(df2.names, df1.first_name)") \
                 & (size(expr("array_except(df2.names, df1.last_name)")) == lit(0)) \
                 & (size(expr("array_except(df1.last_name, df2.names)")) == lit(0))


df_result = df1.alias("df1").join(df2.alias("df2"), join_condition)

最后,从连接结果中选择所需的列:

Finally, select desired columns from the join result:

df_result.select("number", "id", "matricule").show(truncate=False)

#+------+----+---------+
#|number|id  |matricule|
#+------+----+---------+
#|AA    |1234|0008     |
#+------+----+---------+

这篇关于比较pyspark中的两个数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆