Remove words from pyspark dataframe based on words from another pyspark dataframe


Problem Description

I want to remove the words listed in a secondary data frame from the text in a main data frame.

Here is the main data frame:

+----------+--------------------+
|  event_dt|           cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need  a line hold |
|2020-09-02|i have the  60 packs|
|2020-09-02|hello want you teach|
+----------+--------------------+

Below is the single-column secondary data frame. The words in the secondary data frame need to be removed from the cust_text column of the main data frame wherever they occur. For example, 'want' will be removed from every row in which it appears (in this example, the 1st and 4th rows).

+-------+
|column1|
+-------+
|   want|
|because|
|   need|
|  hello|
|      a|
|   have|
|     go|
+-------+

The event_dt column and each row remain as they are; only the secondary data frame's words are removed from the main data frame, giving the result below:

+----------+--------------------+
|  event_dt|           cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to        |
|2020-09-02|i line hold         |
|2020-09-02|i the 60 packs      |
|2020-09-02|you teach           |
+----------+--------------------+
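The transformation asked for here amounts to, per row, splitting cust_text on whitespace, dropping any token that appears in the secondary frame, and re-joining. A minimal pure-Python sketch of that per-row rule (the names remove_words and stop_words are illustrative, not part of any Spark job):

```python
# Stop set built from the secondary data frame's column1 values
stop_words = {"want", "because", "need", "hello", "a", "have", "go"}

def remove_words(text, stop):
    # split on whitespace (str.split() also collapses double spaces),
    # drop stop words, then re-join with single spaces
    return " ".join(w for w in text.split() if w not in stop)

for row in ["hi fine i want to go", "i need  a line hold",
            "i have the  60 packs", "hello want you teach"]:
    print(remove_words(row, stop_words))
```

This reproduces the four result rows shown in the expected-output table above.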

Appreciate any help!!

Answer

This should be a working solution for you - use array_except() to eliminate the unwanted strings; however, to do that we need a little bit of preparation.

from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([("2020-09-02","hi fine i want to go"),("2020-09-02","i need  a line hold"), ("2020-09-02", "i have the  60 packs"), ("2020-09-02", "hello want you teach")],[ "col1","col2"])

Turn the column into an array for later use:

df = df.withColumn("col2", F.split("col2", " "))
df.show(truncate=False)
df_lookup = spark.createDataFrame([(1,"want"),(1,"because"), (1, "need"), (1, "hello"),(1, "a"),(1, "give"), (1, "go")],[ "col1","col2"])
df_lookup.show()

Output

+----------+---------------------------+
|col1      |col2                       |
+----------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|
|2020-09-02|[i, need, , a, line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach]  |
+----------+---------------------------+

+----+-------+
|col1|   col2|
+----+-------+
|   1|   want|
|   1|because|
|   1|   need|
|   1|  hello|
|   1|      a|
|   1|   give|
|   1|     go|
+----+-------+

Now, just group the lookup data frame and collect all the lookup values into a variable, as below:

df_lookup_var = df_lookup.groupBy("col1").agg(F.collect_set("col2").alias("col2")).collect()[0][1]
print(df_lookup_var)
x = ",".join(df_lookup_var)
print(x)
df = df.withColumn("filter_col", F.lit(x))
df = df.withColumn("filter_col", F.split("filter_col", ","))
df.show(truncate=False)
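For clarity, here is what the collect / join / split round-trip produces on the driver side. collect()[0][1] returns the distinct lookup words as a plain Python list; the join and split then re-materialize that list as an array column on every row. The literal list below simply mirrors the lookup values shown earlier:

```python
# Driver-side view of the round-trip (values mirror df_lookup above)
df_lookup_var = ["need", "want", "a", "because", "hello", "give", "go"]
x = ",".join(df_lookup_var)
print(x)  # need,want,a,because,hello,give,go
```

One caveat: this round-trip silently breaks if a lookup word itself contains a comma; for plain words like these it is lossless.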

And this one line does it (note that the lookup used in this answer contains 'give' rather than 'have', which is why the third row below keeps 'have'):

df = df.withColumn("ArrayColumn", F.array_except("col2", "filter_col"))
df.show(truncate = False)
+----------+---------------------------+-----------------------------------------+---------------------------+
|col1      |col2                       |filter_col                               |ArrayColumn                |
+----------+---------------------------+-----------------------------------------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|[need, want, a, because, hello, give, go]|[hi, fine, i, to]          |
|2020-09-02|[i, need, , a, line, hold] |[need, want, a, because, hello, give, go]|[i, , line, hold]          |
|2020-09-02|[i, have, the, , 60, packs]|[need, want, a, because, hello, give, go]|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach]  |[need, want, a, because, hello, give, go]|[you, teach]               |
+----------+---------------------------+-----------------------------------------+---------------------------+
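If the result is needed back as a plain string column, as in the question's expected output, the array can be re-joined, e.g. with F.array_join("ArrayColumn", " ") (available since Spark 2.4). Below is a pure-Python sketch of the difference-then-join semantics, assuming list inputs; array_except_like is an illustrative name, and like Spark's real array_except it also de-duplicates the result:

```python
def array_except_like(a, b):
    # keep elements of a that are not in b, preserving first-occurrence
    # order and dropping duplicates, mirroring Spark's array_except
    seen, out = set(b), []
    for item in a:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

filter_col = ["need", "want", "a", "because", "hello", "give", "go"]
arr = array_except_like(["hi", "fine", "i", "want", "to", "go"], filter_col)
print(arr)            # ['hi', 'fine', 'i', 'to']
print(" ".join(arr))  # hi fine i to
```

The de-duplication is worth keeping in mind: if a row legitimately repeats a word, array_except keeps only its first occurrence.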
