Remove words from pyspark dataframe based on words from another pyspark dataframe


Problem Description

I want to remove the words listed in a secondary data frame from the text in a main data frame.

Here is the main data frame:

+----------+--------------------+
|  event_dt|           cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need  a line hold |
|2020-09-02|i have the  60 packs|
|2020-09-02|hello want you teach|
+----------+--------------------+

Below is the single-column secondary data frame. The words in the secondary data frame need to be removed from the cust_text column of the main data frame wherever they occur. For example, 'want' will be removed from every row in which it appears (in this example, the 1st and 4th rows).

+-------+
|column1|
+-------+
|   want|
|because|
|   need|
|  hello|
|      a|
|   have|
|     go|
+-------+

The event_dt column and each row remain as they are; only the secondary data frame's words are removed from the main data frame, giving the result below:

+----------+--------------------+
|  event_dt|           cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to        |
|2020-09-02|i line hold         |
|2020-09-02|i the 60 packs      |
|2020-09-02|you teach           |
+----------+--------------------+
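The transformation asked for here amounts to, per row, splitting cust_text on whitespace, dropping any token that appears in the secondary frame, and re-joining. A minimal pure-Python sketch of that per-row rule (the names remove_words and stop_words are illustrative, not part of any Spark job):

```python
# Stop set built from the secondary data frame's column1 values
stop_words = {"want", "because", "need", "hello", "a", "have", "go"}

def remove_words(text, stop):
    # split on whitespace (str.split() also collapses double spaces),
    # drop stop words, then re-join with single spaces
    return " ".join(w for w in text.split() if w not in stop)

for row in ["hi fine i want to go", "i need  a line hold",
            "i have the  60 packs", "hello want you teach"]:
    print(remove_words(row, stop_words))
```

This reproduces the four result rows shown in the expected-output table above.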

Appreciate any help!!

Answer

This should be a working solution for you - use array_except() to eliminate the unwanted strings; however, to do that we need a little bit of preparation.

from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([("2020-09-02","hi fine i want to go"),("2020-09-02","i need  a line hold"), ("2020-09-02", "i have the  60 packs"), ("2020-09-02", "hello want you teach")],[ "col1","col2"])

Turn the column into an array for later use:

df = df.withColumn("col2", F.split("col2", " "))
df.show(truncate=False)
df_lookup = spark.createDataFrame([(1,"want"),(1,"because"), (1, "need"), (1, "hello"),(1, "a"),(1, "give"), (1, "go")],[ "col1","col2"])
df_lookup.show()

Output

+----------+---------------------------+
|col1      |col2                       |
+----------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|
|2020-09-02|[i, need, , a, line, hold] |
|2020-09-02|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach]  |
+----------+---------------------------+

+----+-------+
|col1|   col2|
+----+-------+
|   1|   want|
|   1|because|
|   1|   need|
|   1|  hello|
|   1|      a|
|   1|   give|
|   1|     go|
+----+-------+

Now, just group the lookup data frame and collect all the lookup values into a variable, as below:

df_lookup_var = df_lookup.groupBy("col1").agg(F.collect_set("col2").alias("col2")).collect()[0][1]
print(df_lookup_var)
x = ",".join(df_lookup_var)
print(x)
df = df.withColumn("filter_col", F.lit(x))
df = df.withColumn("filter_col", F.split("filter_col", ","))
df.show(truncate=False)
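For clarity, here is what the collect / join / split round-trip produces on the driver side. collect()[0][1] returns the distinct lookup words as a plain Python list; the join and split then re-materialize that list as an array column on every row. The literal list below simply mirrors the lookup values shown earlier:

```python
# Driver-side view of the round-trip (values mirror df_lookup above)
df_lookup_var = ["need", "want", "a", "because", "hello", "give", "go"]
x = ",".join(df_lookup_var)
print(x)  # need,want,a,because,hello,give,go
```

One caveat: this round-trip silently breaks if a lookup word itself contains a comma; for plain words like these it is lossless.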

And this one line does it (note that the lookup used in this answer contains 'give' rather than 'have', which is why the third row below keeps 'have'):

df = df.withColumn("ArrayColumn", F.array_except("col2", "filter_col"))
df.show(truncate = False)
+----------+---------------------------+-----------------------------------------+---------------------------+
|col1      |col2                       |filter_col                               |ArrayColumn                |
+----------+---------------------------+-----------------------------------------+---------------------------+
|2020-09-02|[hi, fine, i, want, to, go]|[need, want, a, because, hello, give, go]|[hi, fine, i, to]          |
|2020-09-02|[i, need, , a, line, hold] |[need, want, a, because, hello, give, go]|[i, , line, hold]          |
|2020-09-02|[i, have, the, , 60, packs]|[need, want, a, because, hello, give, go]|[i, have, the, , 60, packs]|
|2020-09-02|[hello, want, you, teach]  |[need, want, a, because, hello, give, go]|[you, teach]               |
+----------+---------------------------+-----------------------------------------+---------------------------+
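If the result is needed back as a plain string column, as in the question's expected output, the array can be re-joined, e.g. with F.array_join("ArrayColumn", " ") (available since Spark 2.4). Below is a pure-Python sketch of the difference-then-join semantics, assuming list inputs; array_except_like is an illustrative name, and like Spark's real array_except it also de-duplicates the result:

```python
def array_except_like(a, b):
    # keep elements of a that are not in b, preserving first-occurrence
    # order and dropping duplicates, mirroring Spark's array_except
    seen, out = set(b), []
    for item in a:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

filter_col = ["need", "want", "a", "because", "hello", "give", "go"]
arr = array_except_like(["hi", "fine", "i", "want", "to", "go"], filter_col)
print(arr)            # ['hi', 'fine', 'i', 'to']
print(" ".join(arr))  # hi fine i to
```

The de-duplication is worth keeping in mind: if a row legitimately repeats a word, array_except keeps only its first occurrence.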
