Spark: subtract dataframes but preserve duplicate values


Problem description

Suppose I have two Spark SQL dataframes A and B. I want to subtract the items in B from the items in A while preserving duplicates from A.

I followed the instructions to use DataFrame.except() that I found in another StackOverflow question ("Spark: subtract two DataFrames"), but that function removes all duplicates from the original dataframe A.

As a conceptual example, if I have two dataframes:

words     = [the, quick, fox, a, brown, fox]
stopWords = [the, a]

then I want the output to be, in any order:

words - stopWords = [quick, brown, fox, fox]

I observed that the RDD function subtract() preserves the duplicates, but the Spark-SQL function except() removes duplicates in the resulting data frame. I don't understand why the except() output produces only unique values.
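The multiset semantics the question is after can be illustrated with plain Scala collections, whose `diff` method already performs a multiset (bag) subtraction. This is only an analogy for the desired behavior, not Spark code:

```scala
val words     = Seq("the", "quick", "fox", "a", "brown", "fox")
val stopWords = Seq("the", "a")

// Seq.diff is a multiset difference: each element of stopWords cancels
// at most one matching occurrence in words, so the second "fox" survives.
val result = words.diff(stopWords)
// result: Seq[String] = List(quick, fox, brown, fox)
```

By contrast, `except()` follows SQL `EXCEPT` semantics, which are set-based and therefore deduplicate the result.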

Here is a complete demonstration:

// ---------------------------------------------------------------
// EXAMPLE USING RDDs
// ---------------------------------------------------------------
var wordsRdd = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox"))
var stopWordsRdd = sc.parallelize(List("a", "the"))

var wordsWithoutStopWordsRdd = wordsRdd.subtract(stopWordsRdd)
wordsWithoutStopWordsRdd.take(10)
// res11: Array[String] = Array(quick, brown, fox, fox)

// ---------------------------------------------------------------
// EXAMPLE USING DATAFRAMES
// ---------------------------------------------------------------
var wordsDf = wordsRdd.toDF()
var stopWordsDf = stopWordsRdd.toDF()
var wordsWithoutStopWordsDf = wordsDf.except(stopWordsDf)

wordsWithoutStopWordsDf.show(10)
// +-----+
// |value|
// +-----+
// |  fox|
// |brown|
// |quick|
// +-----+

I want to preserve duplicates because I am generating frequency tables.

Any help would be appreciated.

Answer

val words = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox")).toDF("id")
val stopwords = sc.parallelize(List("a", "the")).toDF("id")


words.join(stopwords, words("id") === stopwords("id"), "left_outer")
     .where(stopwords("id").isNull)
     .select(words("id")).show()

The output is:

+-----+
|   id|
+-----+
|  fox|
|  fox|
|brown|
|quick|
+-----+
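The null-filtered left outer join above is effectively an anti-join; Spark 2.x and later expose this directly as the `left_anti` join type, which keeps every row of the left side (duplicates included) that has no match on the right. A minimal sketch, assuming a local Spark session, which also derives the frequency table the asker ultimately wants:

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("anti-join-demo").getOrCreate()
import spark.implicits._

val words = Seq("the", "quick", "fox", "a", "brown", "fox").toDF("id")
val stopwords = Seq("a", "the").toDF("id")

// left_anti keeps every row of `words` with no match in `stopwords`,
// duplicates included -- no explicit null filtering needed.
val result = words.join(stopwords, Seq("id"), "left_anti")
result.show()

// Because duplicates survive, a frequency table falls out directly.
result.groupBy("id").count().show()
```

Both the join-plus-`isNull` version in the answer and `left_anti` produce the same four rows; the anti-join form is simply shorter and signals intent to the optimizer.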

