Spark: subtract dataframes but preserve duplicate values
Question
Suppose I have two Spark SQL dataframes A and B. I want to subtract the items in B from the items in A while preserving duplicates from A.
I followed the instructions to use DataFrame.except() that I found in another StackOverflow question ("Spark: subtract two DataFrames"), but that function removes all duplicates from the original dataframe A.
As a conceptual example, if I have two dataframes:
words = [the, quick, fox, a, brown, fox]
stopWords = [the, a]
then I want the output to be, in any order:
words - stopWords = [quick, brown, fox, fox]
I observed that the RDD function subtract() preserves the duplicates, but the Spark-SQL function except() removes duplicates from the resulting dataframe. I don't understand why the except() output contains only unique values.
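The difference between the two semantics can be sketched without Spark at all: except() behaves like a set difference, while subtract() in this example behaves like a multiset-style difference on the left side. A minimal plain-Scala illustration (the value names here are invented for the sketch):

```scala
// Plain-Scala sketch of the two behaviors (no Spark needed).
val words = List("the", "quick", "fox", "a", "brown", "fox")
val stopWords = List("the", "a")

// Set-difference semantics, like DataFrame.except(): duplicates collapse.
val exceptLike = words.distinct.filterNot(stopWords.contains)

// Multiset-style semantics, like RDD.subtract() in this example:
// every non-matching occurrence survives, so "fox" appears twice.
val subtractLike = words.filterNot(stopWords.contains)

println(exceptLike)   // List(quick, fox, brown)
println(subtractLike) // List(quick, fox, brown, fox)
```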
Here is a complete demonstration:
// ---------------------------------------------------------------
// EXAMPLE USING RDDs
// ---------------------------------------------------------------
var wordsRdd = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox"))
var stopWordsRdd = sc.parallelize(List("a", "the"))
var wordsWithoutStopWordsRdd = wordsRdd.subtract(stopWordsRdd)
wordsWithoutStopWordsRdd.take(10)
// res11: Array[String] = Array(quick, brown, fox, fox)
// ---------------------------------------------------------------
// EXAMPLE USING DATAFRAMES
// ---------------------------------------------------------------
var wordsDf = wordsRdd.toDF()
var stopWordsDf = stopWordsRdd.toDF()
var wordsWithoutStopWordsDf = wordsDf.except(stopWordsDf)
wordsWithoutStopWordsDf.show(10)
// +-----+
// |value|
// +-----+
// | fox|
// |brown|
// |quick|
// +-----+
I want to preserve duplicates because I am generating frequency tables.
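To make the motivation concrete: once the subtraction keeps both "fox" rows, a frequency count comes out right. A plain-Scala sketch (illustrative names, not part of the original code):

```scala
// After a duplicate-preserving subtraction, counting yields true frequencies.
val kept = List("quick", "brown", "fox", "fox")
val freq = kept.groupBy(identity).map { case (w, ws) => (w, ws.size) }
// freq("fox") is 2; a set-based except() would have reported 1.
```

In Spark itself the analogous step would be a groupBy followed by count() on the filtered dataframe.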
Any help would be appreciated.
Answer
val words = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox")).toDF("id")
val stopwords = sc.parallelize(List("a", "the")).toDF("id")
// Left outer join, then keep only the rows with no match in stopwords;
// every surviving row of words is kept, duplicates included.
words.join(stopwords, words("id") === stopwords("id"), "left_outer")
  .where(stopwords("id").isNull)
  .select(words("id")).show()
The output is:
+-----+
| id|
+-----+
| fox|
| fox|
|brown|
|quick|
+-----+
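As a possible simplification (an assumption on my part, not from the answer above): with Spark 2.0 or later, the null-filtered outer join can be written as a single join with join type "left_anti", which likewise keeps duplicates from the left side. The helper below mimics left-anti semantics in plain Scala so the result can be checked without a cluster; leftAnti is a name invented for this sketch:

```scala
// With Spark 2.0+, the answer above can be expressed as (assumed API usage):
//   words.join(stopwords, Seq("id"), "left_anti").show()
// This plain-Scala helper mimics left-anti semantics: keep every left row,
// duplicates included, that has no match on the right.
def leftAnti[A](left: Seq[A], right: Seq[A]): Seq[A] = {
  val rightKeys = right.toSet
  left.filterNot(rightKeys.contains)
}

val result = leftAnti(List("the", "quick", "fox", "a", "brown", "fox"), List("a", "the"))
println(result) // List(quick, fox, brown, fox)
```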