Spark DataFrame 过滤:保留属于列表的元素 [英] Spark DataFrame filtering: retain element belonging to a list

查看：50 发布时间：2021/11/14 23:52:08 scala apache-spark dataframe apache-spark-sql apache-zeppelin

本文介绍了Spark DataFrame 过滤:保留属于列表的元素的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我在 Zeppelin 笔记本上使用 Spark 1.5.1 和 Scala.

I am using Spark 1.5.1 with Scala on Zeppelin notebook.

我有一个 DataFrame，其中有一列名为 userID 的 Long 类型.
我总共有大约 400 万行和 200,000 个唯一用户 ID.
我还有一个要排除的 50,000 个用户 ID 的列表.
我可以轻松构建要保留的用户 ID 列表.

删除属于要排除的用户的所有行的最佳方法是什么?

What is the best way to delete all the rows that belong to the users to exclude?

提出相同问题的另一种方法是:保留属于用户的行的最佳方法是什么?

Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?

我看到了这篇文章并应用了它的解决方案(见下面的代码)，但执行速度很慢，因为我知道我正在运行 SPARK 1.5.1 在我的本地机器上，我有 16GB 的不错的 RAM 内存，并且初始 DataFrame 适合内存.

I saw this post and applied its solution (see the code below), but the execution is slow, knowing that I am running SPARK 1.5.1 on my local machine, an I have decent RAM memory of 16GB and the initial DataFrame fits in the memory.

这是我正在申请的代码:

Here is the code that I am applying:

import org.apache.spark.sql.functions.lit
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)):_*))

在上面的代码中:

initialDataFrame 有 3885068 行，每行有 5 列，其中一列称为 userID，它包含 Long 值.
listOfUsersToKeep 是一个数组[Long]，它包含 150,000 个 Long 用户 ID.

我想知道是否有比我正在使用的解决方案更有效的解决方案.

I wonder if there is a more efficient solution than the one I am using.

谢谢

推荐答案

你可以使用join:

val usersToKeep = sc.parallelize(
  listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")

val finalDataFrame = usersToKeep
  .join(initialDataFrame, $"userID" === $"userID_")
  .drop("userID_")

或广播变量和 UDF:

or a broadcast variable and an UDF:

import org.apache.spark.sql.functions.udf

val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))

也应该可以广播数据帧:

It should be also possible to broadcast a DataFrame:

import org.apache.spark.sql.functions.broadcast

initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")

这篇关于Spark DataFrame 过滤:保留属于列表的元素的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Spark DataFrame 过滤:保留属于列表的元素 [英] Spark DataFrame filtering: retain element belonging to a list

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

Spark DataFrame 过滤:保留属于列表的元素 [英] Spark DataFrame filtering: retain element belonging to a list

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭