SPARK DataFrame filtering: retain element belonging to a list

Views: 556 Published: 2016/5/22 16:03:29 Tags: scala apache-spark apache-spark-sql spark-dataframe apache-zeppelin

Question

I am using SPARK 1.5.1 with Scala on Zeppelin notebook.

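I have a DataFrame with a column called userID of type Long.

In total I have about 4 million rows and 200,000 unique userIDs.

I also have a list of 50,000 userIDs to exclude.

I can easily build the list of userIDs to retain.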

What is the best way to delete all the rows that belong to the users to exclude?

Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?

I saw this post (http://stackoverflow.com/questions/31396228/how-do-i-filter-rows-based-on-whether-a-column-value-is-in-a-set-of-strings-in-a?answertab=votes#tab-top) and applied its solution (see the code below), but the execution is slow, even though I am running SPARK 1.5.1 on my local machine, which has a decent 16GB of RAM, and the initial DataFrame fits in memory.

Here is the code that I am applying:

import org.apache.spark.sql.functions.lit
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)):_*))

In the code above:

initialDataFrame has 3,885,068 rows; each row has 5 columns, one of which is called userID and contains Long values.

listOfUsersToKeep is an Array[Long] and contains 150,000 Long userIDs.

I wonder if there is a more efficient solution than the one I am using.

Thanks

Recommended answer

You can use a join:

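// Turn the local list into a single-column DataFrame
val usersToKeep = sc.parallelize(
  listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")

// Inner join keeps only rows whose userID is in the list,
// then the helper column is dropped
val finalDataFrame = usersToKeep
  .join(initialDataFrame, $"userID" === $"userID_")
  .drop("userID_")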

or a broadcast variable and a UDF:

import org.apache.spark.sql.functions.udf

val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))

It should also be possible to broadcast a DataFrame:

import org.apache.spark.sql.functions.broadcast

initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")
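
The question is also phrased as removing the rows of the users to exclude, while the snippets above keep the users to retain. The broadcast variable and UDF approach can probably be adapted by negating the membership test. The following is only a sketch, not part of the original answer. It assumes that sc and sqlContext are predefined, as they are in Zeppelin and spark-shell. The names sampleDataFrame, listOfUsersToExclude, usersToExcludeBD, isNotExcluded and remainingDataFrame are hypothetical, and the tiny sample data only stands in for initialDataFrame.

```scala
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._ // usually already imported in Zeppelin / spark-shell

// Hypothetical sample data standing in for initialDataFrame
val sampleDataFrame = sc.parallelize(Seq(
  (1L, "a"), (2L, "b"), (3L, "c"), (2L, "d")
)).toDF("userID", "payload")

// Hypothetical list of userIDs to exclude
val listOfUsersToExclude: Array[Long] = Array(2L)

// Broadcast the exclusion set once so that each executor reuses the same copy
val usersToExcludeBD = sc.broadcast(listOfUsersToExclude.toSet)

// The UDF returns true for rows whose userID is NOT in the exclusion set
val isNotExcluded = udf((id: Long) => !usersToExcludeBD.value.contains(id))

// Keep only the rows that pass the check
val remainingDataFrame = sampleDataFrame.where(isNotExcluded($"userID"))
```

With 50,000 userIDs to exclude against 150,000 to retain, broadcasting the exclusion set ships the smaller of the two sets; whether that makes a measurable difference would need to be checked on the real data.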
