What's the most efficient way to filter a DataFrame


Question


... by checking whether a column's value is in a seq.
Perhaps I'm not explaining it very well, but I basically want this (to express it in regular SQL): DF_Column IN seq?

First I did it using a broadcast var (where I placed the seq), a UDF (that did the checking) and registerTempTable.
The problem is that I didn't get to test it since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.

I ended up creating a new DataFrame out of seq and doing an inner join with it (an intersection), but I doubt that's the most performant way of accomplishing the task.
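For reference, here is a minimal sketch of that join-based approach, written against the current SparkSession API (the original question predates it and used SQLContext); the names ordered and login, and the sample values, are assumptions taken from the example data below:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The DataFrame to filter, plus the seq of allowed values (sample data assumed).
val ordered = Seq(("login1", 192), ("login2", 146), ("login3", 72)).toDF("login", "count")
val allowed = Seq("login2", "login3", "login4")

// Turn the seq into a one-column DataFrame and inner-join on "login":
// only rows whose login appears in the seq survive the join.
val filtered = ordered.join(allowed.toDF("login"), "login")
filtered.show()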

Thanks

EDIT: (in response to @YijieShen):
How do I filter based on whether the elements of one DataFrame's column are in another DF's column (like the SQL select * from A where login in (select username from B))?

E.g: First DF:

login      count
login1     192  
login2     146  
login3     72   

Second DF:

username
login2
login3
login4

The result:

login      count
login2     146  
login3     72   
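For what it's worth, the SQL form above runs as-is on Spark 2.0+ once both DataFrames are registered as temp views (createOrReplaceTempView is the 2.x successor of registerTempTable, and empLogins is the name the attempts below use for the second DF). A sketch, continuing from the setup above:

val empLogins = Seq("login2", "login3", "login4").toDF("username")

ordered.createOrReplaceTempView("A")    // login, count
empLogins.createOrReplaceTempView("B")  // username

// IN subqueries are supported in Spark SQL since 2.0.
spark.sql("SELECT * FROM A WHERE login IN (SELECT username FROM B)").show()
// +------+-----+
// | login|count|
// +------+-----+
// |login2|  146|
// |login3|   72|
// +------+-----+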

Attempts:
EDIT-2: I think, now that the bug is fixed, these should work. END EDIT-2

ordered.select("login").filter($"login".contains(empLogins("username")))

and

ordered.select("login").filter($"login" in empLogins("username"))

which both throw Exception in thread "main" org.apache.spark.sql.AnalysisException, respectively:

resolved attribute(s) username#10 missing from login#8 in operator 
!Filter Contains(login#8, username#10);

and

resolved attribute(s) username#10 missing from login#8 in operator 
!Filter login#8 IN (username#10);
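The exception text points at the underlying problem: a filter expression can only reference columns of the DataFrame it runs on, and username#10 belongs to a separate, unjoined plan. Two sketches of working alternatives, assuming the same DataFrames as above (Column.isin exists since Spark 1.5; in older versions it was spelled in):

// (a) Collect the (small) second DataFrame to the driver and use isin:
val names = empLogins.select("username").collect().map(_.getString(0))
val viaIsin = ordered.filter($"login".isin(names: _*))

// (b) Stay distributed with a left semi join, which keeps the rows of
// ordered that have a match in empLogins and returns none of B's columns:
val viaSemiJoin = ordered.join(empLogins, $"login" === $"username", "left_semi")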

Solution

  1. You should broadcast a Set, instead of an Array: membership checks against a Set are constant-time rather than a linear scan (see the sketch at the end of this answer).

  2. You can make Eclipse run your Spark application. Here's how:

As pointed out on the mailing list, spark-sql assumes its classes are loaded by the primordial classloader. That's not the case in Eclipse, where the Java and Scala libraries are loaded as part of the boot classpath, while the user code and its dependencies are on another one. You can easily fix that in the launch configuration dialog:

  • remove Scala Library and Scala Compiler from the "Bootstrap" entries
  • add (as external jars) scala-reflect, scala-library and scala-compiler to the user entry.

The dialog should look like this: [screenshot of the launch configuration dialog, not reproduced here]

Edit: The Spark bug was fixed and this workaround is no longer necessary (since v. 1.4.0)
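A minimal sketch of point 1, reusing the assumed ordered DataFrame and allowed seq from the question; broadcasting the values as a Set makes each membership test constant-time inside the UDF:

import org.apache.spark.sql.functions.udf

// Broadcast a Set (constant-time contains) rather than an Array (linear scan).
val allowedSet = spark.sparkContext.broadcast(allowed.toSet)

// The UDF reads the broadcast value on each executor and checks membership.
val inAllowed = udf((login: String) => allowedSet.value.contains(login))

val viaUdf = ordered.filter(inAllowed($"login"))
viaUdf.show()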
