What's the most efficient way to filter a DataFrame
Question
... by checking whether a column's value is in a seq.
Perhaps I'm not explaining it very well; I basically want this (to express it using regular SQL): DF_Column IN seq?
First I did it using a broadcast var (where I placed the seq), a UDF (that did the checking) and registerTempTable.
The problem is that I didn't get to test it, since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.
I ended up creating a new DataFrame out of seq and doing an inner join with it (intersection), but I doubt that's the most performant way of accomplishing the task.
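The join-as-intersection workaround can be illustrated with plain Scala collections (a sketch of the semantics only; the data below is hypothetical and no Spark API is involved): an inner join on equal keys keeps exactly the keys both sides share.

```scala
// Hypothetical key columns of the original DataFrame and the seq-derived one.
val dfKeys  = Seq("a", "b", "c")
val seqKeys = Seq("b", "c", "d")

// A naive nested-loop inner join on key equality:
// the surviving keys are exactly the intersection of the two sides.
val joined = for (k <- dfKeys; s <- seqKeys if k == s) yield k
// joined == Seq("b", "c")
```

The joined keys are precisely the rows the filter should keep, which is why the workaround is correct, if not necessarily fast.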
Thanks
EDIT: (in response to @YijieShen):
How to do filter based on whether elements of one DataFrame's column are in another DF's column (like SQL select * from A where login in (select username from B))?
E.g.: First DF:
login count
login1 192
login2 146
login3 72
Second DF:
username
login2
login3
login4
The result:
login count
login2 146
login3 72
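The expected result can be reproduced on plain Scala collections as a semi-join sketch (the values mirror the example tables above; this only illustrates the semantics, not the Spark API): keep a row of the first table when its login appears in the second.

```scala
// The two example tables, as plain collections.
val first  = Seq(("login1", 192), ("login2", 146), ("login3", 72))
val second = Set("login2", "login3", "login4")

// SQL's `login IN (SELECT username FROM B)` is a per-row membership test.
val result = first.filter { case (login, _) => second(login) }
// result == Seq(("login2", 146), ("login3", 72))
```

This is exactly a left semi join: rows of the first table are kept or dropped, and no columns from the second table appear in the output.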
Attempts:
EDIT-2: I think, now that the bug is fixed, these should work. END EDIT-2
ordered.select("login").filter($"login".contains(empLogins("username")))
and
ordered.select("login").filter($"login" in empLogins("username"))
which both throw Exception in thread "main" org.apache.spark.sql.AnalysisException, respectively:
resolved attribute(s) username#10 missing from login#8 in operator
!Filter Contains(login#8, username#10);
and
resolved attribute(s) username#10 missing from login#8 in operator
!Filter login#8 IN (username#10);
You should broadcast a Set instead of an Array: membership tests on a Set are much faster than a linear search.

You can make Eclipse run your Spark application. Here's how:
As pointed out on the mailing list, spark-sql assumes its classes are loaded by the primordial classloader. That's not the case in Eclipse, where the Java and Scala libraries are loaded as part of the boot classpath, while the user code and its dependencies are in another one. You can easily fix that in the launch configuration dialog:
- remove Scala Library and Scala Compiler from the "Bootstrap" entries
- add (as external jars) scala-reflect, scala-library and scala-compiler to the user entry.
The dialog should look like this (screenshot omitted).
Edit: The Spark bug was fixed and this workaround is no longer necessary (since v. 1.4.0)
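On the first suggestion above, broadcasting a Set rather than an Array, the difference is visible with plain Scala collections (a sketch of the lookup cost only, independent of Spark's broadcast mechanics; the data is made up):

```scala
// The same usernames held two ways (the size is illustrative).
val asArray = Array.tabulate(100000)(i => s"user$i")
val asSet   = asArray.toSet

// Both answer "is this login present?", but with very different costs:
val inArray = asArray.contains("user99999") // linear scan, O(n)
val inSet   = asSet.contains("user99999")   // hash lookup, expected O(1)
```

A UDF capturing the broadcast value would call the Set's contains once per row, so keeping each membership test cheap matters at DataFrame scale.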