What's the most efficient way to filter a DataFrame
Question
... by checking whether a column's value is in a seq.
Perhaps I'm not explaining it very well; I basically want this (to express it using regular SQL): DF_Column IN seq?
First I did it using a broadcast var (where I placed the seq), a UDF (that did the checking) and registerTempTable.
The problem is that I didn't get to test it, since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.
I ended up creating a new DataFrame out of seq and doing an inner join with it (intersection), but I doubt that's the most performant way of accomplishing the task.
Thanks
EDIT (in response to @YijieShen):
How do I filter based on whether the elements of one DataFrame's column are in another DF's column (like SQL select * from A where login in (select username from B))?
E.g., first DF:
login   count
login1  192
login2  146
login3  72
Second DF:
username
login2
login3
login4
The result:
login   count
login2  146
login3  72
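The intended semantics of the two tables above can be reproduced on plain Scala collections (no Spark dependency); the object name `SemiJoinSketch` and the helper `keep` are hypothetical, and the data mirrors the two DFs:

```scala
// Plain-Scala sketch of the semi-join: keep rows of the first table
// whose login appears in the second table's username column.
object SemiJoinSketch {
  def keep(first: Seq[(String, Int)], usernames: Set[String]): Seq[(String, Int)] =
    first.filter { case (login, _) => usernames.contains(login) }

  def main(args: Array[String]): Unit = {
    val first = Seq(("login1", 192), ("login2", 146), ("login3", 72))
    val usernames = Set("login2", "login3", "login4")
    println(keep(first, usernames)) // List((login2,146), (login3,72))
  }
}
```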
Attempts:
EDIT-2: I think, now that the bug is fixed, these should work. END EDIT-2

ordered.select("login").filter($"login".contains(empLogins("username")))
and
ordered.select("login").filter($"login" in empLogins("username"))
which both throw Exception in thread "main" org.apache.spark.sql.AnalysisException, respectively:

resolved attribute(s) username#10 missing from login#8 in operator !Filter Contains(login#8, username#10);
and
resolved attribute(s) username#10 missing from login#8 in operator !Filter login#8 IN (username#10);
Solution
You should broadcast a Set instead of an Array; membership checks are much faster than a linear search.
You can make Eclipse run your Spark application. Here's how:
As pointed out on the mailing list, spark-sql assumes its classes are loaded by the primordial classloader. That's not the case in Eclipse, where the Java and Scala libraries are loaded as part of the boot classpath, while the user code and its dependencies are in another one. You can easily fix that in the launch configuration dialog:
- remove Scala Library and Scala Compiler from the "Bootstrap" entries
- add (as external jars) scala-reflect, scala-library and scala-compiler to the user entry.
The dialog should look like this:
Edit: The Spark bug was fixed and this workaround is no longer necessary (since v1.4.0).
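The Set-versus-Array point is about lookup cost: Array.contains scans linearly, while Set.contains is a hash lookup, so broadcasting a Set makes each per-row check cheaper. A minimal sketch (the object name `LookupCost` and the data sizes are illustrative):

```scala
// Both containers answer the same membership question; Set does it in
// (amortized) constant time, Array in time linear in its length.
object LookupCost {
  val asArray: Array[String] = (1 to 100000).map(i => s"login$i").toArray
  val asSet: Set[String] = asArray.toSet

  def main(args: Array[String]): Unit = {
    println(asArray.contains("login99999")) // true, after a linear scan
    println(asSet.contains("login99999"))   // true, via a hash lookup
  }
}
```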