What's the most efficient way to filter a DataFrame


Problem description


... by checking whether a column's value is in a seq.
Perhaps I'm not explaining it very well; I basically want this (to express it using regular SQL): DF_Column IN seq?

First I did it using a broadcast var (where I placed the seq), a UDF (that did the checking) and registerTempTable.
The problem is that I didn't get to test it, since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.

I ended up creating a new DataFrame out of the seq and doing an inner join with it (intersection), but I doubt that's the most performant way of accomplishing the task.

Thanks

EDIT: (in response to @YijieShen):
How to do filter based on whether elements of one DataFrame's column are in another DF's column (like SQL select * from A where login in (select username from B))?

E.g.: First DF:

login      count
login1     192  
login2     146  
login3     72   

Second DF:

username
login2
login3
login4

The result:

login      count
login2     146  
login3     72   
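Setting Spark aside for a moment, the intended semantics of this example can be modeled with plain Scala collections (the variable names below are illustrative, not from the original post):

```scala
// Rows of the first "DataFrame": (login, count)
val ordered = Seq(("login1", 192), ("login2", 146), ("login3", 72))

// The username column of the second "DataFrame"
val usernames = Seq("login2", "login3", "login4")

// "SELECT * FROM A WHERE login IN (SELECT username FROM B)":
// keep only rows whose login appears among the usernames.
// A Set makes each membership check a constant-time lookup.
val usernameSet = usernames.toSet
val result = ordered.filter { case (login, _) => usernameSet(login) }
// result: Seq(("login2", 146), ("login3", 72))
```

This is exactly the filtering the two tables above describe: login2 and login3 survive, login1 does not, and login4 matches nothing in the first table.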

Attempts:
EDIT-2: I think, now that the bug is fixed, these should work. END EDIT-2

ordered.select("login").filter($"login".contains(empLogins("username")))

and

ordered.select("login").filter($"login" in empLogins("username"))

which both throw Exception in thread "main" org.apache.spark.sql.AnalysisException, respectively:

resolved attribute(s) username#10 missing from login#8 in operator 
!Filter Contains(login#8, username#10);

and

resolved attribute(s) username#10 missing from login#8 in operator 
!Filter login#8 IN (username#10);

Solution

  1. You should broadcast a Set instead of an Array: membership checks on a Set are constant-time hash lookups, while searching an Array is linear in its length.

  2. You can make Eclipse run your Spark application. Here's how:

As pointed out on the mailing list, spark-sql assumes its classes are loaded by the primordial classloader. That's not the case in Eclipse, where the Java and Scala libraries are loaded as part of the boot classpath, while the user code and its dependencies are in another one. You can easily fix that in the launch configuration dialog:

  • remove Scala Library and Scala Compiler from the "Bootstrap" entries
  • add (as external jars) scala-reflect, scala-library and scala-compiler to the user entry.

(The original answer included a screenshot of the launch configuration dialog here.)

Edit: The Spark bug was fixed, and this workaround is no longer necessary (since v1.4.0).
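Point 1 above can be sketched with plain Scala collections; the actual broadcast wrapper (`sc.broadcast(...)` on a live SparkContext) is omitted here, and the names are illustrative:

```scala
// The seq of allowed values, as it would be shipped to executors.
val asArray = Array("login2", "login3", "login4")
val asSet   = asArray.toSet

// Array.contains scans element by element: O(n) work per row checked.
val inArray = asArray.contains("login3")

// Set membership is a hash lookup: O(1) per row checked,
// which is what makes the broadcast-Set UDF approach fast
// when the check runs once for every row of the DataFrame.
val inSet = asSet.contains("login3")
```

Both lookups return the same answer; the difference only shows up in cost once the seq and the DataFrame get large.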
