Filtering a Pyspark DataFrame with a SQL-like IN clause


Question

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * FROM my_df WHERE field1 IN a')

where a is the tuple (1, 2, 3). I am getting this error:

java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found

which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is that I can't manually write the values into a, as it's extracted from another job.

How would I filter in this case?

Answer

The string you pass to SQLContext is evaluated in the scope of the SQL environment; it doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:

df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlc.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
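One pitfall with formatting a Python tuple straight into SQL: a single-element tuple renders with a trailing comma, as in ('foo',), which is not valid SQL. A small helper can build the IN list safely; this is a minimal sketch (the in_clause helper is hypothetical, not part of Spark):

```python
# Hypothetical helper (not part of Spark): render a Python sequence as a
# SQL IN list. str(("foo",)) would yield "('foo',)" -- the trailing comma
# is invalid SQL -- so we join the elements ourselves instead.
def in_clause(values):
    return "({0})".format(", ".join(repr(v) for v in values))

# Works for one element as well as several:
query = "SELECT * FROM df WHERE v IN {0}".format(in_clause(("foo",)))
print(query)  # SELECT * FROM df WHERE v IN ('foo')
```

Note that this still concatenates values into the SQL string, so it is only appropriate for trusted inputs.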

Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.

In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:

from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2

It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.

