Filtering a Pyspark DataFrame with SQL-like IN clause

Question

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')

where a is the tuple (1, 2, 3). I am getting this error:

java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found

which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is I can't manually write the values in a as it's extracted from another job.

How would I filter in this case?

Answer

The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:

# sc and sqlContext are the SparkContext / SQLContext (predefined in the pyspark shell)
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
# str.format interpolates the tuple literal ('foo', 'bar') into the query text
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2

Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.
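Since the values in the question are extracted from another job, here is a minimal sketch of the same string-formatting approach with an arbitrary tuple (assuming numeric values; strings would need quoting, and note that formatting a one-element Python tuple yields "(1,)", which is not valid SQL, so joining the values by hand is safer):

a = (1, 2, 3)  # stand-in for the values pulled from the other job
in_clause = ", ".join(str(x) for x in a)
sqlContext.sql("SELECT * FROM df WHERE k IN ({0})".format(in_clause)).count()
## 3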

In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:

from pyspark.sql.functions import col

# isin builds the IN predicate from a Python collection
df.where(col("v").isin({"foo", "bar"})).count()
## 2

It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.
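Applied to the question's scenario, a minimal sketch (assuming a is the tuple produced by the other job):

a = (1, 2, 3)  # stand-in for the values pulled from the other job
df.where(col("k").isin(list(a))).count()
## 3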
