Apache Spark selects all rows
Question
When I use a JDBC connection to feed Spark, even though I apply a filter to the DataFrame, the query log on my Oracle data source shows Spark executing:
SELECT [column names] FROM MY_TABLE
Reference: https://stackoverflow.com/a/40870714/1941560
I was expecting Spark to lazily plan the query and execute something like:
SELECT [column names] FROM MY_TABLE WHERE [filter_predicate]
But Spark is not doing that: it fetches all the data and filters afterwards. I need the filtered behaviour because I don't want to retrieve the whole table every x minutes, only the rows that changed (incremental filtering by UPDATE_DATE).
Is there a way to achieve this?
Here is my Python code:
import datetime
import pytz

df = ...
lookup_seconds = 5 * 60
now = datetime.datetime.now(pytz.timezone("some timezone"))
max_lookup_datetime = now - datetime.timedelta(seconds=lookup_seconds)
df.where(df.UPDATE_DATE > max_lookup_datetime).explain()
Explain output:
== Physical Plan ==
*Filter (isnotnull(UPDATE_DATE#21) && (UPDATE_DATE#21 > 1516283483208806))
+- Scan ExistingRDD[NO#19,AMOUNT#20,UPDATE_DATE#21,CODE#22,AMOUNT_OLD#23]
Edit: the full answer is here
Answer
From the official documentation:
dbtable: The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
You can set the JDBC option dbtable to a subquery. For example:
# Note: the subquery below is sent to the database as a literal string,
# so the actual cutoff value must be rendered into it; the database
# cannot resolve the Python variable name max_lookup_datetime.
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "(select * from tbl where UPDATE_DATE > max_lookup_datetime) t") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
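Since the subquery is plain SQL text, the cutoff timestamp has to be formatted into the string before passing it as dbtable. A minimal sketch of building such a subquery (the table name MY_TABLE, the UPDATE_DATE column, and the Oracle TO_DATE format are assumptions for illustration):

```python
import datetime

# Compute the incremental-load cutoff: now minus the lookup window.
lookup_seconds = 5 * 60
now = datetime.datetime.now(datetime.timezone.utc)
max_lookup_datetime = now - datetime.timedelta(seconds=lookup_seconds)

# Render the cutoff as a timestamp literal the database can parse,
# then embed it in the parenthesized subquery used as "dbtable".
cutoff = max_lookup_datetime.strftime("%Y-%m-%d %H:%M:%S")
dbtable = (
    "(SELECT * FROM MY_TABLE "
    "WHERE UPDATE_DATE > TO_DATE('{}', 'YYYY-MM-DD HH24:MI:SS')) t"
).format(cutoff)
```

Passing this string via .option("dbtable", dbtable) makes the WHERE clause run on the database side, so only the changed rows are transferred to Spark.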