Cross Join for calculation in Spark SQL
Question
I have a temporary view with only 1 record/value, and I want to use that value to calculate the age of the customers present in another big table (with 100 M rows). I used a CROSS JOIN clause, which is resulting in a performance issue.
Is there a better approach to implement this requirement that will perform better? Would a broadcast hint be suitable in this scenario? What is the recommended approach to tackle such scenarios?
Reference table (contains only 1 value):
create temporary view ref
as
select to_date(refdt, 'dd-MM-yyyy') as refdt --returns only 1 value
from tableA
where logtype = 'A';
Cust table (10 M rows):
custid | birthdt
A1234 | 20-03-1980
B3456 | 09-05-1985
C2356 | 15-12-1990
Query (calculate age w.r.t. birthdt):
select
a.custid,
a.birthdt,
cast((datediff(b.refdt, a.birthdt)/365.25) as int) as age
from cust a
cross join ref b;
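As a sanity check on the formula, the `datediff(...)/365.25` age calculation above can be reproduced in plain Python. This is only an illustration of the arithmetic; the birth dates are the sample `cust` rows from the question, and the reference date of 2020-01-01 is an assumed value (the actual `refdt` comes from `tableA`):

```python
from datetime import date

def age_on(refdt: date, birthdt: date) -> int:
    """Mirror Spark's cast((datediff(refdt, birthdt) / 365.25) as int):
    day difference divided by the average year length, truncated to int."""
    return int((refdt - birthdt).days / 365.25)

# Sample rows from the cust table; 2020-01-01 is an assumed reference date.
print(age_on(date(2020, 1, 1), date(1980, 3, 20)))   # A1234
print(age_on(date(2020, 1, 1), date(1985, 5, 9)))    # B3456
print(age_on(date(2020, 1, 1), date(1990, 12, 15)))  # C2356
```

Dividing by 365.25 rather than 365 keeps the result stable across leap years, at the cost of occasionally being a day off around birthdays.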
My question is: is there a better approach to implement this requirement?
Thanks
Answer
It's hard to work out exactly what your point is, but if you cannot use Scala or pyspark and dataframes with .cache etc., then instead of using a temporary view, just create a single-row table. My impression is that you are using Spark %sql in a notebook on, say, Databricks.
That is my suspicion.
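If dataframes are an option, a common pattern is to collect the single reference value to the driver and inject it as a literal, avoiding the join entirely. This is a sketch under assumptions, not part of the answer: the table and column names come from the question, and it presumes `ref` is registered as a table or view the session can read.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# ref holds exactly one row, so collecting it to the driver is cheap.
ref_dt = spark.table("ref").first()["refdt"]

# Inject the scalar as a literal: no join, no shuffle over the 100 M rows.
result = spark.table("cust").select(
    "custid",
    "birthdt",
    (F.datediff(F.lit(ref_dt), F.col("birthdt")) / 365.25)
        .cast("int")
        .alias("age"),
)
```

The trade-off is one extra driver round-trip to fetch the scalar, which is negligible for a single-row lookup.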
That said, a broadcastjoin hint may well mean the optimizer only sends out 1 row. See https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hint-framework.html#specifying-query-hints
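If you stay in pure SQL, the hint can be attached directly to the original query. A sketch, assuming a Spark version (2.2+) that honors the BROADCAST hint; with a cross join this typically resolves to a broadcast nested loop join, which is cheap when the broadcast side is a single row:

```sql
select /*+ BROADCAST(b) */
       a.custid,
       a.birthdt,
       cast((datediff(b.refdt, a.birthdt)/365.25) as int) as age
from cust a
cross join ref b;
```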