Correlated sub query column in SPARK SQL is not allowed as part of a non-equality predicate


Problem description

I am trying to write a subquery in the WHERE clause, as shown below, but I am getting "Correlated column is not allowed in a non-equality predicate:".

SELECT *,
       holidays
FROM   (SELECT *,
               s.holidays,
               s.entity
        FROM   transit_t tt
        WHERE  (SELECT Count(thedate) AS holidays
                FROM   fact_ent_rt
                WHERE  entity = tt.awborigin
                AND    Substring(thedate, 1, 10) BETWEEN Substring(awbpickupdate, 1, 10) AND Substring(deliverydate, 1, 10)
                AND    (nholidayflag = true OR weekendflag = true))) s

Are there any issues with this query? I thought Spark > 2.0 supported subqueries in the WHERE clause. Any suggestions would be appreciated. Thanks.

The input will be the pickup date and delivery date from the transit table. We need to find whether a weekend falls between these dates (this data is available in fact_ent_rt) and take the count of holidays.

The output I am getting is:

pyspark.sql.utils.AnalysisException: u"Correlated column is not allowed in a non-equality predicate:\nAggregate

Sample input:

Input 1:

+---------+-------------------+-------------------+
|AWBOrigin|      AWBPickupDate|       DeliveryDate|
+---------+-------------------+-------------------+
|      LON|2018-09-01 08:52:00|2018-09-12 13:57:00|
|      DHA|2018-09-04 11:47:00|2018-09-08 07:30:00|
|      NIC|2009-01-01 01:47:00|2009-01-09 11:37:00|
+---------+-------------------+-------------------+

Input 2 (fact_ent):

------+-------------------+-----------+------------
Entity|            TheDate|WeekendFlag|NHolidayFlag
------+-------------------+-----------+------------
NIC|2009-01-01 00:00:00|      False|       False
NIC|2009-01-02 00:00:00|      False|       False
NIC|2009-01-03 00:00:00|       True|       False
NIC|2009-01-04 00:00:00|       True|       False
NIC|2009-01-05 00:00:00|      False|       False
NIC|2009-01-06 00:00:00|      False|       False
NIC|2009-01-07 00:00:00|      False|       False
NIC|2009-01-08 00:00:00|      False|       False
NIC|2009-01-09 00:00:00|      False|       False
NIC|2009-01-10 00:00:00|       True|       False
NIC|2009-01-11 00:00:00|       True|       False
NIC|2009-01-12 00:00:00|      False|       False
NIC|2009-01-13 00:00:00|      False|       False
NIC|2009-01-14 00:00:00|      False|       False
NIC|2009-01-15 00:00:00|      False|       False
NIC|2009-01-16 00:00:00|      False|       False
NIC|2009-01-17 00:00:00|       True|       False
NIC|2009-01-18 00:00:00|       True|       False
NIC|2009-01-19 00:00:00|      False|       False
NIC|2009-01-20 00:00:00|      False|       False
------+-------------------+-----------+------------

Expected output:

+---------+-------------------+-------------------+--------+
|AWBOrigin|      AWBPickupDate|       DeliveryDate|Holidays|
+---------+-------------------+-------------------+--------+
|      LON|2018-09-01 08:52:00|2018-09-12 13:57:00|      NA|
|      DHA|2018-09-04 11:47:00|2018-09-08 07:30:00|      NA|
|      NIC|2009-01-01 01:47:00|2009-01-09 11:37:00|       2|
+---------+-------------------+-------------------+--------+

Recommended answer

I did this in Scala, so you will need to convert it, but I think it is a far easier way. I added a key and worked at key level; you can adapt that and aggregate the key away. The principle is far simpler: no correlated subqueries are required, just relational calculus. Numbers are used for the dates, etc.

// SCALA 
// Slightly ambiguous on holidays vs. weekends; as you stated, both are treated as 1

import spark.implicits._ 
import org.apache.spark.sql.functions._

val dfE = Seq( 
              ("NIC", 1, false, false),
              ("NIC", 2, false, false),
              ("NIC", 3, true, false),
              ("NIC", 4, true, true),
              ("NIC", 5, false, false),
              ("NIC", 6, false, false),
              ("XYZ", 1, false, true)
              ).toDF("e","d","w", "h")
 //dfE.show(false)

 val dfE2 = dfE.withColumn("wh", when ($"w" or $"h", 1) otherwise (0)).drop("w").drop("h")
 //dfE2.show()

//Assuming more dfD's can exist
val dfD = Seq( 
              ("NIC", 1, 4, "k1"),
              ("NIC", 2, 3, "k2"),
              ("NIC", 1, 1, "k3"),
              ("NIC", 7, 10, "k4")
              ).toDF("e","pd","dd", "k")
//dfD.show(false)

dfE2.createOrReplaceTempView("E2")
dfD.createOrReplaceTempView("D1")

// This is done per record; if it is over identical keys, strip k and aggregate instead.
// I added k for checking each entry. The point is that this is far simpler; the key acts as a synthetic GROUP BY.

val q=sqlContext.sql(""" SELECT d1.k, d1.e, d1.pd, d1.dd, sum(e2.wh) 
                       FROM D1, E2
                      WHERE D1.e = E2.e 
                        AND E2.d >= D1.pd
                        AND E2.d <= D1.dd
                    GROUP BY d1.k, d1.e, d1.pd, d1.dd   
                    ORDER BY d1.k, d1.e, d1.pd, d1.dd
                     """)
q.show

Returns:

 +---+---+---+---+-------+
 |  k|  e| pd| dd|sum(wh)|
 +---+---+---+---+-------+
 | k1|NIC|  1|  4|      2|
 | k2|NIC|  2|  3|      1|
 | k3|NIC|  1|  1|      0|
 +---+---+---+---+-------+
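
Since the original question is in PySpark, here is a rough sketch of the same join-and-aggregate idea converted to the DataFrame API. It is untested, and the DataFrame names (dfE, dfE2, dfD) and toy values simply mirror the Scala example above; adapt them to the real transit_t / fact_ent_rt tables.

# A possible PySpark conversion of the Scala approach above (sketch only).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy calendar table: entity, day number, weekend flag, holiday flag.
dfE = spark.createDataFrame(
    [("NIC", 1, False, False), ("NIC", 2, False, False),
     ("NIC", 3, True,  False), ("NIC", 4, True,  True),
     ("NIC", 5, False, False), ("NIC", 6, False, False),
     ("XYZ", 1, False, True)],
    ["e", "d", "w", "h"])

# Collapse the weekend/holiday flags into a single 0/1 column, as in the Scala version.
dfE2 = (dfE.withColumn("wh", F.when(F.col("w") | F.col("h"), 1).otherwise(0))
           .drop("w", "h"))

# Toy transit table: entity, pickup day, delivery day, synthetic key.
dfD = spark.createDataFrame(
    [("NIC", 1, 4, "k1"), ("NIC", 2, 3, "k2"),
     ("NIC", 1, 1, "k3"), ("NIC", 7, 10, "k4")],
    ["e", "pd", "dd", "k"])

# Plain join plus aggregation instead of a correlated subquery.
result = (dfD.join(dfE2, on="e")
             .where((F.col("d") >= F.col("pd")) & (F.col("d") <= F.col("dd")))
             .groupBy("k", "e", "pd", "dd")
             .agg(F.sum("wh").alias("holidays"))
             .orderBy("k"))
result.show()

The range test sits in an ordinary join condition, so the correlated-subquery restriction never comes into play.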

I think a simple performance improvement can be made; in fact, no correlated subquery is required at all.

You can use AND E2.d BETWEEN D1.pd AND D1.dd if you want; a sketch of that variant follows.
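
A minimal sketch of that BETWEEN form, assuming the dfE2 and dfD DataFrames from the PySpark sketch above (hypothetical toy frames, not the real tables):

# Same aggregation written with BETWEEN; dfE2/dfD come from the sketch above.
dfE2.createOrReplaceTempView("E2")
dfD.createOrReplaceTempView("D1")

q = spark.sql("""
    SELECT d1.k, d1.e, d1.pd, d1.dd, SUM(e2.wh) AS holidays
    FROM   D1 d1
    JOIN   E2 e2
      ON   d1.e = e2.e
     AND   e2.d BETWEEN d1.pd AND d1.dd
    GROUP BY d1.k, d1.e, d1.pd, d1.dd
    ORDER BY d1.k, d1.e, d1.pd, d1.dd
""")
q.show()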
