如何选择一组类似的SQL [英] How to select similar sets in SQL

查看:197
本文介绍了如何选择一组类似的SQL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下表格:

Order
----
ID (pk)

OrderItem
----
OrderID (fk -> Order.ID)
ItemID (fk -> Item.ID)
Quantity

Item
----
ID (pk)

我怎么能写,可以选择所有的订单查询是至少85%相似,具体的订单

How can I write a query that can select all Orders that are at least 85% similar to a specific Order?

我认为使用杰卡德指数统计计算两个订单的相似性。 (通过利用每一组的交集的OrderItems 由每个组的联合划分的OrderItems

I considered using the Jaccard Index statistic to calculate the similarity of two Orders. (By taking the intersection of each set of OrderItems divided by the union of each set of OrderItems)

不过,我不能想办法这样做不存储计算的Jaccard指数为两个订单每一个可能的组合。 有另一种方式?

However, I can't think of a way to do so without storing the computed Jaccard Index for each possible combination of two Orders. Is there another way?

此外,有没有一种方法,包括在每一个匹配的差异订单项考虑?

Also, is there a way to include the difference in Quantity of each matched OrderItem into account?

其他信息:

订单:〜79K
的OrderItems :〜1.76米
平均。 的OrderItems 订单:21.5
产品:〜13K

Total Orders: ~79k
Total OrderItems: ~1.76m
Avg. OrderItems per Order: 21.5
Total Items: ~13k

注意

85%的相似度是什么,客户实际需要的只是一个最好的猜测,它可能会在未来改变。一个解决方案,适用于任何的相似性将是preferable。

The 85% similarity number is just a best guess at what the customer actually needs, it may change in the future. A solution that works for any similarity would be preferable.

推荐答案

您指定

我怎么能写,可以选择所有至少有订单查询   85%相似,特定的顺序?

How can I write a query that can select all orders that are at least 85% similar to a specific order?

这是所有对订单比较重要的简化 中至少85%的彼此相似。

This is an important simplification compared with 'all pairs of orders that are at least 85% similar to each other'.

我们将使用一些TDQD(测试驱动的查询设计)和一些分析,以帮助我们。

We'll use some TDQD (Test-Driven Query Design) and some analysis to help us.

要进行远程相似,两个命令必须在至少一个项 共同。该查询可以被用来确定哪些订单具有至少 一个共同的项目有一定的顺序:

To be remotely similar, the two orders must have at least one item in common. This query can be used to determine which orders have at least one item in common with a specified order:

SELECT DISTINCT I1.OrderID AS ID
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>

该修剪的其他命令的列表,检查了不少,虽然 如果按照指定的顺序包含您最热门的项目之一,它的 可能有很多其他的订单也这样做了。

This prunes the list of other orders to be examined quite a lot, though if the specified order included one of your most popular items, it's likely that a lot of other orders also did so.

相反的独特的,你可以使用:

Instead of the DISTINCT, you could use:

SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>
 GROUP BY I1.OrderID

这让你的项目数的顺序,它有共同点 使用指定的顺序。我们还需要在各个项目的数量 顺序:

This gives you the number of items in an order that it has in common with the specified order. We also need the number of items in each order:

SELECT OrderID AS ID, COUNT(*) AS Num_Total
  FROM OrderItem
 GROUP BY OrderID;

相同的订单

有关100%的相似性,两个命令将具有在共同的尽可能多的项目 因为每个人都有项目。这可能不会发现许多双的订单, 虽然。我们可以找到完全一样的项目的订单 指定的顺序很轻松地:

Identical Orders

For 100% similarity, the two orders would have as many items in common as each has items. This would probably not find many pairs of orders, though. We can find the orders with exactly the same items as the specified order easily enough:

SELECT L1.ID
  FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         GROUP BY OrderID
       ) AS L1
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common;

编辑:的这原来不被严格就够了;供的命令是相同的,按照指定的顺序的项目数也必须与在常见的数目:

This turns out not to be stringent enough; for the orders to be identical, the number of items in the specified order must also be the same as the number in common:

SELECT L1.ID, L1.Num_Total, L2.ID, L2.Num_Common, L3.ID, L3.Num_Total
  FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         GROUP BY OrderID
       ) AS L1
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common
  JOIN (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS L3 ON L2.Num_Common = L3.Num_Total;

类似的法令 - 分析公式

运用杰卡德相似 截至维基百科定义为两个数量级A和B,用| A |作为计数 项目为了在数量时,杰卡德相似的 J(下A,B)= |A∩B| ÷|A∪B| 的,其中|A∩B|是在共同的项目数 这两个命令和|A∪B|是不同的项目的总数 订购。

Similar Orders — Analyzing the Formula

Applying the Jaccard Similarity as defined at Wikipedia to two orders A and B, with |A| being the count of the number of items in order A, the Jaccard Similarity J(A,B) = |A∩B| ÷ |A∪B|, where |A∩B| is the number of items in common to the two orders and |A∪B| is the total number of different items ordered.

要达到85%杰卡德相似准则,如果项目的数量 任一次序小于某个阈值,则命令必须相同。 例如,如果两个定单A和B有5个项目,比方说,但是有一个 项目两者之间的不同,它为您提供了4个项目的共同点(|A∩B|) 和6项总(|A∪B|),所以杰卡德相似J(下A,B)仅 66⅔%。

To meet an 85% Jaccard Similarity criterion, if the number of items in either order is less than some threshold, the orders must be identical. For example, if both orders A and B have 5 items, say, but there's one item different between the two, it gives you 4 items in common (|A∩B|) and 6 items in total (|A∪B|), so the Jaccard Similarity J(A,B) is only 66⅔%.

有关85%的相似性时,有在各两个命令N个项目和 1个项目不同的是,(N-1)÷(N + 1)≥0.85 的,这意味着N> 12 (12⅓为precise)。对于一小部​​分F = J(下A,B),一个项目的不同 手段的(N-1)÷(N + 1)≥˚F的可用于求解N给出的ñ≥(1 + F)÷(1 - F)的。作为相似需求上升时,订单 必须是相同的N个越来越大的价值。

For 85% similarity when there are N items in each of the two orders and 1 item is different, (N-1) ÷ (N+1) ≥ 0.85, which means N > 12 (12⅓ to be precise). For a fraction F = J(A,B), one item different means (N-1) ÷ (N+1) ≥ F which can be solved for N giving N ≥ (1 + F) ÷ (1 - F). As the similarity requirement goes up, the orders must be identical for increasingly large values of N.

泛化进一步,假设我们有不同的大小订单 与N和M项目(不包括的一般性,N&LT损失; M)。最大 A∩B|价值|现N和的最小值|A∪B|为M(意 在较小的订单中所有的项目出现在更大的订单)。让我们 定义M = N +Δ,并且有∂项present在 为了更小的未$在更大的阶数p $ psent。它遵循 有Δ+∂项present在较大的顺序中不存在的 更小的订单。

Generalizing still further, let's suppose we have different size orders with N and M items (without loss of generality, N < M). The maximum value of |A∩B| is now N and the minimum value of |A∪B| is M (meaning all the items in the smaller order appear in the larger order). Let's define that M = N + ∆, and that there are ∂ items present in the smaller order that are not present in the larger order. It follows that there are ∆+∂ items present in the larger order that are not in the smaller order.

根据定义,那么,|A∩B| = N-∂,且|A∪B| =(N-∂)+∂+ (N +Δ-(N-∂)),其中三个加入术语重新present的(1)的数量 两个命令间的共同项目,(2)仅在项目数 较小的顺序,和(3)仅在较大的顺序的项目数。 这简化为:|A∪B| = N +Δ+∂。

By definition, then, |A∩B| = N-∂, and |A∪B| = (N-∂) + ∂ + (N+∆-(N-∂)), where the three added terms represent (1) the number of items in common between the two orders, (2) the number of items only in the smaller order, and (3) the number of items only in the larger order. This simplifies to: |A∪B| = N+∆+∂.

有关的相似部分女,我们感兴趣的是双的订单,其中的 J(下A,B)≥楼这样:

For a similarity fraction F, we're interested in pairs of orders where J(A,B) ≥ F, so:

(N-∂)÷(N +Δ+∂)≥˚F

(N-∂) ÷ (N+∆+∂) ≥ F

F≤(N-∂)÷(N +Δ+∂)

F ≤ (N-∂) ÷ (N+∆+∂)

我们可以使用作为preadsheet来绘制它们之间的关系。为一个 定在较小顺序(x轴)的项目的数目,以及对于给定的 相似性,我们可以绘制∂,它们给了我们一个最大值 F的相似度计算公式为:


We can use a spreadsheet to graph the relationship between these. For a given number of items in the smaller order (x-axis), and for a given similarity, we can graph the maximum value of ∂ that gives us a similarity of F. The formula is:

∂=(N(1-F) - FΔ)÷(1 + F)

∂ = (N(1-F) - F∆) ÷ (1+F)

这是在N和Δ为常数F线性方程;它是非线性的 对于F的不同值显然,∂必须是一个非负 整数。

This is a linear equation in N and ∆ for constant F; it is non-linear for different values of F. Clearly, ∂ has to be a non-negative integer.

鉴于F = 0.85,对于订单的大小相同(Δ= 0),其中1≤N'其中; 13,∂= 0; 13≤N'LT; 25,∂≤1; 25≤N'LT; 37,∂≤2, 37≤N'LT; 50,∂≤3。

Given F = 0.85, for orders that are the same size (∆=0), for 1 ≤ N < 13, ∂ = 0; for 13 ≤ N < 25, ∂ ≤ 1; for 25 ≤ N < 37, ∂ ≤ 2, for 37 ≤ N < 50, ∂ ≤ 3.

有关订单相差1(Δ= 1),对于1≤N'其中; 18,∂= 0; 18 ≤N'LT; 31,∂≤1; 31≤N'LT; 43,∂≤2;等等。如果Δ= 6,你 需要N = 47前的订单仍在85%的相似与∂= 1。那 指的小订单有47个项目,其中46个是共同的 大订单的53项。

For orders that differ by 1 (∆=1), for 1 ≤ N < 18, ∂ = 0; for 18 ≤ N < 31, ∂ ≤ 1; for 31 ≤ N < 43, ∂ ≤ 2; etc. If ∆=6, you need N=47 before the orders are still 85% similar with ∂=1. That means the small order has 47 items, of which 46 are in common with the large order of 53 items.

到目前为止,一切都很好。我们如何运用这一理论在选择订单 类似指定的顺序?

So far, so good. How can we apply that theory to selecting the orders similar to a specified order?

首先,我们观察到指定的顺序可能是相同的尺寸作为 类似的命令,或更大,或更小。这种复杂的事情了一下。

First, we observe that the specified order could be the same size as a similar order, or larger, or smaller. This complicates things a bit.

等式的参数以上是:

  • N - 项目在较小的订单号
  • Δ - 在更大的订单和N项目的数量差异
  • F - 固定
  • ∂ - 在更大的订单不匹配的小订单的产品数量

可以使用在顶部开发的查询微小变化的值:

The values available using minor variations on the queries developed at the top:

  • NC - 共同项目的数量
  • NA - 项目中指定的顺序号
  • NB - 项目的比较顺序号

相应的查询:

SELECT OrderID AS ID, COUNT(*) AS NA
  FROM OrderItem
 WHERE OrderID = <specified order ID>
 GROUP BY OrderID;

SELECT OrderID AS ID, COUNT(*) AS NB
  FROM OrderItem
 WHERE OrderID != <specified order ID>
 GROUP BY OrderID;

SELECT I1.OrderID AS ID, COUNT(*) AS NC
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>
 GROUP BY I1.OrderID

为了方便起见,我们想要的值N和N +Δ(因此Δ)可用,所以 我们可以使用UNION安排适当的事情,有:

For convenience, we want the values N and N+∆ (and hence ∆) available, so we can use a UNION to arrange things appropriately, with:

  • 在NS = N - 项目在较小的订单号
  • 在NL = N +Δ - 项目在更大的订单号

和在所述第二版本的联合查询的有:

and in the second version of the UNION query, with:

  • 在NC​​ = N - ∂ - 共同
  • 项目数量

两个查询保持两个订单ID号,以便您可以跟踪回 的订单信息后,其余部分。

Both queries keep the two order ID numbers so that you can track back to the rest of the order information later.

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA <= v2.NB
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA > v2.NB

这给了我们一张桌子前pression与列OrderID_1,NS,OrderID_2, NL,其中NS为项中的'小订单的数量和NL是 在大订单的产品数量。因为没有重叠在 由v1和v2表EX pressions生成的订单号,有没有 需要担心的反身性的条目,其中的OrderID值是 相同。添加NC这个是最容易在UNION查询过处理的:

This gives us a table expression with columns OrderID_1, NS, OrderID_2, NL, where NS is the number of items in the 'smaller order and NL is the number of items in the larger order. Since there is no overlap in the order numbers generated by the v1 and v2 table expressions, there's no need to worry about 'reflexive' entries where the OrderID values are the same. Adding NC to this is most easily handled in the UNION query too:

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA <= v2.NB
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS v3
    ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA > v2.NB
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS v3
    ON v3.ID = v1.ID

这给了我们一张桌子前pression与列OrderID_1,NS,OrderID_2, NL,NC,其中NS是项目中的小订单的数量和NL是 在较大的顺序的项目数,和NC是项目数 在常见的。

This gives us a table expression with columns OrderID_1, NS, OrderID_2, NL, NC, where NS is the number of items in the 'smaller order and NL is the number of items in the larger order, and NC is the number of items in common.

由于NS,NL,NC,我们正在寻找能够满足订单:

Given NS, NL, NC, we are looking for orders that satisfy:

(N-∂)÷(N +Δ+∂)≥F。

(N-∂) ÷ (N+∆+∂) ≥ F.

  • N - 项目在较小的订单号
  • Δ - 在更大的订单和N项目的数量差异
  • F - 固定
  • ∂ - 在更大的订单不匹配的小订单的产品数量

  • N – number of items in smaller order
  • ∆ — difference between number of items in larger order and N
  • F — fixed
  • ∂ — number of items in smaller order not matched in larger order

    NS = N - 项目在较小的订单号

    NS = N — number of items in smaller order

    的条件,因此,需要:

    NC / (NL + (NS - NC)) ≥ F
    

    在LHS的术语必须被评价为浮点数,而不是作为 一个整数EX pression。上面的应用,为联合查询,收益率:

    The term on the LHS must be evaluated as a floating point number, not as an integer expression. Applying that to the UNION query above, yields:

    SELECT OrderID_1, NS, OrderID_2, NL, NC,
            CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
      FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
              FROM (SELECT OrderID AS ID, COUNT(*) AS NA
                      FROM OrderItem
                     WHERE OrderID = <specified order ID>
                     GROUP BY OrderID
                   ) AS v1
              JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
                      FROM OrderItem
                     WHERE OrderID != <specified order ID>
                     GROUP BY OrderID
                   ) AS v2
                ON v1.NA <= v2.NB
              JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
                      FROM OrderItem AS I1
                      JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
                     WHERE I1.OrderID != <specified order ID>
                     GROUP BY I1.OrderID
                   ) AS v3
                ON v3.ID = v2.ID
            UNION
            SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
              FROM (SELECT OrderID AS ID, COUNT(*) AS NA
                      FROM OrderItem
                     WHERE OrderID = <specified order ID>
                     GROUP BY OrderID
                   ) AS v1
              JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
                      FROM OrderItem
                     WHERE OrderID != <specified order ID>
                     GROUP BY OrderID
                   ) AS v2
                ON v1.NA > v2.NB
              JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
                      FROM OrderItem AS I1
                      JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
                     WHERE I1.OrderID != <specified order ID>
                     GROUP BY I1.OrderID
                   ) AS v3
                ON v3.ID = v1.ID
           ) AS u
     WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
    

    您可能会注意到,这个查询只使用了订单项表;该 不需要订单和项目表。

    You might observe that this query only uses the OrderItem table; the Order and Item tables are not needed.

    警告:部分测试SQL(警告讲师)。现在,上面的SQL似乎产生的微小的数据集可信的答案。我调整了相似的要求(0.25,则0.55),并得到了合理的价值和适当的选择性。然而,我的测试数据有但8个项目中最大的顺序,当然也未覆盖所述数据的全部范围。因为我最经常使用的数据库管理系统不支持热膨​​胀系数,下面的SQL是未经检验的。不过,我中度信心,除非我犯了大错,热膨胀系数$ C $下1版(有很多指定的顺序号的重复)应保持清洁。我认为,第2版可能也没关系,但是......它是未经检验的。

    Warning: partially tested SQL (caveat lector). The SQL above now seems to produce plausible answers on minuscule data sets. I adjusted the similarity requirement (0.25, then 0.55) and got plausible values and appropriate selectivity. However, my test data had but 8 items in the biggest order, and certainly wasn't covering the full scope of the described data. Since the DBMS I use most frequently does not support CTEs, the SQL below is untested. However, I am moderately confident that unless I made a big mistake, the CTE code in version 1 (with lots of repetition of the specified order ID) should be clean. I think version 2 may be OK too, but...it is untested.

    有可能是前$ P $的pssing查询,可能使用更紧凑的方式 在OLAP功能。

    There may be more compact ways of expressing the query, possibly using the OLAP functions.

    如果我要测试这一点,我想创建一个表有几个 再presentative套订单的产品,检查相似性度量 返回是明智的。我的工作查询或多或少如图所示, 逐步建立了复杂的查询。如果EX pressions之一是 证明是有缺陷的,那么我会做在该查询适当的调整 直到这个安全漏洞是固定的。

    If I was going to test this, I'd create a table with a few representative sets of order items, checking that the similarity measure returned was sensible. I'd work the queries more or less as shown, gradually building up the complex query. If one of the expressions was shown to be flawed, then I'd make appropriate adjustments in that query until the flaw was fixed.

    显然,性能将是一个问题。最里面的查询不 可怕复杂,但它们不是wholy微不足道的。然而,测量 将显示是否这是一个巨大的问题,或者只是一个滋扰。留学 查询计划可能会有所帮助。它似乎很可能,应该有 在OrderItem.OrderID指数;查询不大可能表现良好 如果没有这样的索引。即不太可能是因为它的一个问题 是一个外键列。

    Clearly, performance will be an issue. The innermost queries are not dreadfully complex, but they aren't wholy trivial. However, measurement will show whether it's a dramatic problem or just a nuisance. Studying the query plans may help. It seems very probable that there should be an index on OrderItem.OrderID; the queries are unlikely to perform well if there isn't such an index. That is unlikely to be a problem since it is a foreign key column.

    您可能会得到一些好处了使用WITH子句(公用表前pressions的)。他们会作出明确的重复是隐含在UNION子查询的两半。

    You might get some benefit out of using 'WITH clauses' (Common Table Expressions). They would make explicit the repetition that is implicit in the two halves of the UNION sub-query.

    使用公用表EX pressions澄清,以优化时, EX pressions是相同的,并且可以帮助它更好的表现。他们也 帮助人们阅读您的查询。查询上述操作,而讨要 利用热膨胀系数的。

    Using common table expressions clarifies to the optimizer when expressions are the same, and may help it perform better. They also help the humans reading your query. The query above does rather beg for the use of CTEs.

    版本1:重复按照指定的顺序号

    Version 1: Repeating the specified order number

    WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)
                  FROM OrderItem
                 WHERE OrderID = <specified order ID>
                 GROUP BY OrderID
               ),
         OO AS (SELECT OrderID AS ID, COUNT(*) AS NB       -- Other orders (OO)
                  FROM OrderItem
                 WHERE OrderID != <specified order ID>
                 GROUP BY OrderID
               ),
         CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC    -- Common Items (CI)
                  FROM OrderItem AS I1
                  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
                 WHERE I1.OrderID != <specified order ID>
                 GROUP BY I1.OrderID
               )
    SELECT OrderID_1, NS, OrderID_2, NL, NC,
            CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
      FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
              FROM SO AS v1
              JOIN OO AS v2 ON v1.NA <= v2.NB
              JOIN CI AS v3 ON v3.ID  = v2.ID
            UNION
            SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
              FROM SO AS v1
              JOIN OO AS v2 ON v1.NA  > v2.NB
              JOIN CI AS v3 ON v3.ID  = v1.ID
           ) AS u
     WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
    

    2版:避免重复按照指定的顺序号

    Version 2: Avoiding repeating the specified order number

    WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)
                  FROM OrderItem
                 WHERE OrderID = <specified order ID>
                 GROUP BY OrderID
               ),
         OO AS (SELECT OI.OrderID AS ID, COUNT(*) AS NB    -- Other orders (OO)
                  FROM OrderItem AS OI
                  JOIN SO ON OI.OrderID != SO.ID
                 GROUP BY OI.OrderID
               ),
         CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC    -- Common Items (CI)
                  FROM OrderItem AS I1
                  JOIN SO AS S1 ON I1.OrderID != S1.ID
                  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID
                  JOIN SO AS S2 ON I2.OrderID  = S2.ID
                 GROUP BY I1.OrderID
               )
    SELECT OrderID_1, NS, OrderID_2, NL, NC,
            CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
      FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
              FROM SO AS v1
              JOIN OO AS v2 ON v1.NA <= v2.NB
              JOIN CI AS v3 ON v3.ID  = v2.ID
            UNION
            SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
              FROM SO AS v1
              JOIN OO AS v2 ON v1.NA  > v2.NB
              JOIN CI AS v3 ON v3.ID  = v1.ID
           ) AS u
     WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F
    

    这些都不是很容易看懂;两者都更容易比写出全面的热膨胀系数大的选择。

    Neither of these is an easy read; both are easier than the big SELECT with the CTEs written out in full.

    这是不够好的测试。它提供了一个小点点的信心(它确实表现出了问题,与同一顺序查询。

    This is inadequate for good testing. It gives a small modicum of confidence (and it did show up the problem with the 'identical order' query.

    CREATE TABLE Order (ID SERIAL NOT NULL PRIMARY KEY);
    CREATE TABLE Item  (ID SERIAL NOT NULL PRIMARY KEY);
    CREATE TABLE OrderItem
    (
        OrderID INTEGER NOT NULL REFERENCES Order,
        ItemID INTEGER NOT NULL REFERENCES Item,
        Quantity DECIMAL(8,2) NOT NULL
    );
    
    INSERT INTO Order VALUES(1);
    INSERT INTO Order VALUES(2);
    INSERT INTO Order VALUES(3);
    INSERT INTO Order VALUES(4);
    INSERT INTO Order VALUES(5);
    INSERT INTO Order VALUES(6);
    INSERT INTO Order VALUES(7);
    
    INSERT INTO Item VALUES(111);
    INSERT INTO Item VALUES(222);
    INSERT INTO Item VALUES(333);
    INSERT INTO Item VALUES(444);
    INSERT INTO Item VALUES(555);
    INSERT INTO Item VALUES(666);
    INSERT INTO Item VALUES(777);
    INSERT INTO Item VALUES(888);
    INSERT INTO Item VALUES(999);
    
    INSERT INTO OrderItem VALUES(1, 111, 1);
    INSERT INTO OrderItem VALUES(1, 222, 1);
    INSERT INTO OrderItem VALUES(1, 333, 1);
    INSERT INTO OrderItem VALUES(1, 555, 1);
    
    INSERT INTO OrderItem VALUES(2, 111, 1);
    INSERT INTO OrderItem VALUES(2, 222, 1);
    INSERT INTO OrderItem VALUES(2, 333, 1);
    INSERT INTO OrderItem VALUES(2, 555, 1);
    
    INSERT INTO OrderItem VALUES(3, 111, 1);
    INSERT INTO OrderItem VALUES(3, 222, 1);
    INSERT INTO OrderItem VALUES(3, 333, 1);
    INSERT INTO OrderItem VALUES(3, 444, 1);
    INSERT INTO OrderItem VALUES(3, 555, 1);
    INSERT INTO OrderItem VALUES(3, 666, 1);
    
    INSERT INTO OrderItem VALUES(4, 111, 1);
    INSERT INTO OrderItem VALUES(4, 222, 1);
    INSERT INTO OrderItem VALUES(4, 333, 1);
    INSERT INTO OrderItem VALUES(4, 444, 1);
    INSERT INTO OrderItem VALUES(4, 555, 1);
    INSERT INTO OrderItem VALUES(4, 777, 1);
    
    INSERT INTO OrderItem VALUES(5, 111, 1);
    INSERT INTO OrderItem VALUES(5, 222, 1);
    INSERT INTO OrderItem VALUES(5, 333, 1);
    INSERT INTO OrderItem VALUES(5, 444, 1);
    INSERT INTO OrderItem VALUES(5, 555, 1);
    INSERT INTO OrderItem VALUES(5, 777, 1);
    INSERT INTO OrderItem VALUES(5, 999, 1);
    
    INSERT INTO OrderItem VALUES(6, 111, 1);
    INSERT INTO OrderItem VALUES(6, 222, 1);
    INSERT INTO OrderItem VALUES(6, 333, 1);
    INSERT INTO OrderItem VALUES(6, 444, 1);
    INSERT INTO OrderItem VALUES(6, 555, 1);
    INSERT INTO OrderItem VALUES(6, 777, 1);
    INSERT INTO OrderItem VALUES(6, 888, 1);
    INSERT INTO OrderItem VALUES(6, 999, 1);
    
    INSERT INTO OrderItem VALUES(7, 111, 1);
    INSERT INTO OrderItem VALUES(7, 222, 1);
    INSERT INTO OrderItem VALUES(7, 333, 1);
    INSERT INTO OrderItem VALUES(7, 444, 1);
    INSERT INTO OrderItem VALUES(7, 555, 1);
    INSERT INTO OrderItem VALUES(7, 777, 1);
    INSERT INTO OrderItem VALUES(7, 888, 1);
    INSERT INTO OrderItem VALUES(7, 999, 1);
    INSERT INTO OrderItem VALUES(7, 666, 1);
    

    这篇关于如何选择一组类似的SQL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆