在SQL组合优化匹配 [英] Combinatorial Optimization Matching in SQL

查看:182
本文介绍了在SQL组合优化匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我无法开发SQL中的匹配算法。我有一个表科目。所有这些需求进行匹配,以相同的行数表中的控件(为了这个问题,让我们说两行或控件需要分别选择主题)。选定控件的位置必须完全匹配,并选择了控制应在 match_field 的值是尽可能接近拍摄对象。

I am having trouble developing a matching algorithm in SQL. I have one table subjects. Each of these needs to be matched to the same number of rows in the table controls (for the sake of this question let's say two rows or controls need to be selected for each subject). The location of the selected controls must match exactly, and the controls selected should have a value in match_field that is as close as possible to the subjects.

下面是一些样本数据:

表科目:

id   location    match_field
1    1           190
2    2           2000
3    1           100

表控件:

id   location    match_field
17    1          70
11    1          180
12    1          220
13    1          240
14    1          500
15    1          600
16    1          600
10    2          30
78    2          1840
79    2          2250

下面将是从样本数据的最佳结果:

Here would be the optimum result from the sample data:

subject_id control_id  location    match_field_diff
1          12          1           30
1          13          1           50
2          78          2           160
2          79          2           250
3          17          1           30
3          11          1           80

它变得非常棘手,因为,​​例如,控制11最匹配拍摄对象1。然而,在最佳的解决方案控制11匹配到科目三。

It gets tricky, because, for example, control 11 is the closest match to subject 1. However, in the optimum solution control 11 is matched to subject 3.

我相信匈牙利算法是接近正确的解决了这个问题。然而,有不相等数量的受试者和控制,也不会所有控制被使用(我有几千受试者和几百万潜在​​控制)。

I believe the Hungarian Algorithm is close to the "correct" solution to this problem. However, there is not an equal number of subjects and controls, nor will all controls be used (I have a few thousand subjects and a few million potential controls).

有没有必要以获得绝对最佳的结果;一个pretty的良好近似就可以了我。

It is not necessary to obtain the absolute optimum results; a pretty good approximation would be fine with me.

这似乎应该有一个很好的集合为基础的解决了这个问题,但我想不出如何做到这一点。下面是一些code,它赋予了相同数量的控制,以基于位置的每个主题只有:

It seems that there should be a nice set-based solution to this problem, but I can't think of how to do it. Here is some code that assigns an equal number of controls to each subject based on location only:

select * from (
    select   subject.id, 
             control.id,
             subject.location,
             row_number() over (
                 partition by subject.location
                 order by subject.id, control.id
             ) as rn,
             count(distinct control.id)     over (
                 partition by subject.location
             ) as controls_in_loc
         from subjects
         join controls on control.location = subject.location
    )
    where mod(rn,controls_in_loc+1) = 1

不过,我无法弄清楚如何添加模糊匹配组件。我使用DB2,但可以,如果你使用的是别的东西转换算法到DB2。

However, I can't figure out how to add the fuzzy matching component. I am using DB2 but can convert an algorithm into DB2 if you are using something else.

在此先感谢您的帮助!

更新:我大多相信,SQL是不正确的工具,这项工作。然而,只是要确定(因为它是一个有趣的问题),我提供赏金,看是否有工作SQL的解决方案是可能的。它需要一组为基础的解决方案。它可以使用循环(循环在同一个查询中多次取得的结果),但迭代次数需要远远大于行的一大桌子的数量较少。它不应该遍历表或使用光标的每个元素。

Update: I am mostly convinced that SQL is not the right tool for this job. However, just to be sure (and because it is an interesting problem), I am offering a bounty to see if a working SQL solution is possible. It needs to be a set-based solution. It can use iteration (looping over the same query multiple times to achieve the result) but number of iterations needs to be far less than the number of rows for a large table. It should not loop over each element in the table or use cursors.

推荐答案

虽然匈牙利算法是去工作,一个更简单的算法,可以在你的情况下使用。您的隐含成本矩阵是一种特殊形式的对称矩阵

Although the Hungarian Algorithm is going to work, a much simpler algorithm can be used in your case. Your implicit cost matrix is a symmetric matrix of a special form:

ABS(SUBJ.match_field-CTRL.match_field)

因此​​,可以相对容易地证明,在最优分配的 {SUBJ <子>我,CTRL <子>Ĵ} 的订购 SUBJ.match_field CTRL.match_field 将责令也是如此。

Therefore, you can relatively easily prove that in an optimal assignment {SUBJi, CTRLj} ordered by SUBJ.match_field the values of CTRL.match_field will be ordered as well.

证明: 考虑转让的 {SUBJ <子>我,CTRL <子>Ĵ} 的通过 SUBJ.match_field 不是由 CTRL.match_field 有序排列。然后,你必须至少有一个反转的,即一对指派的 {SUBJ <子> I1 ,CTRL <子> J1 } {SUBJ <子> I2 ,CTRL <子> J2 } 的这样

Proof: Consider an assignment {SUBJi, CTRLj} ordered by SUBJ.match_field that is not ordered by CTRL.match_field. Then you have at least one inversion, i.e. a pair of assignments {SUBJi1, CTRLj1} and {SUBJi2, CTRLj2} such that

SUBJ.match_field <子> I1 &LT; SUBJ.match_field <子> I2 ,而

CTRL.match_field <子> J1 > CTRL.match_field <子> J2

然后你可以用一个非反相替换倒对

Then you can replace the inverted pair with a non-inverted one

{SUBJ <子> I1 ,CTRL <子> J2 } {SUBJ <子> I2 ,CTRL <子> J1 }

{SUBJi1, CTRLj2} and {SUBJi2, CTRLj1}

成本小于或等于倒分配的 SUBJ.match_field 的所有六个相对配售(<子> I1 的费用,<子> I2 )和 CTRL.match_field (<子> J1 ,<子> J2 )(<一href="http://www.wolframalpha.com/input/?i=b%3Ea,B%3CA,abs%28a-A%29%2babs%28b-B%29%3E=abs%28a-B%29%2babs%28b-A%29"相对=nofollow>链接到Wolfram Alpha的)。 :证明

of a cost that is less than or equal to the cost of the inverted assignment for all six relative placements of SUBJ.match_field(i1, i2) and CTRL.match_field(j1, j2) (link to Wolfram Alpha). :Proof

通过这个观察在手,很容易证明动态规划算法下面来与最优分配:

With this observation in hand, it is easy to prove that the dynamic programming algorithm below comes up with the optimal assignment:

  • N 各学科的复印件;为了通过 match_field
  • 将订单控制 match_field
  • prepare空数组任务尺寸 N * subject.SIZE
  • prepare一个空二维数组存储尺寸 N * subject.SIZE control.SIZE 记忆化 ;将所有元素 1
  • 呼叫 Recursive_Assign 下面
  • 伪code定义
  • 任务表中现在包括每个科目 N 分配我的位置> N *我,包容和 N *(I + 1),排他的。
  • Make N duplicates of each subject; order by match_field
  • Order controls by match_field
  • Prepare an empty array assignments of size N * subject.SIZE
  • Prepare an empty 2D array mem of size N * subject.SIZE by control.SIZE for memoization; set all elements to -1
  • Call Recursive_Assign defined in pseudocode below
  • The assignments table now contains N assignments for each subject i at positions between N*i, inclusive, and N*(i+1), exclusive.
FUNCTION Recursive_Assign
    // subjects contains each original subj repeated N times
    PARAM subjects : array of int[subjectLength]
    PARAM controls: array of int[controlLength]
    PARAM mem : array of int[subjectLength,controlLength]
    PARAM sp : int // current subject position
    PARAM cp : int // current control position
    PARAM assign : array of int[subjectLength]
BEGIN
    IF sp == subjects.Length THEN RETURN 0 ENDIF
    IF mem[sp, cp] > 0 THEN RETURN mem[sp, cp] ENDIF
    int res = ABS(subjects[sp] - controls[cp])
            + Recursive_Assign(subjects, controls, mem, sp + 1, cp + 1, assign)
    assign[sp] = cp
    IF cp+1+subjects.Length-sp < controls.Length THEN
        int alt = Recursive_Assign(subjects, controls, mem, sp, cp + 1, assign)
        IF alt < res THEN
            res = alt
        ELSE
            assign[sp] = cp
        ENDIF
    ENDIF
    RETURN (mem[sp, cp] = res)
END


下面是使用C#在ideone上面的伪code的实现。


Here is an implementation of the above pseudocode using C# on ideone.

此算法准备好被重新写为设定基于SQL中。试图将它变成原来的问题设置(与分组通过位置,使拍摄对象的多个副本)将增加复杂性不必要的层一个过程,已​​经是相当复杂的,所以我准备了不少使用表把事情简单化SQL服务器 - 值参数。我不知道,如果DB2提供了类似的功能,但如果没有,你应该能够将它们与临时表代替。

This algorithm is ready to be re-written as set-based in SQL. Trying to fit it into the original problem setting (with grouping by locations and making multiple copies of the subject) would add unnecessary layer of complexity to a procedure that is already rather complex, so I am going to simplify things quite a bit by using table-valued parameters of SQL Server. I am not sure if DB2 provides similar capabilities, but if it does not, you should be able to replace them with temporary tables.

下面的存储过程是上述伪code到SQL Server的语法为存储过程几乎直接转录:

The stored procedure below is a nearly direct transcription of the above pseudocode into SQL Server's syntax for stored procedures:

CREATE TYPE SubjTableType AS TABLE (row int, id int, match_field int)
CREATE TYPE ControlTableType AS TABLE (row int, id int, match_field int)
CREATE PROCEDURE RecAssign (
    @subjects SubjTableType READONLY
,   @controls ControlTableType READONLY
,   @sp int
,   @cp int
,   @subjCount int
,   @ctrlCount int
) AS BEGIN
    IF @sp = @subjCount BEGIN
        RETURN 0
    END
    IF 1 = (SELECT COUNT(1) FROM #MemoTable WHERE sRow=@sp AND cRow=@cp) BEGIN
        RETURN (SELECT best FROM #MemoTable WHERE sRow=@sp AND cRow=@cp)
    END
    DECLARE @res int, @spNext int, @cpNext int, @prelim int, @alt int, @diff int, @sId int, @cId int
    SET @spNext = @sp + 1
    SET @cpNext = @cp + 1
    SET @sId = (SELECT id FROM @subjects WHERE row = @sp)
    SET @cId = (SELECT id FROM @controls WHERE row = @cp)
    EXEC @prelim = RecAssign @subjects=@subjects, @controls=@controls, @sp=@spNext, @cp=@cpNext, @subjCount=@subjCount, @ctrlCount=@ctrlCount
    SET @diff = ABS((SELECT match_field FROM @subjects WHERE row=@sp)-(SELECT match_field FROM @controls WHERE row=@cp))
    SET @res = @prelim + @diff
    IF 1 = (SELECT COUNT(1) FROM #Assignments WHERE sRow=@sp) BEGIN
        UPDATE #Assignments SET cId=@cId, sId=@sId, diff=@diff WHERE sRow=@sp
    END
    ELSE BEGIN
        INSERT INTO #Assignments(sRow, sId, cId, diff) VALUES (@sp, @sId, @cId, @diff)
    END
    IF @cp+1+@subjCount-@sp < @ctrlCount BEGIN
        EXEC @alt = RecAssign @subjects=@subjects, @controls=@controls, @sp=@sp, @cp=@cpNext, @subjCount=@subjCount, @ctrlCount=@ctrlCount
        IF @alt < @res BEGIN
            SET @res = @alt
        END
        ELSE BEGIN
            UPDATE #Assignments SET cId=@cId, sId=@sId, diff=@diff WHERE sRow=@sp
        END
    END
    INSERT INTO #MemoTable (sRow, cRow, best) VALUES (@sp, @cp, @res)
    RETURN @res
END

下面是你如何调用这个存储过程:

Here is how you call this stored procedure:

-- The procedure uses a temporary table for memoization:
CREATE TABLE #MemoTable (sRow int, cRow int, best int)
-- The procedure returns a table with assignments:
CREATE TABLE #Assignments (sRow int, sId int, cId int, diff int)

DECLARE @subj as SubjTableType
INSERT INTO @SUBJ (row, id, match_field) SELECT ROW_NUMBER() OVER(ORDER BY match_field ASC)-1 AS row, id, match_field FROM subjects
DECLARE @ctrl as ControlTableType
INSERT INTO @ctrl (row, id, match_field) SELECT ROW_NUMBER() OVER(ORDER BY match_field ASC)-1 AS row, id, match_field FROM controls
DECLARE @subjCount int
SET @subjCount = (SELECT COUNT(1) FROM subjects)
DECLARE @ctrlCount int
SET @ctrlCount = (SELECT COUNT(1) FROM controls)
DECLARE @best int
EXEC @best = RecAssign
    @subjects=@subj
,   @controls=@ctrl
,   @sp=0
,   @cp=0
,   @subjCount=@subjCount
,   @ctrlCount=@ctrlCount
SELECT @best
SELECT sId, cId, diff FROM #Assignments

上面的通话假定两个科目控件已过滤的位置,而<$ C $对主题ç> N 复制已经被插入到呼叫前的表值参数(或在DB2的临时表)。

The call above assumes that both subjects and controls have been filtered by location, and that N copies of subjects has been inserted into the table-valued parameter (or the temp table in case of DB2) before making the call.

下面是上sqlfiddle 运行演示。

这篇关于在SQL组合优化匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆