SQL中的组合优化匹配 [英] Combinatorial Optimization Matching in SQL

查看:134
本文介绍了SQL中的组合优化匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我无法在SQL中开发匹配算法。我有一张表科目。每个这些需要匹配到表控件中的相同数量的行(为了这个问题,我们假设需要为每个主题选择两行或控件)。所选控件的位置必须完全匹配,并且所选的控件应具有尽可能接近主题的 match_field 中的值。



以下是一些示例数据:



表主题:

  id location match_field 
1 1 190
2 2 2000
3 1 100

表格控件:

  id位置match_field 
17 1 70
11 1 180
12 1 220
13 1 240
14 1 500
15 1 600
16 1 600
10 2 30
78 2 1840
79 2 2250

这将是样本数据的最佳结果: / p>

  subject_id control_id location match_field_diff 
1 12 1 30
1 13 1 50
2 78 2 160
2 79 2 250
3 17 1 30
3 11 1 80

它变得棘手,因为例如,控件11是与主题1最接近的匹配。但是,在最优解中,控件11与主题3相匹配。



我相信匈牙利算法接近于正确的解决这个问题。然而,没有相同数量的科目和控制,也不会使用所有控制(我有几千个科目和几百万个潜在的控制)。



它没有必要获得绝对最佳结果;一个很好的近似值对我来说是很好的。



似乎应该有一个很好的基于set的解决方案来解决这个问题,但我不能想像如何做到这一点以下是一些根据位置为每个主题分配相同数量的控件的代码:

  select * from(
选择subject.id,
control.id,
subject.location,
row_number()over(
partition by subject.location
order by subject.id, control.id
)as rn,
count(distinct control.id)over(
partition by subject.location
)as controls_in_loc
from subjects
加入控件control.location = subject.location

其中mod(rn,controls_in_loc + 1)= 1

但是,我不知道如何添加模糊匹配组件。我正在使用DB2,但如果您使用其他东西,可以将算法转换为DB2。



提前感谢您的帮助!



更新:我深信SQL不是适合这项工作的工具。但是,为了确保(因为这是一个有趣的问题),我提供了一个赏金,看看一个可行的SQL解决方案是否可行。它需要一个基于集的解决方案。它可以使用迭代(循环遍历相同的查询多次以实现结果),但迭代次数需要远小于大表的行数。它不应该遍历表中的每个元素或使用光标。

解决方案

虽然匈牙利算法正在运行,在您的情况下可以使用更简单的算法。您的隐式成本矩阵是一种特殊形式的对称矩阵:

  ABS(SUBJ.match_field-CTRL.match_field)

因此,您可以相对容易地证明在最优分配中 {SUBJ i ,CTRL j } SUBJ.match_field 排序 CTRL.match_field 也将被订购。



证明: 考虑一个作业 {SUBJ SUBJMatch_field 排序的 ,CTRL j } CTRL.match_field 。那么你至少有一个反转,即一对分配 {SUBJ i1 ,CTRL j1 } em> {SUBJ i2 ,CTRL j2 } ,使



SUBJ.match_field i1 < SUBJ.match_field i2 ,但



CTRL.match_field j1 > CTRL.match_field j2



然后,您可以用非反向替换倒置对



{SUBJ i1 ,CTRL j2 } {SUBJ i2 ,CTRL j1 }



小于或等于 SUBJ.match_field i1 )的所有六个相对展示位置的反转分配成本的成本 i2 )和 CTRL.match_field j1 j2 ) a href =http://www.wolframalpha.com/input/?i=b%3Ea,B%3CA,abs%28a-A%29%2babs%28b-B%29%3E=abs%28a-B %29%2babs%28b-A%29rel =noreferrer>链接到Wolfram Alpha )。 :证明



有了这个观察结果,很容易证明 动态编程 算法可以得出最佳赋值:




  • 使 N 每个主题的重复;订单由 match_field

  • 订单控制按 match_field

  • 准备一个空数组作业的大小 N * subject.SIZE

  • 准备一个空的2D数组 mem 大小 N * subject.SIZE memoization 的control.SIZE 将所有元素设置为 -1

  • 调用 Recursive_Assign / li>
  • 作业表格现在包含 N 每个主题的作业<$ c $在 N * i (含)和 N *(i + 1) i $ c>,独家。






  FUNCTION Recursive_Assign 
//主题包含每个原始subj重复N次
PARAM主题:int [subjectLength]的数组
PARAM控件:int [controlLength]的数组
PARAM mem:array的int [subjectLength,controlLength]
PARAM sp:int //当前主题位置
PARAM cp:int //当前控制位置
PARAM assign:int [subjectLength]的数组
BEGIN
IF sp == subjects.Length THEN RETURN 0 ENDIF
IF mem [sp,cp]> 0 THEN RETURN mem [sp,cp] ENDIF
int res = ABS(subject [sp] - controls [cp])
+ Recursive_Assign(subject,controls,mem,sp + 1,cp + assign)
assign [sp] = cp
IF cp + 1 + subjects.Length-sp< control.Length THEN
int alt = Recursive_Assign(subject,controls,mem,sp,cp + 1,assign)
如果alt< res THEN
res = alt
ELSE
assign [sp] = cp
ENDIF
ENDIF
返回(mem [sp,cp] = res)
END






以下是使用C#在ideone上的上述伪代码的实现。



此算法可以在SQL中重新编写为基于set的。尝试将其适应原始的问题设置(通过位置分组和创建主题的多个副本)会为已经相当复杂的过程增加不必要的复杂程度,因此我将通过使用表格来简化事情SQL Server的值参数。我不确定DB2是否提供类似的功能,但如果没有,您应该可以用临时表替换它们。



下面的存储过程几乎是直接的将上述伪代码转换为SQL Server的存储过程语法:

  CREATE TYPE SubjTableType AS TABLE(row int,id int,match_field int)
CREATE TYPE Con​​trolTableType AS TABLE(row int,id int,match_field int)
CREATE PROCEDURE RecAssign(
@subjects SubjTableType READONLY
,@controls ControlTableType READONLY
,@sp int
,@cp int
,@subjCount int
,@ctrlCount int
)AS BEGIN
IF @sp = @subjCount BEGIN
返回0
END
IF 1 =(SELECT COUNT(1)FROM #MemoTable WHERE sRow = @ sp AND cRow = @ cp)BEGIN
RETURN(SELECT best FROM #MemoTable WHERE sRow = @ sp AND cRow = @ cp)
END
DECLARE @res int,@spNext int,@cpNext int,@prelim int,@alt int,@di ff int,@sId int,@cId int
SET @spNext = @sp + 1
SET @cpNext = @cp + 1
SET @sId =(SELECT id FROM @subjects WHERE row = @sp)
SET @cId =(SELECT id FROM @controls WHERE row = @cp)
EXEC @prelim = RecAssign @ subjects = @ subject,@ controls = @ controls,@ sp = @ spNext ,@ cp = @ cpNext,@ subjCount = @ subjCount,@ ctrlCount = @ ctrlCount
SET @diff = ABS((SELECT match_field FROM @subjects WHERE row = @ sp) - (SELECT match_field FROM @controls WHERE row = @cp))
SET @res = @prelim + @diff
IF 1 =(SELECT COUNT(1)FROM #Assignments WHERE sRow = @ sp)BEGIN
更新#Assignments SET cId = @cId,sId = @ sId,diff = @ diff WHERE sRow = @ sp
END
ELSE BEGIN
INSERT INTO #Assignments(sRow,sId,cId,diff)VALUES(@sp, @sId,@cId,@diff)
END
IF @ cp + 1 + @ subjCount- @ sp< @ctrlCount BEGIN
EXEC @alt = RecAssign @ subjects = @ subject,@ controls = @ controls,@ sp = @ sp,@ cp = @ cpNext,@ subjCount = @ subjCount,@ ctrlCount = @ ctrlCount
IF @alt @res BEGIN
SET @res = @alt
END
ELSE BEGIN
更新#Assignments SET cId = @ cId,sId = @ sId,diff = @ diff WHERE sRow = @ sp
END
END
INSERT INTO #MemoTable(sRow,cRow,best)VALUES(@sp,@cp,@res)
RETURN @res
END

这是您如何调用此存储过程:

   - 该过程使用临时表进行记忆:
CREATE TABLE #MemoTable(sRow int,cRow int,best int)
- 过程返回具有赋值的表:
CREATE TABLE #Assignments(sRow int,sId int,cId int,diff int)

DECLARE @subj作为SubjTableType
INSERT INTO @SUBJ(row,id ,match_field)SELECT ROW_NUMBER()OVER(ORDER BY match_field ASC)-1 AS row,id,match_field FROM subjects
DECLARE @ctrl as ControlTableType
INSERT INTO @ctrl(row,id,match_field)SELECT ROW_NUMBER ()OVER(ORDER BY match_field ASC)-1 AS row,id,match_field FROM控制
DECLARE @subjCount int
SET @subjCount =(SELECT COUNT(1)FROM subjects)
DECLARE @ctrlCount int
SET @ctrlCount =(SELECT COUNT(1)FROM controls )
DECLARE @best int
EXEC @best = RecAssign
@ subjects = @ subj
,@ controls = @ ctrl
,@ sp = 0
,@ cp = 0
,@ subjCount = @ subjCount
,@ ctrlCount = @ ctrlCount
SELECT @best
SELECT sId,cId,diff FROM #Assignments

上面的调用假定主题控件已被位置过滤,并且已插入主题 N 副本进入表值参数(或DB2中的临时表)。



这里是在sqlfiddle上运行的演示


I am having trouble developing a matching algorithm in SQL. I have one table subjects. Each of these needs to be matched to the same number of rows in the table controls (for the sake of this question let's say two rows or controls need to be selected for each subject). The location of the selected controls must match exactly, and the controls selected should have a value in match_field that is as close as possible to the subjects.

Here is some sample data:

Table subjects:

id   location    match_field
1    1           190
2    2           2000
3    1           100

Table controls:

id   location    match_field
17    1          70
11    1          180
12    1          220
13    1          240
14    1          500
15    1          600
16    1          600
10    2          30
78    2          1840
79    2          2250

Here would be the optimum result from the sample data:

subject_id control_id  location    match_field_diff
1          12          1           30
1          13          1           50
2          78          2           160
2          79          2           250
3          17          1           30
3          11          1           80

It gets tricky, because, for example, control 11 is the closest match to subject 1. However, in the optimum solution control 11 is matched to subject 3.

I believe the Hungarian Algorithm is close to the "correct" solution to this problem. However, there is not an equal number of subjects and controls, nor will all controls be used (I have a few thousand subjects and a few million potential controls).

It is not necessary to obtain the absolute optimum results; a pretty good approximation would be fine with me.

It seems that there should be a nice set-based solution to this problem, but I can't think of how to do it. Here is some code that assigns an equal number of controls to each subject based on location only:

select * from (
    select   subject.id, 
             control.id,
             subject.location,
             row_number() over (
                 partition by subject.location
                 order by subject.id, control.id
             ) as rn,
             count(distinct control.id)     over (
                 partition by subject.location
             ) as controls_in_loc
         from subjects
         join controls on control.location = subject.location
    )
    where mod(rn,controls_in_loc+1) = 1

However, I can't figure out how to add the fuzzy matching component. I am using DB2 but can convert an algorithm into DB2 if you are using something else.

Thanks in advance for your help!

Update: I am mostly convinced that SQL is not the right tool for this job. However, just to be sure (and because it is an interesting problem), I am offering a bounty to see if a working SQL solution is possible. It needs to be a set-based solution. It can use iteration (looping over the same query multiple times to achieve the result) but number of iterations needs to be far less than the number of rows for a large table. It should not loop over each element in the table or use cursors.

解决方案

Although the Hungarian Algorithm is going to work, a much simpler algorithm can be used in your case. Your implicit cost matrix is a symmetric matrix of a special form:

ABS(SUBJ.match_field-CTRL.match_field)

Therefore, you can relatively easily prove that in an optimal assignment {SUBJi, CTRLj} ordered by SUBJ.match_field the values of CTRL.match_field will be ordered as well.

Proof: Consider an assignment {SUBJi, CTRLj} ordered by SUBJ.match_field that is not ordered by CTRL.match_field. Then you have at least one inversion, i.e. a pair of assignments {SUBJi1, CTRLj1} and {SUBJi2, CTRLj2} such that

SUBJ.match_fieldi1 < SUBJ.match_fieldi2, but

CTRL.match_fieldj1 > CTRL.match_fieldj2

Then you can replace the inverted pair with a non-inverted one

{SUBJi1, CTRLj2} and {SUBJi2, CTRLj1}

of a cost that is less than or equal to the cost of the inverted assignment for all six relative placements of SUBJ.match_field(i1, i2) and CTRL.match_field(j1, j2) (link to Wolfram Alpha). :Proof

With this observation in hand, it is easy to prove that the dynamic programming algorithm below comes up with the optimal assignment:

  • Make N duplicates of each subject; order by match_field
  • Order controls by match_field
  • Prepare an empty array assignments of size N * subject.SIZE
  • Prepare an empty 2D array mem of size N * subject.SIZE by control.SIZE for memoization; set all elements to -1
  • Call Recursive_Assign defined in pseudocode below
  • The assignments table now contains N assignments for each subject i at positions between N*i, inclusive, and N*(i+1), exclusive.

FUNCTION Recursive_Assign
    // subjects contains each original subj repeated N times
    PARAM subjects : array of int[subjectLength]
    PARAM controls: array of int[controlLength]
    PARAM mem : array of int[subjectLength,controlLength]
    PARAM sp : int // current subject position
    PARAM cp : int // current control position
    PARAM assign : array of int[subjectLength]
BEGIN
    IF sp == subjects.Length THEN RETURN 0 ENDIF
    IF mem[sp, cp] > 0 THEN RETURN mem[sp, cp] ENDIF
    int res = ABS(subjects[sp] - controls[cp])
            + Recursive_Assign(subjects, controls, mem, sp + 1, cp + 1, assign)
    assign[sp] = cp
    IF cp+1+subjects.Length-sp < controls.Length THEN
        int alt = Recursive_Assign(subjects, controls, mem, sp, cp + 1, assign)
        IF alt < res THEN
            res = alt
        ELSE
            assign[sp] = cp
        ENDIF
    ENDIF
    RETURN (mem[sp, cp] = res)
END


Here is an implementation of the above pseudocode using C# on ideone.

This algorithm is ready to be re-written as set-based in SQL. Trying to fit it into the original problem setting (with grouping by locations and making multiple copies of the subject) would add unnecessary layer of complexity to a procedure that is already rather complex, so I am going to simplify things quite a bit by using table-valued parameters of SQL Server. I am not sure if DB2 provides similar capabilities, but if it does not, you should be able to replace them with temporary tables.

The stored procedure below is a nearly direct transcription of the above pseudocode into SQL Server's syntax for stored procedures:

CREATE TYPE SubjTableType AS TABLE (row int, id int, match_field int)
CREATE TYPE ControlTableType AS TABLE (row int, id int, match_field int)
CREATE PROCEDURE RecAssign (
    @subjects SubjTableType READONLY
,   @controls ControlTableType READONLY
,   @sp int
,   @cp int
,   @subjCount int
,   @ctrlCount int
) AS BEGIN
    IF @sp = @subjCount BEGIN
        RETURN 0
    END
    IF 1 = (SELECT COUNT(1) FROM #MemoTable WHERE sRow=@sp AND cRow=@cp) BEGIN
        RETURN (SELECT best FROM #MemoTable WHERE sRow=@sp AND cRow=@cp)
    END
    DECLARE @res int, @spNext int, @cpNext int, @prelim int, @alt int, @diff int, @sId int, @cId int
    SET @spNext = @sp + 1
    SET @cpNext = @cp + 1
    SET @sId = (SELECT id FROM @subjects WHERE row = @sp)
    SET @cId = (SELECT id FROM @controls WHERE row = @cp)
    EXEC @prelim = RecAssign @subjects=@subjects, @controls=@controls, @sp=@spNext, @cp=@cpNext, @subjCount=@subjCount, @ctrlCount=@ctrlCount
    SET @diff = ABS((SELECT match_field FROM @subjects WHERE row=@sp)-(SELECT match_field FROM @controls WHERE row=@cp))
    SET @res = @prelim + @diff
    IF 1 = (SELECT COUNT(1) FROM #Assignments WHERE sRow=@sp) BEGIN
        UPDATE #Assignments SET cId=@cId, sId=@sId, diff=@diff WHERE sRow=@sp
    END
    ELSE BEGIN
        INSERT INTO #Assignments(sRow, sId, cId, diff) VALUES (@sp, @sId, @cId, @diff)
    END
    IF @cp+1+@subjCount-@sp < @ctrlCount BEGIN
        EXEC @alt = RecAssign @subjects=@subjects, @controls=@controls, @sp=@sp, @cp=@cpNext, @subjCount=@subjCount, @ctrlCount=@ctrlCount
        IF @alt < @res BEGIN
            SET @res = @alt
        END
        ELSE BEGIN
            UPDATE #Assignments SET cId=@cId, sId=@sId, diff=@diff WHERE sRow=@sp
        END
    END
    INSERT INTO #MemoTable (sRow, cRow, best) VALUES (@sp, @cp, @res)
    RETURN @res
END

Here is how you call this stored procedure:

-- The procedure uses a temporary table for memoization:
CREATE TABLE #MemoTable (sRow int, cRow int, best int)
-- The procedure returns a table with assignments:
CREATE TABLE #Assignments (sRow int, sId int, cId int, diff int)

DECLARE @subj as SubjTableType
INSERT INTO @SUBJ (row, id, match_field) SELECT ROW_NUMBER() OVER(ORDER BY match_field ASC)-1 AS row, id, match_field FROM subjects
DECLARE @ctrl as ControlTableType
INSERT INTO @ctrl (row, id, match_field) SELECT ROW_NUMBER() OVER(ORDER BY match_field ASC)-1 AS row, id, match_field FROM controls
DECLARE @subjCount int
SET @subjCount = (SELECT COUNT(1) FROM subjects)
DECLARE @ctrlCount int
SET @ctrlCount = (SELECT COUNT(1) FROM controls)
DECLARE @best int
EXEC @best = RecAssign
    @subjects=@subj
,   @controls=@ctrl
,   @sp=0
,   @cp=0
,   @subjCount=@subjCount
,   @ctrlCount=@ctrlCount
SELECT @best
SELECT sId, cId, diff FROM #Assignments

The call above assumes that both subjects and controls have been filtered by location, and that N copies of subjects has been inserted into the table-valued parameter (or the temp table in case of DB2) before making the call.

Here is a running demo on sqlfiddle.

这篇关于SQL中的组合优化匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆