SQL中Delta E(CIE Lab)计算和排序的性能 [英] Performance of Delta E (CIE Lab) calculating and sorting in SQL

查看:183
本文介绍了SQL中Delta E(CIE Lab)计算和排序的性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据库表,其中每一行都是一种颜色。我的目标:给定输入颜色,计算其与数据库表中每种颜色的距离,然后按该距离对结果进行排序。或者说,作为用户故事:当我选择一种颜色时,我想查看与我选择的颜色最相似的颜色列表,并且最接近的匹配项位于列表顶部。



我了解到,为此,各种 Delta E (CIE实验室)公式是最佳选择。我找不到该公式的任何本机SQL实现,因此我写了自己的 Delta E CIE 1976 Delta E CIE 2000 。我对照 python-colormath

1976年的公式很容易用SQL或任何其他语言编写,因为它是一个简单的欧几里德距离计算。在任何大小的数据集上,它对我而言都表现出色且快速(在具有100,000行的颜色表上进行了测试,查询时间不到1秒)。



相反,2000公式很长很复杂。我设法用SQL来实现它,但是它的性能不是很好:查询10,000行大约需要5秒,查询100,000行大约需要1分钟。



名为colorsearchtest 的示例应用(在Python / Flask / Postgres中),以试用我的实现(和我在Heroku上设置演示)。如果您试用此应用程序,则可以清楚地看到1976年和2000年Delta E查询之间的性能差异。



这是颜色表的架构(每个颜色,它将各自的RGB和Lab表示形式存储为三个数值):

 创建表颜色(
id整数NOT NULL,
rgb_r整数,
rgb_g整数,
rgb_b整数,
lab_l双精度,
lab_a双精度,
lab_b double precision
);

这是表中的一些数据(都是随机颜色,由我的应用程序中的脚本生成) :

 插入颜色(id,rgb_r,rgb_g,rgb_b,lab_l,lab_a,lab_b) 
值(902,164,214,189,81.6521019943304793,
-21.2561872439361323,7.08354581694699004);

插入颜色(id,rgb_r,rgb_g,rgb_b,lab_l,lab_a,lab_b)
值(903,113,229,64,81.7930860963098212,
-60.5865728472875205,66.4022741184551819 );

插入颜色(id,rgb_r,rgb_g,rgb_b,lab_l,lab_a,lab_b)
值(904,65,86,78,34.6593864327796624,
-9.95482220634028003,2.02661293272071719 );

...

这是Delta E CIE 2000 SQL函数我使用的是

 创建或替换功能
DELTA_E_CIE2000(双精度,双精度,
双精度,双精度,
双精度,双精度,
双精度,双精度,
双精度)
返回双精度
AS $$


c AS(选择
(CAST($ 1 AS VARCHAR)||','||
CAST($ 2 AS VARCHAR)|| ','||
CAST($ 3 AS VARCHAR)||','||
CAST($ 4 AS VARCHAR)||','||
CAST($ 5 AS VARCHAR)| |','||
CAST($ 6 AS VARCHAR))
AS lab_pair_str,
(($ 1 + $ 4)/
2.0)
AS avg_lp,
SQRT(
POW($ 2,2.0)+
POW($ 3,2.0))
AS c1,
SQRT(
POW(($ 5),2.0)+
POW(($ 6),2.0))
AS c2),
gs AS(选择
c.lab_pair_str,
(0.5 *
(1.0-SQRT(
POW(((c.c1 + c。 c2)/ 2.0),7.0)/(
POW((((c.c1 + c.c2)/ 2.0),7.0)+
POW(25.0,7.0)))))
AS g
从c
WHERE c.lab_pair_str =(
CAST($ 1 AS VARCHAR)|| ’,’||
CAST($ 2 AS VARCHAR)|| ’,’||
CAST($ 3 AS VARCHAR)|| ’,’||
CAST($ 4 AS VARCHAR)|| ’,’||
CAST($ 5 AS VARCHAR)|| ’,’||
CAST($ 6 AS VARCHAR))),
ap AS(选择
gs.lab_pair_str,
((1.0 + gs.g)* $ 2)
AS a1p ,
((1.0 + gs.g)* $ 5)
AS a2p
从gs
到gs.lab_pair_str =(
CAST($ 1 AS VARCHAR)|| ','||
CAST($ 2 AS VARCHAR)||','||
CAST($ 3 AS VARCHAR)||','||
CAST($ 4 AS VARCHAR)| |','||
CAST($ 5 AS VARCHAR)||','||
CAST($ 6 AS VARCHAR))),
cphp AS(SELECT
ap。 lab_pair_str,
SQRT(
POW(ap.a1p,2.0)+
POW($ 3,2.0))
AS c1p,
SQRT(
POW (ap.a2p,2.0)+
POW($ 6,2.0))
AS c2p,

DEGREES(ATAN2($ 3,ap.a1p))+(
情况
学位时(ATAN2($ 3,ap。 a1p))< 0.0
然后360.0
ELSE 0.0
END))
as h1p,

DEGREES(ATAN2($ 6,ap.a2p))+(
案例
学位时(ATAN2($ 6,ap.a2p))< 0.0
THEN 360.0
ELSE 0.0
END))
AS h2p
FROM ap
WHERE ap.lab_pair_str =(
CAST($ 1 AS VARCHAR)||','||
CAST($ 2 AS VARCHAR)||','||
CAST($ 3 AS VARCHAR)||','||
CAST($ 4 AS VARCHAR)||','||
CAST($ 5 AS VARCHAR)||','||
CAST($ 6 AS VARCHAR)),
av AS(SELECT
cphp.lab_pair_str,
((cphp.c1p + cphp.c2p)/
2.0)
AS avg_c1p_c2p,
((((案例
当(ABS(cphp.h1p- cphp.h2p)> 180.0)
然后360.0
ELSE 0.0
END)+
cphp.h1p +
cphp.h2p)/
2.0)
AS avg_hp
从cphp
在哪里cphp.lab_pair_str =(
CAST($ 1 AS VARCHAR)||','||
CAST($ 2 AS VARCHAR)||','||
CAST($ 3 AS VARCHAR)||','||
CAST($ 4 AS VARCHAR)||','||
CAST($ 5 AS VARCHAR)||','| |
CAST($ 6 AS VARCHAR))),
ts AS(SELECT
av.lab_pair_str,
(1.0-
0.17 * COS(RADIANS(av.avg_hp) -30.0))+
0.24 * COS(RADIANS(2.0 * av.avg_hp))+
0.32 * COS(RADIANS(3.0 * av.avg_hp + 6.0))-
0.2 * COS (RADIANS(4.0 * av.avg_hp-63.0)))
AS t,
((
(cphp.h2p-cphp.h1p)+
(CASE
WHEN(ABS(cphp.h2p-cphp.h1p)> 180.0)
然后360.0
ELSE 0.0
END))
-
(案例
当(cphp.h2p> cphp.h1p)
然后720.0
ELSE 0.0
END))
从delta_hlp
从av
内部连接cphp
在av.lab_pair_str = cphp.lab_pair_str
WHERE av.lab_pair_str =(
CAST($ 1 AS VARCHAR)||','||
CAST($ 2 AS VARCHAR)||','||
CAST($ 3 AS VARCHAR )||','||
CAST($ 4 AS VARCHAR)||','||
CAST($ 5 AS VARCHAR)||','||
CAST($ 6 AS VARCHAR))),
d AS(选择
ts.lab_pair_str,
($ 4-$ 1)
AS delta_lp,
(cphp.c2p-cphp.c1p)
AS delta_cp,
(2.0 *(
SQRT(cphp.c 2p * cphp.c1p)*
SIN(RADIANS(ts.delta_hlp)/ 2.0)))
AS delta_hp,
(1.0 +(
(0.015 * POW(c。 avg_lp-50.0,2.0))/
SQRT(20.0 + POW(c.avg_lp-50.0,2.0))))
AS s_l,
(1.0 + 0.045 * av.avg_c1p_c2p)
AS s_c,
(1.0 + 0.015 * av.avg_c1p_c2p * ts.t)
AS s_h,
(30.0 * EXP(-(POW((((av.avg_hp-275.0 )/ 25.0),2.0))))
AS delta_ro,
SQRT(
(POW(av.avg_c1p_c2p,7.0))/
(POW(av.avg_c1p_c2p,7.0 )+ POW(25.0,7.0)))
AS r_c
从ts
内部联接cphp
在ts.lab_pair_str = cphp.lab_pair_str
内部联接c
ts.lab_pair_str = c.lab_pair_str
内部联接av
ts.lab_pair_str = av.lab_pair_str
ts.lab_pair_str =(
CAST($ 1 AS VARCHAR) || ’,’||
CAST($ 2 AS VARCHAR)|| ’,’||
CAST($ 3 AS VARCHAR)|| ’,’||
CAST($ 4 AS VARCHAR)|| ’,’||
CAST($ 5 AS VARCHAR)|| ’,’||
CAST($ 6 AS VARCHAR))),
r AS(选择
d.lab_pair_str,
(-2.0 * d.r_c * SIN(2.0 * RADIANS(d.delta_ro)) ))
AS r_t
从d
WHERE d.lab_pair_str =(
CAST($ 1 AS VARCHAR)||','||
CAST($ 2 AS VARCHAR )||','||
CAST($ 3 AS VARCHAR)||','||
CAST($ 4 AS VARCHAR)||','||
CAST($ 5 AS VARCHAR)||','||
CAST($ 6 AS VARCHAR)))
SELECT
SQRT(
POW(d.delta_lp /(d.s_l * $ 7), 2.0)+
POW(d.delta_cp /(d.s_c * $ 8),2.0)+
POW(d.delta_hp /(d.s_h * $ 9),2.0)+
r .r_t *
(d.delta_cp /(d.s_c * $ 8))*
(d.delta_hp /(d.s_h * $ 9)))
AS delta_e_cie2000
从r
内部联接d
在r.lab_pair_str = d.lab_pair_str
哪里r.lab_pair_str =(
CAST($ 1 AS VARCHAR)|| ’,’||
CAST($ 2 AS VARCHAR)|| ’,’||
CAST($ 3 AS VARCHAR)|| ’,’||
CAST($ 4 AS VARCHAR)|| ’,’||
CAST($ 5 AS VARCHAR)|| ’,’||
CAST($ 6 AS VARCHAR))

$$

语言SQL
不可变
在输入为空时返回NULL;

(我最初使用嵌套的子查询编写了此函数,深度约为10层,但是后来我重新编写了该函数改为使用 WITH 语句,即Postgres CTE。新版本更具可读性,并且性能类似于旧版本。您可以看到代码中的两个版本。)



定义函数后,在如下查询中使用它:

  SELECT c .rgb_r,
c.rgb_g,
c.rgb_b,
DELTA_E_CIE2000(73.9206633504,-50.2996953437,
23.8259166281,
c.lab_l,c.lab_a,c。 lab_b,
1.0、1.0、1.0)
as de2000
从颜色c
订购de2000
限度100;

所以,我的问题是:有什么方法可以改善 DELTA_E_CIE2000 函数,使其可实时用于非平凡的数据集?或者,考虑到公式的复杂性,它是否会达到最快的速度?



根据我在演示应用程序中所做的测试,他说,对于在网站上进行简单的相似颜色搜索的用例,1976年和2000年函数在结果准确性方面的差异实际上可以忽略不计。也就是说,我已经有信心,对于我的需求,1976年的公式足够好。但是,2000函数的确返回更好的结果(很大程度上取决于输入颜色在Lab空间中的位置),实际上,我只是好奇它是否可以进一步加速。

解决方案

两件事:1)您没有完全使用数据库,并且2)您的问题是自定义PostgreSQL扩展的一个很好的例子。这就是为什么。



您仅使用数据库作为存储,将颜色存储为浮点数。在当前配置中,无论查询类型如何,数据库始终将必须检查所有值(进行顺序扫描)。这意味着需要大量的IO和大量的计算(对于很少的返回匹配)。您正在尝试找到最接近的N种颜色,因此有几种方法可以避免对所有数据进行计算。



简单改进



最简单的方法是将计算限制在较小的数据子集中。如果组件之间的差异更大,则可以假定差异会更大。如果您发现组件之间存在安全差异,而结果总是不合适,则可以使用带btree索引的WHERE范围完全排除这些颜色。但是,由于L * a * b颜色空间的性质,这可能会使您的结果恶化。



首先创建索引:

 创建索引color_lab_l_btree使用btree(lab_l)打开颜色; 
CREATE INDEX color_lab_a_btree使用btree(lab_a)打开颜色;
创建索引color_lab_b_btree使用btree(lab_b)打开颜色;

然后,我对您的查询进行了修改,使其包含WHERE子句以仅过滤颜色,其中任何组件均不同最多20个。



更新:再看一遍,添加20的限制很可能会使结果恶化,因为我发现至少

  SELECT 
c.rgb_r,c.rgb_g,c .rgb_b,
DELTA_E_CIE2000(
25.805780252087963,53.33446637366859,-45.03961353720049,
c.lab_l,c.lab_a,c.lab_b,
1.0,1.0,1.0)AS de2000
从颜色c

c.lab_l之间25.805780252087963-20和25.805780252087963 + 20
和c.lab_a之间53.33446637366859-20和53.33446637366859 + 20
和c.lab_b之间-45.03961353720049-20和-45.03961353720049 + 20
ORDER BY de2000;

我用脚本填充了100000种随机颜色并进行了测试:



无索引的时间:44006851 ms



有索引和范围查询的时间:1292932 ms



您也可以将此WHERE子句添加到 delta_e_cie1976_query ,在我的随机数据上,查询时间也从〜110 ms减少到〜22 ms。 / p>

BTW:根据经验,我得到了20:我尝试了10,但是只有380条记录,这似乎有点低,并且由于上限为100,可能会排除一些更好的选择。如果有20个,则完整的记录为2900行,可以肯定的是,最接近的匹配将在那里。我没有详细研究DELTA_E_CIE2000或L * a * b *颜色空间,因此阈值可能需要沿着不同的分量进行调整才能真正实现,但是原则是排除不感兴趣的数据。



用C重写Delta E CIE 2000



您已经说过,Delta E CIE 2000很复杂,非常不适合实现在SQL中。目前在我的笔记本电脑上,每次通话大约需要0.4毫秒。用C实现它可以大大加快此过程。 PostgreSQL将SQL函数的默认成本分配为100,将C函数的缺省成本分配为1。我猜这是基于实际经验。



更新:由于这也划伤了我的一角,因此我重新实现了C语言中colormath模块中的Delta E函数作为PostgreSQL扩展,可在 PGXN 。有了这个,当我查询带有100k条记录的表中的所有记录时,CIE2000的速度提高了约150倍。



有了这个C函数,我得到的查询时间为147 ms和160 ms(100k种颜色)。有了额外的WHERE,查询时间大约是20毫秒,这对我来说似乎可以接受。



最好的,但是高级的解决方案



但是,由于您的问题是在3维空间中进行N个最近邻居搜索,因此您可以使用PostgreSQL 从9.1版开始



为此,您需要输入L将* a * b *组件插入多维数据集。此扩展程序尚不支持距离运算符(它位于),但是即使可以,它也不支持Delta E距离,您需要将其重新实现为C扩展名。



这意味着实现GiST索引运算符类( btree_gist PostgreSQL扩展可以支持根据Delta E距离进行索引。好部分是您可以对不同版本的Delta E使用不同的运算符,例如。对于Delta E CIE 2000,<-> ,对于Delta E CIE 1976,&#; 即使使用Delta E CIE,也可以 真的非常快 2000。



最后,这可能取决于您(业务)的要求和约束条件。


I have a database table where each row is a color. My goal: given an input color, calculate its distance to each color in the DB table, and sort the results by that distance. Or, stated as a user story: when I choose a color, I want to see a list of the colors that are most similar to the one that I picked, with the closest matches at the top of the list.

I understand that, in order to do this, the various Delta E (CIE Lab) formulae are the best choice. I wasn't able to find any native SQL implementations of the formulae, so I wrote my own SQL versions of Delta E CIE 1976 and Delta E CIE 2000. I verified the accuracy of my SQL versions of the formulae, against the results generated by the python-colormath implementations.

The 1976 formula is easy to write, in SQL or in any other language, because it's a simple Euclidean distance calculation. It performs nice and fast for me, on datasets of any size (tested it on a color table with 100,000 rows, and the query takes less than 1 second).

The 2000 formula, in contrast, is very long and complex. I managed to implement it in SQL, but its performance is not great: about 5 seconds to query 10,000 rows, and about 1 minute to query 100,000 rows.

I wrote an example app called colorsearchtest (in Python / Flask / Postgres), to play around with my implementations (and I set up a demo on Heroku). If you try out this app, you can clearly see the performance difference between the 1976 and the 2000 Delta E queries.

This is the schema for the color table (for each color, it stores the respective RGB and Lab representations, as three numeric values):

CREATE TABLE color (
    id integer NOT NULL,
    rgb_r integer,
    rgb_g integer,
    rgb_b integer,
    lab_l double precision,
    lab_a double precision,
    lab_b double precision
);

This is some data in the table (all just random colors, generated by a script in my app):

INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b)
VALUES (902, 164, 214, 189, 81.6521019943304793,
        -21.2561872439361323, 7.08354581694699004);

INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b)
VALUES (903, 113, 229, 64, 81.7930860963098212,
        -60.5865728472875205, 66.4022741184551819);

INSERT INTO color (id, rgb_r, rgb_g, rgb_b, lab_l, lab_a, lab_b)
VALUES (904, 65, 86, 78, 34.6593864327796624,
        -9.95482220634028003, 2.02661293272071719);

...

And this is the Delta E CIE 2000 SQL function that I'm using:

CREATE OR REPLACE FUNCTION
DELTA_E_CIE2000(double precision, double precision,
                double precision, double precision,
                double precision, double precision,
                double precision, double precision,
                double precision)
RETURNS double precision
AS $$

WITH
    c AS (SELECT
            (CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))
        AS lab_pair_str,
            (($1 + $4) /
                2.0)
        AS avg_lp,
            SQRT(
                POW($2, 2.0) +
                POW($3, 2.0))
        AS c1,
            SQRT(
                POW(($5), 2.0) +
                POW(($6), 2.0))
        AS c2),
    gs AS (SELECT
            c.lab_pair_str,
            (0.5 *
                (1.0 - SQRT(
                    POW(((c.c1 + c.c2) / 2.0), 7.0) / (
                        POW(((c.c1 + c.c2) / 2.0), 7.0) +
                        POW(25.0, 7.0)))))
        AS g
        FROM c
        WHERE c.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    ap AS (SELECT
            gs.lab_pair_str,
            ((1.0 + gs.g) * $2)
        AS a1p,
            ((1.0 + gs.g) * $5)
        AS a2p
        FROM gs
        WHERE gs.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    cphp AS (SELECT
            ap.lab_pair_str,
            SQRT(
                POW(ap.a1p, 2.0) +
                POW($3, 2.0))
        AS c1p,
            SQRT(
                POW(ap.a2p, 2.0) +
                POW($6, 2.0))
        AS c2p,
            (
                DEGREES(ATAN2($3, ap.a1p)) + (
                    CASE
                        WHEN DEGREES(ATAN2($3, ap.a1p)) < 0.0
                        THEN 360.0
                        ELSE 0.0
                        END))
        AS h1p,
            (
                DEGREES(ATAN2($6, ap.a2p)) + (
                    CASE
                        WHEN DEGREES(ATAN2($6, ap.a2p)) < 0.0
                        THEN 360.0
                        ELSE 0.0
                        END))
        AS h2p
        FROM ap
        WHERE ap.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    av AS (SELECT
            cphp.lab_pair_str,
            ((cphp.c1p + cphp.c2p) /
                2.0)
        AS avg_c1p_c2p,
            (((CASE
                WHEN (ABS(cphp.h1p - cphp.h2p) > 180.0)
                THEN 360.0
                ELSE 0.0
                END) +
              cphp.h1p +
              cphp.h2p) /
                2.0)
        AS avg_hp
        FROM cphp
        WHERE cphp.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    ts AS (SELECT
            av.lab_pair_str,
            (1.0 -
                0.17 * COS(RADIANS(av.avg_hp - 30.0)) +
                0.24 * COS(RADIANS(2.0 * av.avg_hp)) +
                0.32 * COS(RADIANS(3.0 * av.avg_hp + 6.0)) -
                0.2 * COS(RADIANS(4.0 * av.avg_hp - 63.0)))
        AS t,
            ((
                    (cphp.h2p - cphp.h1p) +
                    (CASE
                        WHEN (ABS(cphp.h2p - cphp.h1p) > 180.0)
                        THEN 360.0
                        ELSE 0.0
                        END))
                -
                (CASE
                    WHEN (cphp.h2p > cphp.h1p)
                    THEN 720.0
                    ELSE 0.0
                    END))
        AS delta_hlp
        FROM av
        INNER JOIN cphp
        ON av.lab_pair_str = cphp.lab_pair_str
        WHERE av.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    d AS (SELECT
            ts.lab_pair_str,
            ($4 - $1)
        AS delta_lp,
            (cphp.c2p - cphp.c1p)
        AS delta_cp,
            (2.0 * (
                SQRT(cphp.c2p * cphp.c1p) *
                SIN(RADIANS(ts.delta_hlp) / 2.0)))
        AS delta_hp,
            (1.0 + (
                (0.015 * POW(c.avg_lp - 50.0, 2.0)) /
                SQRT(20.0 + POW(c.avg_lp - 50.0, 2.0))))
        AS s_l,
            (1.0 + 0.045 * av.avg_c1p_c2p)
        AS s_c,
            (1.0 + 0.015 * av.avg_c1p_c2p * ts.t)
        AS s_h,
            (30.0 * EXP(-(POW(((av.avg_hp - 275.0) / 25.0), 2.0))))
        AS delta_ro,
            SQRT(
                (POW(av.avg_c1p_c2p, 7.0)) /
                (POW(av.avg_c1p_c2p, 7.0) + POW(25.0, 7.0)))
        AS r_c
        FROM ts
        INNER JOIN cphp
        ON ts.lab_pair_str = cphp.lab_pair_str
        INNER JOIN c
        ON ts.lab_pair_str = c.lab_pair_str
        INNER JOIN av
        ON ts.lab_pair_str = av.lab_pair_str
        WHERE ts.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR))),
    r AS (SELECT
            d.lab_pair_str,
            (-2.0 * d.r_c * SIN(2.0 * RADIANS(d.delta_ro)))
        AS r_t
        FROM d
        WHERE d.lab_pair_str = (
            CAST($1 AS VARCHAR) || ',' ||
            CAST($2 AS VARCHAR) || ',' ||
            CAST($3 AS VARCHAR) || ',' ||
            CAST($4 AS VARCHAR) || ',' ||
            CAST($5 AS VARCHAR) || ',' ||
            CAST($6 AS VARCHAR)))
SELECT
        SQRT(
            POW(d.delta_lp / (d.s_l * $7), 2.0) +
            POW(d.delta_cp / (d.s_c * $8), 2.0) +
            POW(d.delta_hp / (d.s_h * $9), 2.0) +
            r.r_t *
            (d.delta_cp / (d.s_c * $8)) *
            (d.delta_hp / (d.s_h * $9)))
    AS delta_e_cie2000
FROM          r
INNER JOIN    d
ON            r.lab_pair_str = d.lab_pair_str
WHERE         r.lab_pair_str = (
          CAST($1 AS VARCHAR) || ',' ||
          CAST($2 AS VARCHAR) || ',' ||
          CAST($3 AS VARCHAR) || ',' ||
          CAST($4 AS VARCHAR) || ',' ||
          CAST($5 AS VARCHAR) || ',' ||
          CAST($6 AS VARCHAR))

$$

LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT;

(I originally wrote this function using nested subqueries about 10 levels deep, but I then re-wrote it to instead use WITH statements, i.e. Postgres CTEs. The new version is much more readable, and performance is similar to the old version. You can see both versions in the code.)

After defining the function, I use it in a query like this:

SELECT        c.rgb_r,
              c.rgb_g,
              c.rgb_b,
        DELTA_E_CIE2000(73.9206633504, -50.2996953437,
                        23.8259166281,
                        c.lab_l, c.lab_a, c.lab_b,
                        1.0, 1.0, 1.0)
    AS de2000
FROM          color c
ORDER BY      de2000
LIMIT         100;

So, my question: is there any way that I could improve the performance of the DELTA_E_CIE2000 function, to make it usable in real-time for non-trivial data sets? Or, considering the complexity of the formula, is that as fast as it's going to get?

From the testing that I've done in my demo app, I'd say that for the use case of a simple "similar colors" search on a web site, the difference in results accuracy between the 1976 and 2000 functions is actually negligible. That is, I'm already confident that for my needs, the 1976 formula is "good enough". However, the 2000 function does return slightly better results (depending very much on where the input color lies in the Lab space), and really, I'm just curious as to whether it could be sped up further.

解决方案

Two things: 1) you are not using the database to its full extent and 2) your problem is a great example for a custom PostgreSQL extension. Here's why.

You are only using database as storage, storing colors as floats. In your current configuration, regardless of the type of query, the database will always have to check all values (make a sequential scan). This means a lot of IO and a lot of calculation for few returned matches. You are trying to find the nearest N colors, so there are a few possibilities on how to avoid performing calculations on all data.

Simple improvement

Simplest is to limit your calculations to a smaller subset of data. You can assume the difference will be greater if the components differ more. If you can find a safe difference between the components, where the results are always inappropriate, you can exclude those colors altogether using ranged WHERE with btree indexes. However, due to nature of L*a*b colorspace, this will likely worsen your results.

First create the indexes:

CREATE INDEX color_lab_l_btree ON color USING btree (lab_l);
CREATE INDEX color_lab_a_btree ON color USING btree (lab_a);
CREATE INDEX color_lab_b_btree ON color USING btree (lab_b);

Then I adapted your query to include a WHERE clause to filter only colors, where any of the components differs for at most 20.

Update: After another look, adding a limit of 20 will very likely worsen the results, since I found at least one point in space, for which this holds true.:

SELECT 
    c.rgb_r, c.rgb_g, c.rgb_b,
    DELTA_E_CIE2000(
        25.805780252087963, 53.33446637366859, -45.03961353720049, 
        c.lab_l, c.lab_a, c.lab_b,
        1.0, 1.0, 1.0) AS de2000
FROM color c 
WHERE 
    c.lab_l BETWEEN 25.805780252087963 - 20 AND 25.805780252087963 + 20 
    AND c.lab_a BETWEEN 53.33446637366859 - 20 AND 53.33446637366859 + 20 
    AND c.lab_b BETWEEN -45.03961353720049 - 20 AND -45.03961353720049 + 20 
ORDER BY de2000 ;

I filled the table with 100000 of random colors with your script and tested:

Time without indexes: 44006,851 ms

Time with indexes and range query: 1293,092 ms

You can add this WHERE clause to delta_e_cie1976_query too, on my random data it drops the query time from ~110 ms to ~22 ms.

BTW: I got the number 20 empirically: I tried with 10, but got only 380 records, which seems a little low and might exclude some better options since the limit is 100. With 20 the full set was 2900 rows and one can be fairly sure that the closest matches will be there. I didn't study the DELTA_E_CIE2000 or L*a*b* color space in detail so the threshold may need adjustment along different components for that to actually be true, but the principle of excluding non-interesting data holds.

Rewrite Delta E CIE 2000 in C

As you've already said, Delta E CIE 2000 is complex and fairly unsuitable for implementing in SQL. It currently uses about 0.4 ms per call on my laptop. Implementing it in C should considerably speed this up. PostgreSQL assigns default cost to SQL functions as 100 and C functions as 1. I'm guessing this is based on real experience.

Update: Since this also scratches one of my itches, I reimplemented the Delta E functions from colormath module in C as a PostgreSQL extension, available on PGXN. With this I can see a speedup of about 150x for CIE2000 when querying all the records from the table with 100k records.

With this C function, I get query times between 147 ms and 160 ms for 100k colors. With extra WHERE, query time is about 20 ms, which seems quite acceptable for me.

Best, but advanced solution

However, since your problem is N nearest neighbor search in 3-dimensional space, you could use K-Nearest-Neighbor Indexing which is in PostgreSQL since version 9.1.

For that to work, you'd put L*a*b* components into a cube. This extension does not yet support distance operator (it's in the works), but even if it would, it would not support Delta E distances and you would need to reimplement it as a C extension.

This means implementing GiST index operator class (btree_gist PostgreSQL extension in contrib does this) to support indexing according to Delta E distances. The good part is you could then use different operators for different versions of Delta E, eg. <-> for Delta E CIE 2000 and <#> for Delta E CIE 1976 and queries would be really really fast for small LIMIT even with Delta E CIE 2000.

In the end it may depend on what your (business) requirements and constraints are.

这篇关于SQL中Delta E(CIE Lab)计算和排序的性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆