SQL:查找行之间的最长公共字符串 [英] SQL: Find longest common string between rows

查看:175
本文介绍了SQL:查找行之间的最长公共字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一张桌子T1:

    Col
   -------
  1 THE APPLE
 THE APPLE
 THE APPLE 123
 THE APPLE 12/16
 BEST THE APPLE

我想要T2:

 Result
--------
 THE APPLE

我正在使用Redshift,寻找在SQL中进行模糊字符串匹配的方法.列的最大长度为100个字符.我绝不必比较多于25行.

I am using Redshift, Looking for some way to do a fuzzystring match in SQL. Longest length possible for column is 100 characters. At no point will I have to compare more than 25 rows.

推荐答案

此问题需要相当程度的复杂性才能解决,并且其运行时间将随着字符串长度和记录数量的增加而急剧增加.但是,鉴于表T1很小,您可以仅使用下面的PL/pgSQL函数进行管理.

This question requires a fair degree of complication to solve and it's running time will drastically increase with increasing string lengths and number of records. However, given that your table T1 is rather small, you might just manage with the below PL/pgSQL function.

  1. 找到T1(col)中的最短值.这是所有记录中最长的匹配项.这是候选字符串.
  2. 查看候选对象是否存在于T1的所有其他行中.如果是这样,请将当前的候选者返回结果集.
  3. 将候选项移动到最短值的一个位置,然后返回到步骤2,直到候选项到达最短字符串的末尾.
  4. 如果找到匹配的候选者,则从函数中返回.否则,将候选者缩短1,然后从最短字符串的开头重新开始,然后移至步骤2.如果无法从最短字符串中提取出更多候选者,则返回NULL.
  1. Find the shortest value in T1(col). This is the longest possible match among all records. This is the candidate string.
  2. See if the candidate is present in all other rows of T1. If so, return the present candidate to the result set.
  3. Move the candidate over one position in the shortest value and back to step 2 until the candidate reaches the end of the shortest string.
  4. If matching candidates have been found, return from the function. Otherwise, shorten the candidate by 1 and start over from the beginning of the shortest string and move to step 2. If no more candidates can be extracted from the shortest string, then return NULL.

代码

下面的代码中重要的是检查匹配是否短路:只要单个记录与候选字符串不匹配col,就无需进一步检查.因此,对于长字符串,实际上是将最短的字符串与另一个字符串进行比较,仅当候选字符串太短以至于它们确实更普遍时,才增加所检查的行.

Code

The important thing in the code below is the short-circuit in the check for a match: as soon as a single record does not match col to the candidate string there is no need to check further. So for long strings, the comparison is really from the shortest string with one other string, increasing the rows examined only when candidate strings get so short that they are indeed more ubiquitous.

字符串比较区分大小写;如果要使其不区分大小写,请将LIKE更改为ILIKE.作为一项奖励功能,您将获得所有行中都存在的多个匹配字符串(显然都具有相同的长度).不利的一面是,一旦下降到单个字符匹配,它将报告多个相同的字符串(可能还有一些2字符及更长的字符串).您可以使用SELECT DISTINCT *删除那些重复项.

The string comparison is case-sensitive; If you want to make it case-insensitive, change LIKE to ILIKE. As a bonus feature, you will get multiple matching strings (obviously all of the same length) that are all present in all rows. On the down side, it will report multiple identical strings once it gets down to single char matches (and possible some 2-char and longer strings). You can use a SELECT DISTINCT * to get rid of those duplicates.

CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$
DECLARE
  shortest  varchar;       -- The shortest string in T1(col) so the longest possible match
  candidate varchar;       -- Candidate string to test
  sz_sh     integer;       -- Length of "shortest"
  l         integer := 1;  -- Starting position of "candidate" in "shortest"
  sz        integer;       -- Length of "candidate"
  fail      boolean;       -- Has "candidate" been found in T1(col)?
  found_one boolean := false; -- Flag if we found at least one match
BEGIN
  -- Find the shortest string and its size, don't worry about multiples, need just 1
  SELECT col, char_length(col) INTO shortest, sz_sh
  FROM T1
  ORDER BY char_length(col) ASC NULLS LAST
  LIMIT 1;

  -- Get all the candidates from the shortest string and test them from longest to single char
  candidate := shortest;
  sz := sz_sh;
  LOOP
    -- Check rows in T1 if they contain the candidate string.
    -- Short-circuit as soon as a record does not match the candidate
    <<check_T1>>
    BEGIN
      FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP
        EXIT check_T1 WHEN fail;
      END LOOP;
      -- Block was not exited, so the candidate is present in all rows: we have a match
      RETURN NEXT candidate;
      found_one := true;
    END;

    -- Produce the next candidate
    IF l+sz > sz_sh THEN -- "candidate" reaches to the end of "shortest"
      -- Exit if we already have at least one matching candidate
      EXIT WHEN found_one;
      -- .. otherwise shorthen the candidate
      sz := sz - 1;
      l := 1;
    ELSE
      -- Exit with empty result when all candidates have been examined
      EXIT WHEN l = sz_sh;
      -- .. otherwise move one position over to get the next candidate
      l := l + 1;
    END IF;
    candidate := substring(shortest from l for sz);
  END LOOP;

  RETURN;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

致电SELECT * FROM find_longest_string_in_T1()应该可以解决问题.

Calling SELECT * FROM find_longest_string_in_T1() should do the trick.

创建一些测试数据:

INSERT INTO T1 
  SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);

这将生成100行,每行106个字符,并产生匹配项"hello"和"match"(以及其他匹配项的可能性很小).这样可以在不到半秒的时间内生成正确的两个字符串(没有多余的装饰,如Ubuntu服务器,PG 9.3,CPU i5、4GB RAM).

This generates 100 rows of 106 characters long and yields the matches "hello" and "match" (and very unlikely any other matches). This produces the correct two strings in less than half a second (no frills Ubuntu server, PG 9.3, CPU i5, 4GB of RAM).

这篇关于SQL:查找行之间的最长公共字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆