SQL:查找行之间的最长公共字符串 [英] SQL: Find longest common string between rows

查看：175 发布时间：2020/8/23 3:58:29 sql string postgresql amazon-redshift

本文介绍了SQL:查找行之间的最长公共字符串的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一张桌子T1:

    Col
   -------
  1 THE APPLE
 THE APPLE
 THE APPLE 123
 THE APPLE 12/16
 BEST THE APPLE

我想要T2:

 Result
--------
 THE APPLE

我正在使用Redshift，寻找在SQL中进行模糊字符串匹配的方法.列的最大长度为100个字符.我绝不必比较多于25行.

I am using Redshift, Looking for some way to do a fuzzystring match in SQL. Longest length possible for column is 100 characters. At no point will I have to compare more than 25 rows.

代码

下面的代码中重要的是检查匹配是否短路:只要单个记录与候选字符串不匹配col，就无需进一步检查.因此，对于长字符串，实际上是将最短的字符串与另一个字符串进行比较，仅当候选字符串太短以至于它们确实更普遍时，才增加所检查的行.

Code

The important thing in the code below is the short-circuit in the check for a match: as soon as a single record does not match col to the candidate string there is no need to check further. So for long strings, the comparison is really from the shortest string with one other string, increasing the rows examined only when candidate strings get so short that they are indeed more ubiquitous.

字符串比较区分大小写；如果要使其不区分大小写，请将LIKE更改为ILIKE.作为一项奖励功能，您将获得所有行中都存在的多个匹配字符串(显然都具有相同的长度).不利的一面是，一旦下降到单个字符匹配，它将报告多个相同的字符串(可能还有一些2字符及更长的字符串).您可以使用SELECT DISTINCT *删除那些重复项.

The string comparison is case-sensitive; If you want to make it case-insensitive, change LIKE to ILIKE. As a bonus feature, you will get multiple matching strings (obviously all of the same length) that are all present in all rows. On the down side, it will report multiple identical strings once it gets down to single char matches (and possible some 2-char and longer strings). You can use a SELECT DISTINCT * to get rid of those duplicates.

CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$
DECLARE
  shortest  varchar;       -- The shortest string in T1(col) so the longest possible match
  candidate varchar;       -- Candidate string to test
  sz_sh     integer;       -- Length of "shortest"
  l         integer := 1;  -- Starting position of "candidate" in "shortest"
  sz        integer;       -- Length of "candidate"
  fail      boolean;       -- Has "candidate" been found in T1(col)?
  found_one boolean := false; -- Flag if we found at least one match
BEGIN
  -- Find the shortest string and its size, don't worry about multiples, need just 1
  SELECT col, char_length(col) INTO shortest, sz_sh
  FROM T1
  ORDER BY char_length(col) ASC NULLS LAST
  LIMIT 1;

  -- Get all the candidates from the shortest string and test them from longest to single char
  candidate := shortest;
  sz := sz_sh;
  LOOP
    -- Check rows in T1 if they contain the candidate string.
    -- Short-circuit as soon as a record does not match the candidate
    <<check_T1>>
    BEGIN
      FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP
        EXIT check_T1 WHEN fail;
      END LOOP;
      -- Block was not exited, so the candidate is present in all rows: we have a match
      RETURN NEXT candidate;
      found_one := true;
    END;

    -- Produce the next candidate
    IF l+sz > sz_sh THEN -- "candidate" reaches to the end of "shortest"
      -- Exit if we already have at least one matching candidate
      EXIT WHEN found_one;
      -- .. otherwise shorthen the candidate
      sz := sz - 1;
      l := 1;
    ELSE
      -- Exit with empty result when all candidates have been examined
      EXIT WHEN l = sz_sh;
      -- .. otherwise move one position over to get the next candidate
      l := l + 1;
    END IF;
    candidate := substring(shortest from l for sz);
  END LOOP;

  RETURN;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

致电SELECT * FROM find_longest_string_in_T1()应该可以解决问题.

Calling SELECT * FROM find_longest_string_in_T1() should do the trick.

创建一些测试数据:

INSERT INTO T1 
  SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);

这将生成100行，每行106个字符，并产生匹配项"hello"和"match"(以及其他匹配项的可能性很小).这样可以在不到半秒的时间内生成正确的两个字符串(没有多余的装饰，如Ubuntu服务器，PG 9.3，CPU i5、4GB RAM).

This generates 100 rows of 106 characters long and yields the matches "hello" and "match" (and very unlikely any other matches). This produces the correct two strings in less than half a second (no frills Ubuntu server, PG 9.3, CPU i5, 4GB of RAM).

这篇关于SQL:查找行之间的最长公共字符串的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

SQL:查找行之间的最长公共字符串 [英] SQL: Find longest common string between rows

问题描述

推荐答案

代码

Code

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

SQL:查找行之间的最长公共字符串 [英] SQL: Find longest common string between rows

问题描述

推荐答案

代码

Code

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭