SQL字符串操作,查找所有排列 [英] SQL String manipulation, find all permutations

查看:69
本文介绍了SQL字符串操作,查找所有排列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以我在列中有一组字符串,格式为'/A/B/C/D/E'

So I have a set of strings in a column in this format '/A/B/C/D/E'

A,B,C,D,E代表不同长度的字符串.我如何从上面的字符串变成一组始终保持降序,按时间顺序排列的字符串.我的意思是,A只能跟在B后面,B只能跟在C后面,并且必须以A开头.结果如下:

A, B, C, D, E represent strings of varying lengths. How would I go from the string above into a set of strings that always maintains descending, chronological order. What I mean is A can only be followed by B, B can only be followed by C and must be preceded by A. The result would be the following:

结果:

'/A'

'/A/B'

'/A/B/C'

'/A/B/C/D'

'/A/B/C/D/E'

推荐答案

我刚想到几天前我写了一个函数,它非常适合这种类型的事情.该函数在下面,请注意我的代码注释以获取有关该函数的更多详细信息.

It just occurred to me that I wrote a function a few days ago that is perfect for this type of thing. The function is below, note my code comments for more details about the function.

解决方案

select itemNumber = tokenlen/2, leftToken
from dbo.edgeNgrams8k('/A/B/C/D/E')
where tokenlen % 2 = 0;

结果

itemNumber           leftToken
-------------------- -----------
1                    /A
2                    /A/B
3                    /A/B/C
4                    /A/B/C/D
5                    /A/B/C/D/E

功能

if object_id('dbo.edgeNgrams8k', 'IF') is not null drop function dbo.edgeNgrams8k;
go
create function dbo.edgeNgrams8k(@string varchar(8000))
/*****************************************************************************************
Purpose
  edgeNgrams8k is an inline table valued function (itvf) that accepts a varchar(8000) 
  input string (@string) and returns a series of character-level left and right edge 
  n-grams. An edge n-gram (referred to herin as an "edge-gram" for brevity) is a type of 
  n-gram (see https://en.wikipedia.org/wiki/N-gram). Instead of a contiguous series of 
  n-sized tokens (n-grams), however, an edge n-gram is a series of tokens that that begin 
  with the input string's first character then increases by one character, the next in the
  string, unitl the token is as long as the input string. 

  Left edge-grams start at the beginning of the string and grow from left-to-right. Right
  edge-grams begin at the end of the string and grow from right-to-left. Note this query
  and the result-set:

  select * from dbo.edgeNgrams8k('ABC');

  tokenlen   leftToken    rightTokenIndex  righttoken
  ---------- ------------ ---------------- ----------
  1          A            3                C
  2          AB           2                BC
  3          ABC          1                ABC

Developer Notes:
 1. For more about N-Grams in SQL Server see: http://www.sqlservercentral.com/articles/Tally+Table/142316/
    For more about Edge N-Grams see the documentation by Elastic here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html 

 2. dbo.edgeNgrams8k is deterministic. For more about determinism see: https://docs.microsoft.com/en-us/sql/relational-databases/user-defined-functions/deterministic-and-nondeterministic-functions

 3. If you need to sort this data without getting a sort in your execution plan you can 
    sort by tokenLen for ascending order, or by rightTokenIndex for descending order.

------------------------------------------------------------------------------------------
Usage Examples:
  I need to turn /A/B/C/D/E into:
  /A
  /A/B
  .....
  /A/B/C/D/E

  select leftToken 
    from dbo.edgeNgrams8k('/A/B/C/D/E')
  where tokenLen % 2 = 0

------------------------------------------------------------------------------------------
History:
 20171125 - Initial Development - Developed by Alan Burstein  
*****************************************************************************************/
returns table with schemabinding as return
with iTally(n) as 
(
  select top (len(@string)) row_number() over (order by (select $))
  from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a(x), -- 10^1 = 10
       (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b(x), -- 10^2 = 100
       (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) c(x), -- 10^3 = 1000
       (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d(x)  -- 10^4 = 10000
)
select top (convert(bigint, len(@string), 0))
  tokenlen        = n,
  leftToken       = substring(@string,1,n),
  rightTokenIndex = len(@string)+1-n,
  righttoken      = substring(@string,len(@string)+1-n, n)
from itally;
go

更新-性能测试

为了强调我的观点,为什么要使用基于集合的方法,我准备了100K Row测试.

To emphasize my point about why set based is the way to go, I prepared a 100K Row test.

首先是递归CTE解决方案,然后是我的 first 解决方案,作为内联表值函数.您需要 INLINE 表值函数,因为它们可以从并行处理中受益,稍后我将向您展示.功能:

First the recursive CTE solution, then my first solution in as inline table valued functions. You want INLINE table valued functions as they can benefit from parallel processing as I'll show you in a moment. The functions:

-- Gordon's logic as an inline table valued function
create function dbo.rCTE_GL (@string varchar(8000))
returns table as return
with x as (select @string as col),
     cte as (
      select col
      from x
      union all
      select left(col, len(col) - charindex('/', reverse(col))) as col
      from cte
      where col like '/%/%'
     )
select *
from cte;
GO

-- My logic as a table valued function
create function dbo.tally_AB(@string varchar(8000))
returns table as return    
with iTally(n) as 
( select top (len(@string)/2) (row_number() over (order by (select null))-1)*2+2
  from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a(x),
       (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b(x)) -- up to 100 character
select txt = substring(reOrder.newString, 1, n)
from iTally
cross apply
(
  select '/'+substring(@string,n,1)
  from iTally
  order by substring(@string,n,1)
  for xml path('')
) reOrder(newString);
GO

性能测试

我正在生成一个ID为100K的行,这样我们就可以知道字符串来自何处.首先是rCTE解决方案,然后是我的每个解决方案都有串行执行计划和并行执行计划(使用跟踪标志8649).

I'm generating 100K rows with an id so we can tell where the string came from. First the rCTE solution then each of my solutions with a serial execution plan and a parallel execution plan (using trace flag 8649).

-- sample data
if object_id('tempdb..#strings') is not null drop table #strings;
create table #strings(id int identity, string varchar(20))

insert #strings(string)
select top (100000)
  isnull(left(stuff(stuff(stuff(stuff(stuff('/'+
  replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
  newid(),'-',''),'0',''),'9',''),'8',''),'7',''),'6',''),'5',''),'4',''),'3',''),'2',''),'1',''),
  3,0,'/'),5,0,'/'),7,0,'/'),9,0,'/'),11,0,'/'),12 - (abs(checksum(newid())%2) * 2)),'/A/B/C/')
from sys.all_columns a, sys.all_columns b;

-- Performance test
set nocount on;
set statistics io, time on

PRINT char(13)+char(10)+'Gordon (unsorted)'+char(13)+char(10)+replicate('-',150);
  declare @throwaway varchar(8000);
  select @throwaway = col
  from #strings s
  cross apply dbo.rCTE_GL(s.string);

PRINT char(13)+char(10)+'Alan (sorted) - serial'+char(13)+char(10)+replicate('-',150);
  select @throwaway = txt
  from #strings s
  cross apply dbo.tally_AB(s.string);

PRINT char(13)+char(10)+'Alan (sorted) - parallel'+char(13)+char(10)+replicate('-',150);
  select @throwaway = txt
  from #strings s
  cross apply dbo.tally_AB(s.string)
  option (querytraceon 8649);

PRINT char(13)+char(10)+'Alan (unsorted) - serial'+char(13)+char(10)+replicate('-',150);
  select @throwaway = leftToken
  from #strings s
  cross apply dbo.edgeNgrams8k(s.string)
  where tokenLen % 2 = 0;

PRINT char(13)+char(10)+'Alan (unsorted) - parallel'+char(13)+char(10)+replicate('-',150);
  select @throwaway = leftToken
  from #strings s
  cross apply dbo.edgeNgrams8k(s.string)
  where tokenLen % 2 = 0
  option (querytraceon 8649);
set statistics io, time off;
GO

结果

Gordon (unsorted)
------------------------------------------------------------------------------------------------------------------------------------------------------
Table 'Worktable'. **Scan count 100001, logical reads 3492199,** physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#strings____________________________________________________________________________________________________________00000000004C'. 
Scan count 1, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   **CPU time = 4625 ms,  elapsed time = 4721 ms.**

Alan (sorted) - serial
------------------------------------------------------------------------------------------------------------------------------------------------------
Table 'Worktable'. **Scan count 20979, logical reads 563853**, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#strings____________________________________________________________________________________________________________00000000004C'. 
Scan count 1, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   **CPU time = 1782 ms,  elapsed time = 1790 ms.**

Alan (sorted) - parallel
------------------------------------------------------------------------------------------------------------------------------------------------------
Table '#strings____________________________________________________________________________________________________________00000000004C'. 
Scan count 9, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. **Scan count 20979, logical reads 563860**, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   **CPU time = 3762 ms,  elapsed time = 992 ms.**

Alan (unsorted) - serial
------------------------------------------------------------------------------------------------------------------------------------------------------
Table '#strings____________________________________________________________________________________________________________00000000004C'. 
Scan count 9, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 **SQL Server Execution Times:
 CPU time = 219 ms,  elapsed time = 217 ms.

Alan (unsorted) - parallel
------------------------------------------------------------------------------------------------------------------------------------------------------
Table '#strings____________________________________________________________________________________________________________00000000004C'. 
Scan count 9, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

**SQL Server Execution Times:
  CPU time = 393 ms,  elapsed time = 101 ms.**

那里有.根据您的需求和使用的CPU数量,表表解决方案的完成速度比递归CTE快4-40倍,而读取次数却很少.

There you have it. Depending on what you need and how many CPUs you're using, the tally table solution got it done 4-40 times faster than the recursive CTE and with only a fraction of the reads.

这篇关于SQL字符串操作,查找所有排列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆