混合字数串的人性化或自然数排序 [英] Humanized or natural number sorting of mixed word-and-number strings

查看:18
本文介绍了混合字数串的人性化或自然数排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

跟进这个问题Sivaram Chintalapudi,我感兴趣的是在 PostgreSQL 中做 自然 - 或人性化" - 对包含多位数字和单词/字母混合的字符串进行排序.没有固定的模式字符串中的单词和数字,一个字符串中可能有多个多位数字.

Following up on this question by Sivaram Chintalapudi, I'm interested in whether it's practical in PostgreSQL to do natural - or "humanized" - sorting " of strings that contain a mixture of multi-digit numbers and words/letters. There is no fixed pattern of words and numbers in the strings, and there may be more than one multi-digit number in a string.

我见过的唯一一个经常这样做的地方是在 Mac OS 的 Finder 中,它自然地对包含混合数字和单词的文件名进行排序,将20"放在3"之后,而不是之前.

The only place I've seen this done routinely is in the Mac OS's Finder, which sorts filenames containing mixed numbers and words naturally, placing "20" after "3", not before it.

所需的整理顺序将由一种算法生成,该算法将每个字符串在字母数字边界处分成块,然后对每个部分进行排序,将具有正常整理的字母块和数字块视为整数以进行整理.所以:

The collation order desired would be produced by an algorithm that split each string into blocks at letter-number boundaries, then ordered each part, treating letter-blocks with normal collation and number-blocks as integers for collation purposes. So:

'AAA2fred' 会变成 ('AAA',2,'fred')'AAA10bob' 会变成 ('AAA',10,'bob').然后可以根据需要对它们进行排序:

'AAA2fred' would become ('AAA',2,'fred') and 'AAA10bob' would become ('AAA',10,'bob'). These can then be sorted as desired:

regress=# WITH dat AS ( VALUES ('AAA',2,'fred'), ('AAA',10,'bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
     dat      
--------------
 (AAA,2,fred)
 (AAA,10,bob)
(2 rows)

与通常的字符串整理顺序相比:

as compared to the usual string collation ordering:

regress=# WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
    dat     
------------
 (AAA10bob)
 (AAA2fred)
(2 rows)

然而,记录比较方法不能推广,因为 Pg 不会比较 ROW(..) 构造或条目数不等的记录.

However, the record comparison approach doesn't generalize because Pg won't compare ROW(..) constructs or records of unequal numbers of entries.

鉴于示例数据在此 SQLFiddle 中,默认的 en_AU.UTF-8 归类会产生顺序:

Given the sample data in this SQLFiddle the default en_AU.UTF-8 collation produces the ordering:

1A, 10A, 2A, AAA10B, AAA11B, AAA1BB, AAA20B, AAA21B, X10C10, X10C2, X1C1, X1C10, X1C3, X1C30, X1C4, X2C1

但我想要:

1A, 2A, 10A, AAA1BB, AAA10B, AAA11B, AAA20B, AAA21B, X1C1, X1C3, X1C4, X1C10, X1C30, X2C1, X10C10, X10C2

我目前正在使用 PostgreSQL 9.1,但只有 9.2 的建议会很好.我对如何实现有效的字符串拆分方法以及如何在所描述的交替字符串然后数字排序规则中比较结果拆分数据的建议感兴趣.或者,当然,使用不需要拆分字符串的完全不同且更好的方法.

I'm working with PostgreSQL 9.1 at the moment, but 9.2-only suggestions would be fine. I'm interested in advice on how to achieve an efficient string-splitting method, and how to then compare the resulting split data in the alternating string-then-number collation described. Or, of course, on entirely different and better approaches that don't require splitting strings.

PostgreSQL 似乎不支持比较器函数,否则这可以通过递归比较器和诸如 ORDER USING comparer_fncomparator(text,text)功能.唉,那个语法是虚构的.

PostgreSQL doesn't seem to support comparator functions, otherwise this could be done fairly easily with a recursive comparator and something like ORDER USING comparator_fn and a comparator(text,text) function. Alas, that syntax is imaginary.

更新: 关于该主题的博文.

推荐答案

以您的测试数据为基础,但这适用于任意数据.这适用于字符串中的任意数量的元素.

Building on your test data, but this works with arbitrary data. This works with any number of elements in the string.

为每个数据库注册一个由一个 text 和一个 integer 值组成的复合类型.我称之为ai:

Register a composite type made up of one text and one integer value once per database. I call it ai:

CREATE TYPE ai AS (a text, i int);

诀窍是从列中的每个值形成一个 ai 数组.

The trick is to form an array of ai from each value in the column.

regexp_matches() 使用模式 (D*)(d*)g 选项为每个组合返回一行的字母和数字.加上一个带有两个空字符串的不相关悬空行 '{"",""}' 过滤或抑制它只会增加成本.在将 integer 组件(如 '')中的空字符串('')替换为 0 后,将其聚合到一个数组中code> 不能转换为 integer).

regexp_matches() with the pattern (D*)(d*) and the g option returns one row for every combination of letters and numbers. Plus one irrelevant dangling row with two empty strings '{"",""}' Filtering or suppressing it would just add cost. Aggregate this into an array, after replacing empty strings ('') with 0 in the integer component (as '' cannot be cast to integer).

NULL 值首先排序 - 或者你必须对它们进行特殊处理 - 或者像@Craig 建议的那样在 STRICT 函数中使用整个 shebang.

NULL values sort first - or you have to special case them - or use the whole shebang in a STRICT function like @Craig proposes.

SELECT data
FROM   alnum
ORDER  BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
                FROM regexp_matches(data, '(D*)(d*)', 'g') x)
        , data;

db<>fiddle 这里

使用 PostgreSQL 9.1.5 进行测试,其中 regexp_replace() 的行为略有不同.

Tested with PostgreSQL 9.1.5, where regexp_replace() had a slightly different behavior.

SELECT data
FROM  (
    SELECT ctid, data, regexp_matches(data, '(D*)(d*)', 'g') AS x
    FROM   alnum
    ) x
GROUP  BY ctid, data   -- ctid as stand-in for a missing pk
ORDER  BY regexp_replace (left(data, 1), '[0-9]', '0')
        , array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
        , data         -- for special case of trailing 0

添加 regexp_replace (left(data, 1), '[1-9]', '0') 作为第一个 ORDER BY 项来处理前导数字和空字符串.

Add regexp_replace (left(data, 1), '[1-9]', '0') as first ORDER BY item to take care of leading digits and empty strings.

如果像 {}()"', 这样的特殊字符可能出现,你必须相应地转义它们.
@Craig 建议使用 ROW 表达式来解决这个问题.

If special characters like {}()"', can occur, you'd have to escape those accordingly.
@Craig's suggestion to use a ROW expression takes care of that.

<打击>顺便说一句,这不会在 sqlfiddle 中执行,但它会在我的数据库集群中执行.JDBC 达不到它的要求.sqlfiddle 抱怨:

BTW, this won't execute in sqlfiddle, but it does in my db cluster. JDBC is not up to it. sqlfiddle complains:

方法 org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) 是尚未实现.

Method org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) is not yet implemented.

此后已修复:http://sqlfiddle.com/#!17/fad6e/1

这篇关于混合字数串的人性化或自然数排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆