人字或数字混合字符串的人性化或自然数排序 [英] Humanized or natural number sorting of mixed word-and-number strings

查看:122
本文介绍了人字或数字混合字符串的人性化或自然数排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

按照)包含多位数字和单词/字母的混合字符串.字符串中的单词和数字,字符串中可能有多个多位数字.

我唯一看到的常规操作是在Mac OS的Finder中,它对包含混合数字和单词的文件名进行自然排序,将"20"放在"3"之后,而不是在它之前.

所需的排序顺序将由一种算法产生,该算法将每个字符串分成字母数字边界处的块,然后对每个部分进行排序,将具有正常排序规则的字母块和数字块视为用于排序目的的整数.所以:

'AAA2fred'将变为('AAA',2,'fred'),而'AAA10bob'将变为('AAA',10,'bob').然后可以根据需要对它们进行排序:

regress=# WITH dat AS ( VALUES ('AAA',2,'fred'), ('AAA',10,'bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
     dat      
--------------
 (AAA,2,fred)
 (AAA,10,bob)
(2 rows)

与通常的字符串排序规则相比:

regress=# WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
    dat     
------------
 (AAA10bob)
 (AAA2fred)
(2 rows)

但是,记录比较方法不能一概而论,因为Pg不会比较ROW(..)构造或条目数量不相等的记录.

给出示例数据在此SQLFiddle中,默认的en_AU.UTF-8归类产生顺序:

1A, 10A, 2A, AAA10B, AAA11B, AAA1BB, AAA20B, AAA21B, X10C10, X10C2, X1C1, X1C10, X1C3, X1C30, X1C4, X2C1

但我想要:

1A, 2A, 10A, AAA1BB, AAA10B, AAA11B, AAA20B, AAA21B, X1C1, X1C3, X1C4, X1C10, X1C30, X2C1, X10C10, X10C2

目前我正在使用PostgreSQL 9.1,但是仅9.2的建议会很好.我对如何实现有效的字符串拆分方法以及如何在上述交替的字符串-然后-数字排序规则中比较生成的拆分数据的建议感兴趣.或者,当然,在不需要分割字符串的完全不同且更好的方法上.

PostgreSQL似乎不支持比较器功能,否则可以使用递归比较器以及类似ORDER USING comparator_fncomparator(text,text)的功能相当容易地完成. syntax,这种语法是虚构的.

更新:解决方案

以测试数据为基础,但这适用于任意数据.这适用于字符串中任意数量的元素.

每个数据库一次注册一个由一个text和一个integer值组成的复合类型.我称它为ai:

CREATE TYPE ai AS (a text, i int);

诀窍是从列中的每个值组成一个ai的数组.

具有模式(\D*)(\d*)g选项的

regexp_matches()对于字母和数字的每种组合都返回一行.再加上一个不相关的悬挂行,其中包含两个空字符串'{"",""}'过滤或抑制它只会增加成本.将空字符串('')用integer组件中的0替换后,将其聚合到数组中(因为''无法转换为integer).

NULL值首先排序-或者您必须对它们进行特殊处理-或在整个@craig建议的STRICT函数中使用全部shebang.

Postgres 9.4或更高版本

SELECT data
FROM   alnum
ORDER  BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
                FROM regexp_matches(data, '(\D*)(\d*)', 'g') x)
        , data;

db<>小提琴此处

Postgres 9.1(原始答案)

在PostgreSQL 9.1.5上进行了测试,其中regexp_replace()的行为略有不同.

SELECT data
FROM  (
    SELECT ctid, data, regexp_matches(data, '(\D*)(\d*)', 'g') AS x
    FROM   alnum
    ) x
GROUP  BY ctid, data   -- ctid as stand-in for a missing pk
ORDER  BY regexp_replace (left(data, 1), '[0-9]', '0')
        , array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
        , data         -- for special case of trailing 0

添加regexp_replace (left(data, 1), '[1-9]', '0')作为第一个ORDER BY项,以照顾到前导数字和空字符串.

如果可能出现诸如{}()"',之类的特殊字符,则必须相应地转义这些字符.
@Craig建议使用ROW表达式可以解决这个问题.

<罢工> 顺便说一句,这不会在sqlfiddle中执行,但是会在我的数据库集群中执行. JDBC无法做到这一点. sqlfiddle抱怨:

方法org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map)是 尚未实现.

此后已解决: http://sqlfiddle.com/#!17/fad6e /1

Following up on this question by Sivaram Chintalapudi, I'm interested in whether it's practical in PostgreSQL to do natural - or "humanized" - sorting " of strings that contain a mixture of multi-digit numbers and words/letters. There is no fixed pattern of words and numbers in the strings, and there may be more than one multi-digit number in a string.

The only place I've seen this done routinely is in the Mac OS's Finder, which sorts filenames containing mixed numbers and words naturally, placing "20" after "3", not before it.

The collation order desired would be produced by an algorithm that split each string into blocks at letter-number boundaries, then ordered each part, treating letter-blocks with normal collation and number-blocks as integers for collation purposes. So:

'AAA2fred' would become ('AAA',2,'fred') and 'AAA10bob' would become ('AAA',10,'bob'). These can then be sorted as desired:

regress=# WITH dat AS ( VALUES ('AAA',2,'fred'), ('AAA',10,'bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
     dat      
--------------
 (AAA,2,fred)
 (AAA,10,bob)
(2 rows)

as compared to the usual string collation ordering:

regress=# WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
    dat     
------------
 (AAA10bob)
 (AAA2fred)
(2 rows)

However, the record comparison approach doesn't generalize because Pg won't compare ROW(..) constructs or records of unequal numbers of entries.

Given the sample data in this SQLFiddle the default en_AU.UTF-8 collation produces the ordering:

1A, 10A, 2A, AAA10B, AAA11B, AAA1BB, AAA20B, AAA21B, X10C10, X10C2, X1C1, X1C10, X1C3, X1C30, X1C4, X2C1

but I want:

1A, 2A, 10A, AAA1BB, AAA10B, AAA11B, AAA20B, AAA21B, X1C1, X1C3, X1C4, X1C10, X1C30, X2C1, X10C10, X10C2

I'm working with PostgreSQL 9.1 at the moment, but 9.2-only suggestions would be fine. I'm interested in advice on how to achieve an efficient string-splitting method, and how to then compare the resulting split data in the alternating string-then-number collation described. Or, of course, on entirely different and better approaches that don't require splitting strings.

PostgreSQL doesn't seem to support comparator functions, otherwise this could be done fairly easily with a recursive comparator and something like ORDER USING comparator_fn and a comparator(text,text) function. Alas, that syntax is imaginary.

Update: Blog post on the topic.

解决方案

Building on your test data, but this works with arbitrary data. This works with any number of elements in the string.

Register a composite type made up of one text and one integer value once per database. I call it ai:

CREATE TYPE ai AS (a text, i int);

The trick is to form an array of ai from each value in the column.

regexp_matches() with the pattern (\D*)(\d*) and the g option returns one row for every combination of letters and numbers. Plus one irrelevant dangling row with two empty strings '{"",""}' Filtering or suppressing it would just add cost. Aggregate this into an array, after replacing empty strings ('') with 0 in the integer component (as '' cannot be cast to integer).

NULL values sort first - or you have to special case them - or use the whole shebang in a STRICT function like @Craig proposes.

Postgres 9.4 or later

SELECT data
FROM   alnum
ORDER  BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
                FROM regexp_matches(data, '(\D*)(\d*)', 'g') x)
        , data;

db<>fiddle here

Postgres 9.1 (original answer)

Tested with PostgreSQL 9.1.5, where regexp_replace() had a slightly different behavior.

SELECT data
FROM  (
    SELECT ctid, data, regexp_matches(data, '(\D*)(\d*)', 'g') AS x
    FROM   alnum
    ) x
GROUP  BY ctid, data   -- ctid as stand-in for a missing pk
ORDER  BY regexp_replace (left(data, 1), '[0-9]', '0')
        , array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
        , data         -- for special case of trailing 0

Add regexp_replace (left(data, 1), '[1-9]', '0') as first ORDER BY item to take care of leading digits and empty strings.

If special characters like {}()"', can occur, you'd have to escape those accordingly.
@Craig's suggestion to use a ROW expression takes care of that.

BTW, this won't execute in sqlfiddle, but it does in my db cluster. JDBC is not up to it. sqlfiddle complains:

Method org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) is not yet implemented.

This has since been fixed: http://sqlfiddle.com/#!17/fad6e/1

这篇关于人字或数字混合字符串的人性化或自然数排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆