Trim trailing spaces with PostgreSQL


Problem description

I have a column eventDate which contains trailing spaces. I am trying to remove them with the PostgreSQL function TRIM(). More specifically, I am running:

SELECT TRIM(both ' ' from eventDate) 
FROM EventDates;

However, the trailing spaces don't go away. Furthermore, when I try and trim another character from the date (such as a number), it doesn't trim either. If I'm reading the manual correctly this should work. Any thoughts?

Solution

There are many different invisible characters. Many of them have the property WSpace=Y ("whitespace") in Unicode. But some special characters are not considered "whitespace" and still have no visible representation. The excellent Wikipedia articles about space (punctuation) and whitespace characters should give you an idea.

<rant>Unicode sucks in this regard: introducing lots of exotic characters that mainly serve to confuse people.</rant>

The standard SQL trim() function by default only trims the basic Latin space character (Unicode: U+0020 / ASCII 32). Same with the rtrim() and ltrim() variants. Your call also only targets that particular character.
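To see this in action, and to find out which character is really sitting at the end of a value, you can look at its code point. This is only a sketch: the literal is a made-up sample with a trailing tab, and it assumes a UTF8 database, where ascii() returns the Unicode code point of the first character of its argument.

```sql
-- Hypothetical sample value: a date string followed by a space and a tab.
-- trim() only targets U+0020 here, and the last character is a tab (U+0009),
-- so nothing is removed and the length stays at 12.
SELECT length(trim(both ' ' from E'2014-01-01 \t')) AS len_after_trim,
       ascii(right(E'2014-01-01 \t', 1))            AS last_code_point;

-- The same check against the table from the question
-- (unquoted identifiers are folded to lower case):
SELECT eventdate, ascii(right(eventdate, 1)) AS last_code_point
FROM   eventdates
LIMIT  10;
```

If the last code point turns out to be anything other than 32, that would also explain why trimming a digit seemed to do nothing: the digit is not the last character of the string.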

Use regular expressions with regexp_replace() instead.

Trailing

To remove all trailing white space (but not white space inside the string):

SELECT regexp_replace(eventdate, '\s+$', '') FROM eventdates;

The regular expression explained:
\s .. regular expression class shorthand for [[:space:]]
    - which is the set of white-space characters - see limitations below
+ .. 1 or more consecutive matches
$ .. end of string

Demo:

SELECT regexp_replace('inner white   ', '\s+$', '') || '|'

Returns:

inner white|

Yes, that's a single backslash (\). Details in this related answer.
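If the goal is to clean up the stored values rather than just the query output, the same expression can go into an UPDATE. A minimal sketch, assuming eventdate is a text or varchar column in the eventdates table from the question; the WHERE clause is my addition, so that rows which are already clean are not rewritten.

```sql
-- Strip trailing white space in place.
UPDATE eventdates
SET    eventdate = regexp_replace(eventdate, '\s+$', '')
WHERE  eventdate ~ '\s+$';   -- only touch rows that actually end in white space
```

Running it inside a transaction first lets you inspect the result before committing.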

Leading

To remove all leading white space (but not white space inside the string):

regexp_replace(eventdate, '^\s+', '')

^ .. start of string

Both

To remove both, you can chain above function calls:

regexp_replace(regexp_replace(eventdate, '^\s+', ''), '\s+$', '')

Or you can combine both in a single call with two branches.
Add 'g' as 4th parameter to replace all matches, not just the first:

regexp_replace(eventdate, '^\s+|\s+$', '', 'g')
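A small demo of what the 'g' flag changes, using a made-up literal with white space on both sides:

```sql
-- Without 'g' only the first match (the leading white space) is replaced.
SELECT '|' || regexp_replace('  inner white  ', '^\s+|\s+$', '')      || '|' AS first_only,
       '|' || regexp_replace('  inner white  ', '^\s+|\s+$', '', 'g') || '|' AS all_matches;
-- first_only:  |inner white  |
-- all_matches: |inner white|
```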

But that should typically be faster with substring():

substring(eventdate, '\S(?:.*\S)*')

\S .. everything but white space
(?:re) .. non-capturing set of parentheses
.* .. any string of 0-n characters

Or one of these:

substring(eventdate, '^\s*(.*\S)')
substring(eventdate, '(\S.*\S)')

(re) .. Capturing set of parentheses

Effectively takes the first non-whitespace character and everything up to the last non-whitespace character if available.
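The three substring() patterns are not quite interchangeable at the edges. The following sketch runs them side by side over a few made-up inputs; the column aliases are only labels I picked for illustration.

```sql
SELECT s                           AS input,
       substring(s, '\S(?:.*\S)*') AS non_capturing,
       substring(s, '^\s*(.*\S)')  AS anchored,
       substring(s, '(\S.*\S)')    AS two_chars_needed
FROM  (VALUES ('  inner white  '), (' a '), ('   ')) AS t(s);
```

All three return 'inner white' for the first row. For ' a ' the last pattern returns NULL, because it needs at least two non-whitespace characters. For the all-whitespace string every pattern returns NULL, whereas regexp_replace() would return an empty string, so wrap the call in coalesce() if that difference matters to you.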

Whitespace?

There are a few more related characters which are not classified as "whitespace" in Unicode - so not contained in the character class [[:space:]].

These print as invisible glyphs in pgAdmin for me: "Mongolian vowel separator", "zero width space", "zero width non-joiner", "zero width joiner":

SELECT E'\u180e', E'\u200B', E'\u200C', E'\u200D';

'᠎' | '​' | '‌' | '‍'

Two more that print as visible glyphs in pgAdmin, but are invisible in my browser: "word joiner", "zero width no-break space":

SELECT E'\u2060', E'\uFEFF';
'⁠' | ''

Ultimately, whether characters are rendered invisible or not also depends on the font used for display.

To remove all of these as well, replace '\s' with '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]' or '[\s᠎​‌‍⁠]' (note trailing invisible characters!).
Example, instead of:

regexp_replace(eventdate, '\s+$', '')

use:

regexp_replace(eventdate, '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]+$', '')

or:

regexp_replace(eventdate, '[\s᠎​‌‍⁠]+$', '')  -- note invisible characters
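To check the extended class without touching real data, you can build a test string from escapes. A sketch, assuming a UTF8 database (needed for the \u escapes in the E'' literal); the sample string is made up.

```sql
-- Sample string: 'abc' followed by a zero width space and a word joiner.
SELECT length(regexp_replace(E'abc\u200B\u2060', '\s+$', '')) AS plain_class,
       length(regexp_replace(E'abc\u200B\u2060',
              '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]+$', '')) AS extended_class;
```

extended_class should come out as 3, since both trailing characters are listed explicitly. What plain_class returns depends on how your setup classifies these characters.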

Limitations

There is also the POSIX character class [[:graph:]], which is supposed to represent "visible characters". Example:

substring(eventdate, '([[:graph:]].*[[:graph:]])')

It works reliably for ASCII characters in every setup (where it boils down to [\x21-\x7E]), but beyond that you currently (incl. Postgres 10) depend on information provided by the underlying OS (to define ctype) and possibly on locale settings.

Strictly speaking, that's the case for every reference to a character class, but there seems to be more disagreement with the less commonly used ones like graph. Still, you may have to add more characters to the character class [[:space:]] (shorthand \s) to catch all whitespace characters. For example, \u2007, \u202f and \u00a0 also seem to be missing in the setup of @XiCoN JFS.
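You can test how your own installation classifies these three characters (no-break space, figure space, narrow no-break space) and, if needed, list them explicitly. A sketch, again assuming a UTF8 database; the column aliases are just labels.

```sql
-- true or false depends on the ctype of the database and the Postgres version
SELECT E'\u00a0' ~ '\s' AS nbsp_is_space,
       E'\u2007' ~ '\s' AS figure_space_is_space,
       E'\u202f' ~ '\s' AS narrow_nbsp_is_space;

-- Trailing pattern with the three characters added explicitly:
SELECT regexp_replace(eventdate, '[\s\u00a0\u2007\u202f]+$', '')
FROM   eventdates;
```

Listing the characters by code point makes the result independent of what the OS reports for [[:space:]].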

The manual:

Within a bracket expression, the name of a character class enclosed in [: and :] stands for the list of all characters belonging to that class. Standard character class names are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit. These stand for the character classes defined in ctype. A locale can provide others.

Bold emphasis mine.

Also note this limitation that was fixed with Postgres 10:

Fix regular expressions' character class handling for large character codes, particularly Unicode characters above U+7FF (Tom Lane)

Previously, such characters were never recognized as belonging to locale-dependent character classes such as [[:alpha:]].
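To find out which of these situations applies to you, check the server version and the ctype of the current database, then probe a character above U+7FF. A sketch, assuming a UTF8 database; the Thai letter U+0E01 is just an arbitrary sample I picked.

```sql
SELECT version();

SELECT datcollate, datctype
FROM   pg_database
WHERE  datname = current_database();

-- The result depends on the Postgres version and the ctype in use.
SELECT E'\u0E01' ~ '[[:alpha:]]' AS thai_letter_is_alpha;
```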
