Trim trailing spaces with PostgreSQL
I have a column eventDate which contains trailing spaces. I am trying to remove them with the PostgreSQL function TRIM(). More specifically, I am running:
SELECT TRIM(both ' ' from eventDate)
FROM EventDates;
However, the trailing spaces don't go away. Furthermore, when I try and trim another character from the date (such as a number), it doesn't trim either. If I'm reading the manual correctly this should work. Any thoughts?
There are many different invisible characters. Many of them have the property WSpace=Y ("whitespace") in Unicode. But some special characters are not considered "whitespace" and still have no visible representation. The excellent Wikipedia articles about space (punctuation) and whitespace characters should give you an idea.
<rant>Unicode sucks in this regard: introducing lots of exotic characters that mainly serve to confuse people.</rant>
The standard SQL trim() function by default only trims the basic Latin space character (Unicode: U+0020 / ASCII 32). Same with the rtrim() and ltrim() variants. Your call also only targets that particular character.
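A quick way to check whether this is what is happening is to compare string lengths before and after trim() and to look at the code point of the last character. The following is only a sketch with made-up sample values (not data from the question), and it assumes a UTF8 database, where chr(160) yields a no-break space:

```sql
-- Hypothetical sample values: chr(9) is a tab, chr(160) a no-break space (UTF8 assumed).
SELECT label
     , length(val)                       AS len_before
     , length(trim(both ' ' from val))   AS len_after_trim
     , ascii(right(val, 1))              AS last_codepoint
FROM  (VALUES
         ('plain spaces',   '2020-01-01  ')
       , ('tab',            '2020-01-01' || chr(9))
       , ('no-break space', '2020-01-01' || chr(160))
      ) AS t(label, val);

-- plain spaces   | 12 | 10 |  32   -- the two U+0020 characters are trimmed
-- tab            | 11 | 11 |   9   -- untouched by trim()
-- no-break space | 11 | 11 | 160   -- untouched by trim()
```

If last_codepoint is anything other than 32 for your rows, trim() with its default character will not help.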
Use regular expressions with regexp_replace() instead.
Trailing
To remove all trailing white space (but not white space inside the string):
SELECT regexp_replace(eventdate, '\s+$', '') FROM eventdates;
The regular expression explained:
\s .. regular expression class shorthand for [[:space:]], which is the set of white-space characters (see Limitations below)
+ .. 1 or more consecutive matches
$ .. end of string
Demo:
SELECT regexp_replace('inner white ', '\s+$', '') || '|'
Returns:
inner white|
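Yes, that's a single backslash (\): with standard_conforming_strings = on (the default since Postgres 9.1), a backslash in a plain string literal reaches the regex engine as is. Details in this related answer.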
Leading
To remove all leading white space (but not white space inside the string):
regexp_replace(eventdate, '^\s+', '')
^ .. start of string
Both
To remove both, you can chain above function calls:
regexp_replace(regexp_replace(eventdate, '^\s+', ''), '\s+$', '')
Or you can combine both in a single call with two branches. Add 'g' as 4th parameter to replace all matches, not just the first:
regexp_replace(eventdate, '^\s+|\s+$', '', 'g')
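To illustrate why the 'g' flag matters here, this small sketch runs the same pattern with and without it on a made-up sample value. The E'' prefix is used only to embed a tab and a newline in the sample; the pattern stays a plain string literal:

```sql
-- Sample value: space, tab, space, 'inner white' (11 chars), space, newline = 16 chars.
SELECT length(regexp_replace(E' \t inner white \n', '^\s+|\s+$', ''))      AS len_first_match_only
     , length(regexp_replace(E' \t inner white \n', '^\s+|\s+$', '', 'g')) AS len_all_matches;

-- len_first_match_only = 13   (only the leading branch was replaced)
-- len_all_matches      = 11   (leading and trailing white space removed)
```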
But that should typically be faster with substring():
substring(eventdate, '\S(?:.*\S)*')
\S .. everything but white space
(?:re) .. non-capturing set of parentheses
.* .. any string of 0-n characters
Or one of these:
substring(eventdate, '^\s*(.*\S)')
substring(eventdate, '(\S.*\S)')
(re) .. capturing set of parentheses
Effectively takes the first non-whitespace character and everything up to the last non-whitespace character if available.
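The "if available" part deserves a closer look: substring() returns NULL when the pattern does not match, and the three variants differ for very short strings. A sketch with made-up inputs, using quote_nullable() only to make NULL and surrounding spaces visible:

```sql
SELECT quote_nullable(v)                              AS input
     , quote_nullable(substring(v, '\S(?:.*\S)*'))    AS variant_1
     , quote_nullable(substring(v, '^\s*(.*\S)'))     AS variant_2
     , quote_nullable(substring(v, '(\S.*\S)'))       AS variant_3
FROM  (VALUES ('  inner white  '), (' x '), ('   '), ('')) AS t(v);

-- '  inner white  ' | 'inner white' | 'inner white' | 'inner white'
-- ' x '             | 'x'           | 'x'           | NULL   -- needs at least two non-whitespace characters
-- '   '             | NULL          | NULL          | NULL   -- nothing but white space: no match
-- ''                | NULL          | NULL          | NULL
```

regexp_replace() returns an empty string for the last two inputs instead. If you prefer that with substring(), wrapping the call in COALESCE(..., '') is one option.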
Whitespace?
There are a few more related characters which are not classified as "whitespace" in Unicode - so not contained in the character class [[:space:]].
These print as invisible glyphs in pgAdmin for me: "Mongolian vowel separator", "zero width space", "zero width non-joiner", "zero width joiner":
SELECT E'\u180e', E'\u200B', E'\u200C', E'\u200D';
'' | '' | '' | ''
Two more, printing as visible glyphs in pgAdmin, but invisible in my browser: "word joiner", "zero width non-breaking space":
SELECT E'\u2060', E'\uFEFF';
'' | ''
Ultimately, whether characters are rendered invisible or not also depends on the font used for display.
To remove all of these as well, replace '\s' with '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]'
or '[\s]'
(note trailing invisible characters!).
Example, instead of:
regexp_replace(eventdate, '\s+$', '')
use:
regexp_replace(eventdate, '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]+$', '')
or:
regexp_replace(eventdate, '[\s]+$', '') -- note invisible characters
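A small sketch to compare both patterns on a made-up value that ends in a space, a zero width space and a zero width non-breaking space. It assumes a UTF8 database, since the \u escapes in the E'' literal need one on older versions:

```sql
-- Sample: '2020-01-01' (10 chars) + space + U+200B + U+FEFF = 13 characters.
SELECT length(s)                                                                     AS len_raw
     , length(regexp_replace(s, '\s+$', ''))                                         AS len_plain_class
     , length(regexp_replace(s, '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]+$', ''))   AS len_extended_class
FROM  (SELECT E'2020-01-01 \u200B\uFEFF' AS s) AS t;

-- len_raw            = 13
-- len_extended_class = 10
-- len_plain_class stays at 13 if your setup does not classify U+FEFF as white space
```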
Limitations
There is also the POSIX character class [[:graph:]], supposed to represent "visible characters". Example:
substring(eventdate, '([[:graph:]].*[[:graph:]])')
It works reliably for ASCII characters in every setup (where it boils down to [\x21-\x7E]), but beyond that you currently (incl. pg 10) depend on information provided by the underlying OS (to define ctype) and possibly locale settings.
Strictly speaking, that's the case for every reference to a character class, but there seems to be more disagreement with the less commonly used ones like graph. But you may have to add more characters to the character class [[:space:]] (shorthand \s) to catch all whitespace characters. Like: \u2007, \u202f and \u00a0 seem to also be missing for @XiCoN JFS.
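Since the result depends on your installation, it may be easiest to just test. This is a diagnostic sketch, assuming a UTF8 database; the list of code points is simply the set of characters mentioned in this answer, given in decimal for chr():

```sql
-- 32 = U+0020, 9 = tab, 160 = U+00A0, 8199 = U+2007, 8239 = U+202F,
-- 6158 = U+180E, 8203 = U+200B, 8204 = U+200C, 8205 = U+200D,
-- 8288 = U+2060, 65279 = U+FEFF
SELECT to_hex(cp)                    AS codepoint_hex
     , chr(cp) ~ '^\s$'              AS matched_by_s
     , chr(cp) ~ '^[[:graph:]]$'     AS matched_by_graph
FROM   unnest(ARRAY[32, 9, 160, 8199, 8239, 6158, 8203, 8204, 8205, 8288, 65279]) AS cp;
```

Only the first two rows (space and tab) should be matched by \s everywhere. Whatever comes back as false in matched_by_s for a character you need to strip has to be added to the bracket expression by hand.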
Within a bracket expression, the name of a character class enclosed in [: and :] stands for the list of all characters belonging to that class. Standard character class names are: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit. These stand for the character classes defined in ctype. A locale can provide others.
Bold emphasis mine.
Also note this limitation that was fixed with Postgres 10:
Fix regular expressions' character class handling for large character codes, particularly Unicode characters above U+7FF (Tom Lane)
Previously, such characters were never recognized as belonging to locale-dependent character classes such as [[:alpha:]].