Unicode regexp to match line-breaks?
Question
I have a form from which I want to submit data to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this:
~^[\p{L}\p{M}\p{N} ]+$~u
This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class, but with no success. I also tried "s", but it didn't work.
Any help is much appreciated. Thanks!
Recommended answer
A Unicode linebreak is either a carriage return immediately followed by a line feed, or else it is any character with the vertical whitespace property.
But it looks like you’re trying to match generic whitespace there. In Java, that would be
[\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]
which can be shortened by using ranges to "only" this:
[\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
to include both horizontal whitespace (\h) and vertical whitespace (\v), which may or may not be the same as general whitespace (\s).
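As a rough illustration, the shortened class above could be used from Java along these lines. The class name `UnicodeWhitespace` and the helper `isAllWhitespace` are hypothetical names chosen for this sketch, and the only thing taken from the answer is the character class itself. Note that the class as written does not list U+0009 (tab), so a lone tab is not matched by it.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
final class UnicodeWhitespace {
    // The shortened (range) form of the class from the answer. Each backslash
    // is doubled so that the regex engine, not the Java compiler, interprets
    // the \\uXXXX escapes.
    static final String WS_CLASS =
        "[\\u000A-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";

    private static final Pattern ALL_WS = Pattern.compile(WS_CLASS + "+");

    // True when the input is non-empty and every character is in the class.
    static boolean isAllWhitespace(String s) {
        return ALL_WS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Space, line feed, no-break space, ideographic space.
        System.out.println(isAllWhitespace(" \n\u00A0\u3000"));
        // Contains letters, so the whole-string match fails.
        System.out.println(isAllWhitespace("a b"));
    }
}
```

Keeping the class in a single string constant makes it easy to embed in larger patterns by concatenation.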
It also looks like you’re trying to match alphanumerics.
- Alphabetics alone are usually [\pL\pM\p{Nl}].
- Numerics are less often all of \pN than they are either just \p{Nd} or else sometimes [\p{Nd}\p{Nl}].
- Identifier characters need connector punctuation and a bit more, so [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]], if your regex engine supports those sorts of operations (Java's does). That's what \w works out to in Unicode-aware regex languages (of which Java is not one).
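The three classes in the list above can be tried out in Java with a small sketch like the following. The class name `UnicodeClasses` and the `matches` helper are assumptions made for illustration. The patterns are the ones from the list, with backslashes doubled for Java string literals and a `+` appended so that whole strings can be tested.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative, not from the answer.
final class UnicodeClasses {
    // Letters, combining marks, and letter numbers (such as Roman numerals).
    static final Pattern ALPHABETIC = Pattern.compile("[\\pL\\pM\\p{Nl}]+");

    // Decimal digits only, in any script.
    static final Pattern DECIMAL_DIGITS = Pattern.compile("\\p{Nd}+");

    // Every numeric character, including fractions and other "No" numbers.
    static final Pattern ANY_NUMBER = Pattern.compile("\\pN+");

    // Identifier characters: the nested class uses Java's && intersection to
    // pick only the "So" symbols inside the Enclosed Alphanumerics block.
    static final Pattern IDENTIFIER = Pattern.compile(
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}"
        + "[\\p{InEnclosedAlphanumerics}&&\\p{So}]]+");

    // Whole-string match against one of the patterns above.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(IDENTIFIER, "na\u00EFve_name1"));
        System.out.println(matches(DECIMAL_DIGITS, "\u00BD")); // vulgar fraction one half
    }
}
```

The split between `DECIMAL_DIGITS` and `ANY_NUMBER` mirrors the point made in the list: \pN accepts characters such as fractions that are usually not wanted when validating digits.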
In older versions of Perl, you would likely write a linebreak as
(?:\r\n|\p{VertSpace})
although that is now better written as
(?:(?>\r\n)|\v)
which is exactly what \R matches.
Java is very clumsy at these things. There you must write a linebreak as
(?:(?>\u000D\u000A)|[\u000A-\u000D\u0085\u2028\u2029])
which of course requires extra bbaacckkssllasshheess when written as a string.
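To make the doubled backslashes concrete, here is a minimal sketch of the Java linebreak pattern as a string literal. The class `LineBreaks` and its methods `splitLines` and `isValid` are hypothetical names for this example. The `isValid` pattern is only an assumption about what the question wants: letters, marks, numbers, spaces, and now linebreaks.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class LineBreaks {
    // The linebreak pattern from the answer, with every backslash doubled
    // because it is written as a Java string literal. CRLF is tried first
    // (atomically), so it counts as one linebreak rather than two.
    static final String LINEBREAK =
        "(?:(?>\\u000D\\u000A)|[\\u000A-\\u000D\\u0085\\u2028\\u2029])";

    private static final Pattern BREAK = Pattern.compile(LINEBREAK);

    // Assumed intent of the question: the original class plus linebreaks.
    private static final Pattern VALID =
        Pattern.compile("(?:[\\p{L}\\p{M}\\p{N} ]|" + LINEBREAK + ")+");

    // Splits on any Unicode linebreak, keeping empty trailing lines.
    static String[] splitLines(String text) {
        return BREAK.split(text, -1);
    }

    // True when the whole input consists of allowed characters and linebreaks.
    static boolean isValid(String text) {
        return VALID.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\nc").length);
        System.out.println(isValid("line one\r\nline two"));
    }
}
```

Building the validation pattern by concatenating `LINEBREAK` avoids repeating the escaped sequence in more than one place.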
I give the other Java equivalences for the 14 common character-class regex escapes, rewritten so that they work with Unicode, in this answer. You may have to use those in other Java-like regex languages that aren't sufficiently Unicode-aware.