Unicode regexp to match line-breaks?

Question

I have a form from which I want to submit data to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this:

~^[\p{L}\p{M}\p{N} ]+$~u

This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class, but with no success. I also tried "s", but it didn't work.

Any help is much appreciated. Thanks!

Recommended answer

A Unicode linebreak is either a carriage return immediately followed by a line feed, or else it is any character with the vertical whitespace property.

But it looks like you're trying to match generic whitespace there. In Java, that would be

 [\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]

which can be shortened by using ranges to "only" this:

 [\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

to include both horizontal whitespace (\h) and vertical whitespace (\v), which may or may not be the same as general whitespace (\s).
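As a minimal sketch of how that class could be used from Java, the block below compiles the ranged form and also folds it into the class from the question so that line breaks pass validation. The names `UnicodeWhitespaceDemo`, `SPACE_MEMBERS`, `isUnicodeSpace` and `isValidFormText` are made up for illustration, and swapping the question's literal space for the whole whitespace list is an assumption about what the form should accept.

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceDemo {
    // Members of the ranged whitespace class from the answer. Backslashes are
    // doubled so that the regex engine, not the Java compiler, reads the escapes.
    static final String SPACE_MEMBERS =
        "\\u000A-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000";

    // Exactly one whitespace character from that list.
    static final Pattern UNICODE_SPACE = Pattern.compile("[" + SPACE_MEMBERS + "]");

    // Assumption: the question's class, with its literal space replaced by the list above.
    static final Pattern FORM_TEXT =
        Pattern.compile("[\\p{L}\\p{M}\\p{N}" + SPACE_MEMBERS + "]+");

    static boolean isUnicodeSpace(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return UNICODE_SPACE.matcher(s).matches();
    }

    static boolean isValidFormText(String text) {
        // matches() requires the whole input to match, roughly like ^...$ in the question.
        return FORM_TEXT.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeSpace(0x00A0));
        System.out.println(isValidFormText("first line\r\nsecond line"));
        System.out.println(isValidFormText("no punctuation!"));
    }
}
```

Because CR and LF are both members of the class, a CR LF pair is accepted as two characters here, which is enough for validation even though it is not counted as a single linebreak.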

It also looks like you're trying to match alphanumerics.

  • Alphabetics alone are usually [\pL\pM\p{Nl}].
  • Numerics are less often all of \pN than they are either just \p{Nd} or else sometimes [\p{Nd}\p{Nl}].
  • Identifier characters need connector punctuation and a bit more, so [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]], if your regex engine supports those sorts of operations (Java's does). That's what \w works out to in Unicode-aware regex languages (of which Java is not one).
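The three classes in the list above can be tried out in Java roughly as follows. This is only a sketch: the class name `UnicodeClassesDemo`, the constant names and the helper `all` are invented for the example, and the trailing `+` on each pattern is added so that whole strings can be tested.

```java
import java.util.regex.Pattern;

class UnicodeClassesDemo {
    // Alphabetics: letters, marks, and letter numbers.
    static final Pattern ALPHABETIC = Pattern.compile("[\\pL\\pM\\p{Nl}]+");

    // Decimal digits only, versus every kind of number.
    static final Pattern DECIMAL_DIGITS = Pattern.compile("\\p{Nd}+");
    static final Pattern ANY_NUMBER = Pattern.compile("\\pN+");

    // Identifier characters; the nested class uses Java's && intersection.
    static final Pattern IDENTIFIER = Pattern.compile(
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]+");

    // True when the entire string consists of characters from the pattern's class.
    static boolean all(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A Latin word with a diacritic, an underscore, and an Arabic-Indic digit.
        System.out.println(all(IDENTIFIER, "na\u00EFve_\u0663"));
        // The vulgar fraction one half is a number, but not a decimal digit.
        System.out.println(all(DECIMAL_DIGITS, "\u00BD"));
        System.out.println(all(ANY_NUMBER, "\u00BD"));
    }
}
```

The intersection is what lets circled letters such as U+24B6 through as identifier characters while still rejecting circled digits such as U+2460, which sit in the same block but are not in category So.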

In older versions of Perl, you would likely write a linebreak as

 (?:\r\n|\p{VertSpace})

although nowadays it is better written as

 (?:(?>\r\n)|\v)

which is exactly what

 \R

matches.

Java is very clumsy at these things. There you must write a linebreak as

  (?:(?>\u000D\u000A)|[\u000A-\u000D\u0085\u2028\u2029])

which of course requires extra bbaacckkssllasshheess when written as a string.
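Written out as a Java string literal, the pattern above might look like the following sketch. The class name `LinebreakDemo` and the helpers `splitLines` and `countLinebreaks` are illustrative names only. On Java 8 and later, `java.util.regex` also accepts `\R` directly; the explicit form is spelled out here because it is the one the answer describes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinebreakDemo {
    // The answer's linebreak pattern as a Java string literal: every backslash
    // is doubled. CR LF is tried first, atomically, so it counts as one break.
    static final Pattern LINEBREAK = Pattern.compile(
        "(?:(?>\\u000D\\u000A)|[\\u000A-\\u000D\\u0085\\u2028\\u2029])");

    // Split on linebreaks; the -1 limit keeps trailing empty lines.
    static String[] splitLines(String text) {
        return LINEBREAK.split(text, -1);
    }

    // Count how many separate linebreaks occur in the text.
    static int countLinebreaks(String text) {
        Matcher m = LINEBREAK.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // NEL (U+0085) is built from its code point rather than typed as an escape.
        String nel = String.valueOf((char) 0x0085);
        String text = "one\r\ntwo\nthree" + nel + "four";
        for (String line : splitLines(text)) {
            System.out.println(line);
        }
    }
}
```

Trying CR LF before the single-character class is the important design choice: with the alternatives the other way round, a Windows line ending would be reported as two linebreaks.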

I give the other Java equivalences for the 14 common character-class regex escapes, rewritten so that they work with Unicode, in this answer. You may have to use those in other Java-like regex languages that aren't sufficiently Unicode-aware.
