Utf8 correct regex for CamelCase (WikiWord) in Perl

Question

Here was a question about the CamelCase regex. Combining it with tchrist's post, I'm wondering what the correct UTF-8 CamelCase regex is.

Starting with brian d foy's regex:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word 
/x

and modifying it to:

/
    \b          # start at word boundary
    \p{Uppercase_Letter}     # start with upper
    \p{Alphabetic}*          # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter}   ### next bit is lower, any zero or more, ending with upper
          |                  # or 
       \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter}   ### next bit is upper, any zero or more, ending with lower
    )

    \p{Alphabetic}*          # anything that's left
    \b          # end at word 
/x

I have a problem with the lines marked '###'.

In addition, how should the regex be modified if numbers and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?

Updated (after ysth's comment):

In the following,

  • any: means "uppercase or lowercase or number or underscore"

The regex should match CamelWord, CaW

  • starts with an uppercase letter
  • optional any
  • a lowercase letter or number or underscore
  • optional any
  • an uppercase letter
  • optional any
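
The steps listed above can be written down almost literally. The following is only a sketch of that reading, not a vetted answer: it assumes "number" means \p{Decimal_Number}, it tests a whole string with \A and \z instead of searching inside text, and is_camel_word is a hypothetical helper name.

```perl
use strict;
use warnings;

# "any" from the list above: uppercase, lowercase, number, or underscore.
# Assumption: "number" is read as \p{Decimal_Number}.
my $any = qr/[\p{Upper}\p{Lower}\p{Decimal_Number}_]/;

# Hypothetical helper: returns 1 if the whole string is a CamelCase word
# under the rules listed above, 0 otherwise.
sub is_camel_word {
    my ($word) = @_;
    return $word =~ m{
        \A
        \p{Upper}                        # starts with an uppercase code point
        $any*                            # optional any
        [\p{Lower}\p{Decimal_Number}_]   # lowercase, number, or underscore
        $any*                            # optional any
        \p{Upper}                        # an uppercase code point
        $any*                            # optional any
        \z
    }x ? 1 : 0;
}

for my $word ("CamelWord", "CaW", "W2X3", "Camel", "camelWord") {
    printf "%-10s %s\n", $word, is_camel_word($word) ? "match" : "no match";
}
```

Under these assumptions CamelWord, CaW and W2X3 match, while Camel (no second uppercase) and camelWord (lowercase start) do not.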

Please do not mark this as a duplicate, because it is not. The original question (and its answers) considered only ASCII.

Recommended answer

I really can’t tell what you’re trying to do, but this should be closer to what your original intent seems to have been. I still can’t tell what you mean to do with it, though.

m{
    \b
    \p{Upper}      #  start with uppercase code point (NOT LETTER)

    \w*            #  optional ident chars 

    # note that upper and lower are not related to letters
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )

    \w*

    \b
}x
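
To see what this pattern accepts, it can be run over a short piece of text. The snippet below is a small sketch: the pattern is copied from above, while find_camel_words and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Same pattern as above, stored as a compiled regex.
my $camel_case = qr{
    \b
    \p{Upper}      # start with an uppercase code point
    \w*            # optional identifier characters
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Hypothetical helper: return every CamelCase word found in a string.
sub find_camel_words {
    my ($text) = @_;
    my @words = $text =~ /($camel_case)/g;
    return @words;
}

my @found = find_camel_words("See CamelWord and CaW, but not Camel or W2X3.");
print join(", ", @found), "\n";   # prints "CamelWord, CaW"
```

Note that W2X3 is not matched here: digits are neither \p{Upper} nor \p{Lower}, so the alternation never finds the lowercase code point it needs. Treating digits and the underscore like lowercase would need a wider class in place of \p{Lower}.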

Never use [a-z]. And in fact, don’t use \p{Lowercase_Letter} or \p{Ll}, since those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
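
The difference shows up on code points that are lowercase or uppercase without being cased letters. As a rough illustration (the helper case_properties and the choice of sample code points are mine, and a reasonably recent Perl is assumed), the Roman numerals U+2170 and U+2160 are in general category Nl and the circled letter U+24D0 is a symbol, yet all three carry the Lowercase or Uppercase property:

```perl
use strict;
use warnings;

# Hypothetical helper: report which case-related properties a code point matches.
sub case_properties {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return {
        Ll    => ($char =~ /\A\p{Lowercase_Letter}\z/ ? 1 : 0),
        Lower => ($char =~ /\A\p{Lower}\z/            ? 1 : 0),
        Lu    => ($char =~ /\A\p{Uppercase_Letter}\z/ ? 1 : 0),
        Upper => ($char =~ /\A\p{Upper}\z/            ? 1 : 0),
    };
}

# a, SMALL ROMAN NUMERAL ONE, CIRCLED LATIN SMALL LETTER A,
# A, ROMAN NUMERAL ONE
for my $cp (0x0061, 0x2170, 0x24D0, 0x0041, 0x2160) {
    my $p = case_properties($cp);
    printf "U+%04X  Ll=%d Lower=%d  Lu=%d Upper=%d\n",
        $cp, @{$p}{qw(Ll Lower Lu Upper)};
}
```

For U+2170 and U+24D0 this reports Ll=0 but Lower=1, and for U+2160 it reports Lu=0 but Upper=1, which is the gap the advice above is about.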

Remember that \w is really just

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
