Unicode and :alpha:
Problem description
Why is this false:
iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false
but this is true?
iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
true
Sometimes [:alpha:] is Unicode-aware and sometimes it's not?
I don't think my original example was clear enough.
Why is this false:
iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
false
but this is true?
iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
true
Recommended answer
When you pass the string to the regex in non-Unicode mode, it is treated as an array of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語") (12, the bytes the input consists of: 230,177,137,232,175,173,230,188,162,232,170,158) with IO.puts String.length("汉语漢語") (4, the Unicode "letters"). Some of those bytes cannot be matched by the [:alpha:] POSIX character class. Thus the first expression, which is anchored and requires every byte to match, fails, while the second succeeds because it needs only one matching byte to return a valid match.
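To see the byte-level view for yourself, here is a minimal sketch. The AlphaBytes module and its function names are hypothetical and exist only for illustration. It lists the raw bytes of a string and checks each byte on its own against [[:alpha:]] without the u modifier. Which individual bytes count as alphabetic depends on the character tables of the underlying regex engine, so the per-byte results are left for you to inspect rather than stated here.

```elixir
# Illustrative sketch only; AlphaBytes is a hypothetical helper module.
defmodule AlphaBytes do
  # The raw UTF-8 bytes that a regex without the u modifier operates on.
  def bytes(string), do: :binary.bin_to_list(string)

  # Test every byte individually against [[:alpha:]] in non-Unicode mode.
  def alpha_by_byte(string) do
    for byte <- bytes(string) do
      {byte, Regex.match?(~r/^[[:alpha:]]$/, <<byte>>)}
    end
  end
end

text = "汉语漢語"

IO.inspect(byte_size(text), label: "bytes")          # 12
IO.inspect(String.length(text), label: "graphemes")  # 4

# Print the bytes as a plain integer list rather than a charlist.
IO.inspect(AlphaBytes.bytes("汉"), label: "bytes of 汉", charlists: :as_lists)

# One {byte, matches?} tuple per byte; a single non-matching byte is
# enough to make the anchored ^[[:alpha:]]+$ pattern fail.
IO.inspect(AlphaBytes.alpha_by_byte("汉"), label: "alpha per byte")
```

If any tuple in the last line has false as its second element, the anchored pattern cannot match the whole string, while the unanchored pattern still succeeds as soon as one tuple has true.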
To match Unicode strings properly with the PCRE regex library (which Elixir uses), you need to enable Unicode mode with the /u modifier:
IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)
See the IDEONE demo (prints true).
See the Elixir Regex reference:
unicode (u) - enables Unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on Unicode. It expects valid Unicode strings to be given on match.
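As a rough illustration of what the u modifier enables, the sketch below runs a few patterns against the string from the question. The \p{L} (any letter) and \p{Han} (Han script) properties are standard PCRE Unicode properties; choosing them here is my own example and not part of the original answer.

```elixir
text = "汉语漢語"

# POSIX class with the u modifier, as in the answer.
IO.inspect(Regex.match?(~r/^[[:alpha:]]+$/u, text))  # true

# Unicode property: any kind of letter.
IO.inspect(Regex.match?(~r/^\p{L}+$/u, text))        # true

# Unicode script property: Han characters only.
IO.inspect(Regex.match?(~r/^\p{Han}+$/u, text))      # true

# \w also matches Unicode letters once u is set.
IO.inspect(Regex.match?(~r/^\w+$/u, text))           # true

# A digit is not a letter, so the anchored letter pattern fails.
IO.inspect(Regex.match?(~r/^\p{L}+$/u, "汉语1"))      # false
```

Since the u modifier expects valid Unicode input, these patterns are meant for well-formed UTF-8 strings rather than arbitrary binaries.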