Unicode and :alpha:

Question

Why is this false:

iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false

but this is true?

iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
true

Is [:alpha:] sometimes Unicode-aware and sometimes not?

I don't think my original example was clear enough.

Why is this false:

iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
false

but this is true?

iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
true


Recommended answer

When you pass the string to the regex in non-Unicode mode, it is treated as an array of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語") (12, the number of bytes the input consists of: 230,177,137,232,175,173,230,188,162,232,170,158) with IO.puts String.length("汉语漢語") (4, the Unicode "letters"). Some of those bytes cannot be matched by the [:alpha:] POSIX character class. So the first expression fails, because ^[[:alpha:]]+$ requires every byte from start to end to match, while the second succeeds, because an unanchored [[:alpha:]] only needs a single matching byte somewhere in the input.
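The byte-versus-character difference described above can be inspected directly. The following is a minimal sketch, not part of the original answer; the variable names are only illustrative. The last step probes each byte against [[:alpha:]] without the u modifier. Which bytes pass depends on the character tables of the regex engine in your Erlang/OTP build, so no expected output is given for that part.

```elixir
s = "汉语漢語"

# Size in bytes: each of these four characters takes 3 bytes in UTF-8.
byte_count = byte_size(s)

# Length as counted by the String module (Unicode graphemes).
char_count = String.length(s)

# The raw bytes that a regex without the u modifier operates on.
bytes = :binary.bin_to_list(s)

# Bytes that match [[:alpha:]] on their own in non-Unicode mode.
# The result depends on the regex engine's character tables.
alpha_bytes =
  Enum.filter(bytes, fn b -> Regex.match?(~r/^[[:alpha:]]$/, <<b>>) end)

IO.inspect(byte_count, label: "byte_size")
IO.inspect(char_count, label: "String.length")
IO.inspect(bytes, label: "bytes", charlists: :as_lists)
IO.inspect(alpha_bytes, label: "bytes matching [[:alpha:]]", charlists: :as_lists)
```

If at least one byte passes the probe but not all twelve do, that is consistent with the unanchored pattern matching while the anchored one fails.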

To properly match Unicode strings with the PCRE regex library (which Elixir uses), you need to enable Unicode mode with the /u modifier:

IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)

See the IDEONE demo (prints true).
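For a side-by-side view, here is a small sketch that runs the anchored pattern from the question with and without the u modifier. The variable names are illustrative only.

```elixir
s = "汉语漢語"

# Without u: the regex sees 12 bytes, and not all of them are alphabetic.
without_u = String.match?(s, ~r/^[[:alpha:]]+$/)

# With u: the regex sees 4 Unicode letters.
with_u = String.match?(s, ~r/^[[:alpha:]]+$/u)

IO.inspect({without_u, with_u})
# prints {false, true}
```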

See the Elixir regex reference:


unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
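As a rough illustration of what the quoted modifier description means in practice, the sketch below tries \p{L}, \p{Han} and \w on the string from the question. These patterns are not in the original answer and are shown only as examples of "unicode specific patterns" and of \w changing behaviour under u.

```elixir
s = "汉语漢語"

# \p{L} (any Unicode letter) and \p{Han} (Han script) with the u modifier.
letters = String.match?(s, ~r/^\p{L}+$/u)
han = String.match?(s, ~r/^\p{Han}+$/u)

# \w with and without u: only the u version treats the input as
# Unicode word characters rather than raw bytes.
word_with_u = String.match?(s, ~r/^\w+$/u)
word_without_u = String.match?(s, ~r/^\w+$/)

IO.inspect(letters: letters, han: han, word_with_u: word_with_u, word_without_u: word_without_u)
```

If you only care about letters from a particular script, a script property such as \p{Han} is narrower than [[:alpha:]] or \p{L}.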
