Scanning for Unicode Numbers in a string with \d

Question

According to the Oniguruma documentation, the \d character type matches:

decimal digit char
Unicode: General_Category -- Decimal_Number

However, scanning for \d in a string with all the Decimal_Number characters results in only latin 0-9 digits being matched:

#encoding: utf-8
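require 'open-uri'
# Kernel#open fetched URLs under open-uri in Ruby 1.9; on Ruby 3.0+ use URI.open instead.
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
# Collect the four-hex-digit code points (U+XXXX) listed on the page and pack them into one UTF-8 string.
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')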

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
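
The same observation can be reproduced without fetching a web page. The following is a minimal sketch, not part of the original question: it walks every code point, keeps the ones Ruby's regex engine classifies as `\p{Nd}`, and then scans the result with `\d`. The names `SURROGATES`, `all_nd` and `ascii_hits` are made up for illustration, and the size of `all_nd` depends on the Unicode version bundled with the Ruby in use.

```ruby
# encoding: utf-8
# Sketch: build the Decimal_Number (Nd) set locally instead of scraping a page.
SURROGATES = (0xD800..0xDFFF)  # not valid scalar values, so skip them

all_nd = (0..0x10FFFF)
  .reject { |cp| SURROGATES.include?(cp) }
  .map    { |cp| [cp].pack('U') }   # one-character UTF-8 string per code point
  .select { |ch| ch =~ /\p{Nd}/ }   # keep only General_Category Decimal_Number
  .join

ascii_hits = all_nd.scan(/\d/)      # what \d actually matches

puts all_nd.length                  # count varies with Ruby's Unicode version
p ascii_hits
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
```

Unlike the `[\da-f]{4}` pattern in the original script, this approach also covers code points beyond U+FFFF, at the cost of roughly a million regex tests.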

Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and/or is there a way to make it do so?

Recommended answer

As noted by Brian Candler on ruby-talk:

  • \w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
  • \d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.

The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:

\w  word character  
    Not Unicode: alphanumeric, "_" and multibyte char.  
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)

In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.

This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.

p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
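
The same comparison for digits can be sketched as follows. This example is an addition, not part of the original answer: the sample string and variable names are hypothetical, and `\p{Nd}` is Ruby's Unicode property syntax for General_Category Decimal_Number, shown here as an alternative to the `[[:digit:]]` workaround quoted above.

```ruby
# encoding: utf-8
# Hypothetical sample: ASCII "0", ARABIC-INDIC DIGIT THREE (U+0663),
# DEVANAGARI DIGIT FIVE (U+096B).
sample = "0\u0663\u096B"

ascii_only   = sample.scan(/\d/)            # ASCII 0-9 only
posix_digits = sample.scan(/[[:digit:]]/)   # POSIX bracket, Unicode-aware
property_nd  = sample.scan(/\p{Nd}/)        # Unicode property Decimal_Number

p ascii_only.length     #=> 1
p posix_digits.length   #=> 3
p property_nd.length    #=> 3
```

Note that matching a character is separate from converting it: `String#to_i` only understands ASCII digits, so a matched non-ASCII digit still needs its own mapping to a numeric value.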

It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)

Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc
