Ruby super-insensitive Regex to match school names with accents and other diacritics

Question

This question has been asked for other programming languages, but how would you perform an accent-insensitive regex match in Ruby?

My current code is something like

scope :by_registered_name, ->(regex){
  where(:name => /#{Regexp.escape(regex)}/i)
}

I thought maybe I could replace every character that is not alphanumeric or whitespace with a dot and remove the escape, but is there not a better way? I'm afraid I could catch weird things if I do that...

I am targeting French right now, but if I could also fix it for other languages that would be cool.

I am using Ruby 2.3 if that can help.


I realize my requirements are actually a bit stronger: I also need to catch things like dashes, etc. I am basically importing a school database (URL here, the tag is <nom>), and I want people to be able to find their school by typing its name. Both the stored names and the search query may contain accents, so I believe the easiest way would be to make both sides accent-insensitive.

  • "Télécom" should be matched by "Telecom"
  • "établissement" should be matched by "etablissement"
  • "Institut supérieur national de l'artisanat - Chambre de métiers et de l'Artisanat en Moselle" should be matched by "artisanat chambre de métiers"
  • "Ecole hôtelière d'Avignon (CCI du Vaucluse)" should be matched by "Ecole hoteliere d'avignon" (it's okay to skip the parentheses)
  • "Ecole française d'hôtesses" should be matched by "ecole francaise d'hot"

I also found some crazy stuff in that DB, so I think I will consider sanitizing this input:

  • "Académie internationale de management - Hotel & Tourism Management Academy" should be matched by "Hotel Tourism" (note the & is actually written &amp; in the XML)

Solution

It looks like the solution for MongoDB is to use a text index, which is diacritic insensitive. French is supported.

It's been a long time since I last used MongoDB, but if you're using Mongoid I think you would create a text index in your model like this:

index(name: "text")

...and then search like this:

scope :by_registered_name, ->(str) {
  where(:$text => { :$search => str })
}

Consult the documentation for the $text query operator for more information.

Original (wrong) answer

As it turns out I was thinking about the question backwards, and wrote this answer initially. I'm preserving it since it might still come in handy. If you were using a database that didn't offer this kind of functionality (like, it seems, MongoDB does), a possible workaround would be to use the following technique to store a sanitized name along with the original name in the database, and then likewise sanitize queries.
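As a rough illustration of that store-and-sanitize workaround, here is a minimal plain-Ruby sketch that does not depend on Rails. The names search_key and name_matches? are hypothetical helpers invented for this example, and the approach (decompose with String#unicode_normalize, strip combining marks, collapse punctuation to spaces) is an assumption about what "sanitized" could mean here, not something taken from the question's code base. It assumes UTF-8 strings and Ruby 2.2 or later.

```ruby
# Hypothetical helper: fold a name into a lowercase, accent-free search key.
def search_key(text)
  text
    .unicode_normalize(:nfd)   # split "é" into "e" + a combining accent
    .gsub(/\p{Mn}/, "")        # drop the combining marks (the accents)
    .downcase
    .gsub(/[^a-z0-9]+/, " ")   # dashes, quotes, "&", parentheses become one space
    .strip
end

# Hypothetical helper: true when the sanitized query occurs in the sanitized name.
def name_matches?(stored_name, query)
  search_key(stored_name).include?(search_key(query))
end

search_key("Télécom")
# => "telecom"
name_matches?("Ecole française d'hôtesses", "ecole francaise d'hot")
# => true
name_matches?("Académie internationale de management - Hotel & Tourism Management Academy",
              "Hotel Tourism")
# => true
```

In a database-backed version you would presumably store search_key(name) in its own column or field when importing, and compare it against search_key(query) at search time. Note that this sketch simply discards letters that have no decomposition (for example "œ"), and that XML entities such as &amp; would need to be decoded before the key is built.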

Since you're using Rails you can use the handy ActiveSupport::Inflector.transliterate:

regex = /aäoöuü/
transliterated = ActiveSupport::Inflector.transliterate(regex.source, '\?')
# => "aaoouu"
new_regex = Regexp.new(transliterated)
# => /aaoouu/

Or simply:

Regexp.new(ActiveSupport::Inflector.transliterate(regex.source, '\?'))

You'll note that I supplied '\?' as the second argument. That is the replacement string substituted for any character that has no ASCII approximation. I did this because the default replacement string is "?", which as you know has special meaning in a regular expression.
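To see why the escaped replacement matters, here is a small plain-Ruby sketch. The pattern strings are made up for illustration; they stand in for a transliterated regex source in which one character could not be converted.

```ruby
# With the default "?" replacement, the "?" becomes a quantifier:
# "caf?" means "ca" followed by an optional "f".
unescaped = Regexp.new("caf?")

# With '\?' as the replacement, the "?" is matched literally.
escaped = Regexp.new('caf\?')

unescaped =~ "ca"    # => 0   (matches, probably not what was intended)
escaped =~ "ca"      # => nil
escaped =~ "caf?"    # => 0
```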

Also note that ActiveSupport::Inflector.transliterate does a little bit more than the similar I18n.transliterate. Here's its source:

def transliterate(string, replacement = "?")
  I18n.transliterate(ActiveSupport::Multibyte::Unicode.normalize(
    ActiveSupport::Multibyte::Unicode.tidy_bytes(string), :c),
      :replacement => replacement)
end

The innermost method call, ActiveSupport::Multibyte::Unicode.tidy_bytes, cleans up any invalid UTF-8 characters.

More importantly, ActiveSupport::Multibyte::Unicode.normalize "normalizes" the characters. For example, ê can look like one character while actually being two: LATIN SMALL LETTER E (U+0065) followed by COMBINING CIRCUMFLEX ACCENT (U+0302). Calling I18n.transliterate on that two-character form would yield e?, which probably isn't what you want, so normalize is called first to turn it into the single character LATIN SMALL LETTER E WITH CIRCUMFLEX (U+00EA), which transliterates to e. That normalize step before transliterate is therefore important. (If you're interested in how that works, read about Unicode equivalence and normalization.)
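The difference between the two forms can be checked in plain Ruby. This sketch uses String#unicode_normalize from the standard library (Ruby 2.2+) rather than the ActiveSupport method quoted above, so treat it as an illustration of the same idea, not of that exact API.

```ruby
decomposed = "e\u0302"                          # "e" + COMBINING CIRCUMFLEX ACCENT
composed   = decomposed.unicode_normalize(:nfc) # composed form, like the :c form above

decomposed.length        # => 2
composed.length          # => 1
composed == "\u00EA"     # => true  (LATIN SMALL LETTER E WITH CIRCUMFLEX)
decomposed == composed   # => false (they render alike but are different strings)
```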
