Parsing street addresses in Ruby


Question

I am processing addresses into their respective database fields. I can extract the house number and the street type, but I am trying to determine the best way to get the street name without the number and the last word. A standard street address as received looks like this:

    res[:address] = '7707 Foo Bar Blvd'

As of now I can parse the following:

    house = res[:address].gsub(/\D/, '')
    street_type = res[:address].split(/\s+/).last

My first challenge is how to get 'Foo Bar'. Note that the street name could be one, two, or three words. I am struggling to find a one-line expression for this in Ruby.

My second question is how to improve the 'house' code so that it handles house numbers with a letter at the end, for example "7707B".

Lastly, if you can point to a good cheat sheet with examples for these expressions, that would be helpful.

Recommended answer

I'd recommend using a library for this if possible, since address parsing can be difficult. Check out the Indirizzo Ruby gem, which makes this easy:

    require 'Indirizzo'
    address = Indirizzo::Address.new("7707 Foo Bar Blvd")
    address.number
    # => "7707"
    address.street
    # => ["foo bar blvd", "foo bar boulevard"]

Even if you don't use the Indirizzo library itself, reading through its source code is probably very useful for seeing how its authors solved the problem. For instance, it has finely tuned regular expressions to match different parts of an address:

    Match = {
      # FIXME: shouldn't have to anchor :number and :zip at start/end
      :number   => /^(\d+\W|[a-z]+)?(\d+)([a-z]?)\b/io,
      :street   => /(?:\b(?:\d+\w*|[a-z'-]+)\s*)+/io,
      :city     => /(?:\b[a-z][a-z'-]+\s*)+/io,
      :state    => State.regexp,
      :zip      => /\b(\d{5})(?:-(\d{4}))?\b/o,
      :at       => /\s(at|@|and|&)\s/io,
      :po_box => /\b[P|p]*(OST|ost)*\.*\s*[O|o|0]*(ffice|FFICE)*\.*\s*[B|b][O|o|0][X|x]\b/
    }
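To see how the :number pattern above deals with a trailing letter such as "7707B", here is a minimal sketch in plain Ruby that does not need the gem. The regular expression is copied from the excerpt above; the constant name NUMBER_PATTERN and the helper split_house_number are made up for illustration and are not part of Indirizzo.

```ruby
# Copy of the :number pattern quoted above from Indirizzo's source.
NUMBER_PATTERN = /^(\d+\W|[a-z]+)?(\d+)([a-z]?)\b/io

# Hypothetical helper (not part of Indirizzo): returns [digits, letter_suffix],
# or nil when the string does not start with a house number.
def split_house_number(address)
  match = NUMBER_PATTERN.match(address)
  return nil unless match

  # Group 2 holds the digits, group 3 the optional trailing letter.
  [match[2], match[3]]
end

p split_house_number("7707 Foo Bar Blvd")   # => ["7707", ""]
p split_house_number("7707B Foo Bar Blvd")  # => ["7707", "B"]
p split_house_number("Foo Bar Blvd")        # => nil
```

The second capture group holds the digits and the third holds the optional letter, so the two parts can be stored separately or joined back together as needed.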

These files from its source code give more specifics:

  • https://github.com/daveworth/Indirizzo/blob/master/lib/indirizzo/address.rb
  • https://github.com/daveworth/Indirizzo/blob/master/lib/indirizzo/constants.rb
  • https://github.com/daveworth/Indirizzo/blob/master/lib/indirizzo/numbers.rb

(But I would also generally agree with @drhenner's comment that, to make this easier on yourself, you could probably just accept these data inputs in separate fields.)

Edit: To give a more specific answer about how to remove the street suffix (e.g., "Blvd"), you could use Indirizzo's regular expression constants (such as Suffix_Type from constants.rb) like so:

    address = Indirizzo::Address.new("7707 Foo Bar Blvd", :expand_streets => false)
    address.street.map {|street| street.gsub(Indirizzo::Suffix_Type.regexp, '').strip }
    # => ["foo bar"]

(Notice I also passed :expand_streets => false to the initializer, to avoid having both the "Blvd" and "Boulevard" alternatives expanded, since we're discarding the suffix anyway.)
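If adding a gem is not an option, the one-line expression asked for in the question can be approximated with a single regular expression in plain Ruby. This is only a rough sketch under strong assumptions: the address starts with a house number (optionally followed by one letter), ends with a one-word street type, and has at least one word in between. It does not handle unit numbers, directional suffixes, or PO boxes. The names ADDRESS_PATTERN and split_address are hypothetical.

```ruby
# Hypothetical pattern: house number with an optional trailing letter,
# then the street name (one or more words), then the last word as street type.
ADDRESS_PATTERN = /\A(?<house>\d+[a-z]?)\s+(?<street>.+)\s+(?<type>\S+)\z/i

# Hypothetical helper: returns a hash of the three parts,
# or nil when the address does not fit the assumed shape.
def split_address(address)
  match = ADDRESS_PATTERN.match(address.strip)
  return nil unless match

  { house: match[:house], street: match[:street], type: match[:type] }
end

p split_address("7707 Foo Bar Blvd")
# => {:house=>"7707", :street=>"Foo Bar", :type=>"Blvd"} (newer Rubies print {house: "7707", ...})
p split_address("7707B Main St")[:house]  # => "7707B"
p split_address("Foo Bar Blvd")           # => nil
```

The greedy `.+` takes as many words as it can while still leaving one final word for the street type, which is why one-, two-, and three-word street names all work without special cases.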
