Find a US street address in text (preferably using Python regex)


Question


Disclaimer: I read very carefully this thread: Street Address search in a string - Python or Ruby and many other resources.

Nothing works for me so far.

In more detail, here is what I am looking for.

The rules are relaxed, and I am definitely not asking for perfect code that covers all cases; just a few simple, basic ones, assuming the address is in the following format:

a) Street number (1...N digits);

b) Street name : one or more words capitalized;

b-2) (optional) ideally, the street name may be prefixed with a direction abbreviation: "S.", "N.", "E.", "W."

c) (optional) unit/apartment/etc can be any (incl. empty) number of arbitrary characters

d) Street "type": one of ("st.", "ave.", "way");

e) City name : 1 or more Capitalized words;

f) (optional) state abbreviation (2 letters)

g) (optional) zip which is any 5 digits.

None of the above needs to be a valid thing (e.g. an existing city or zip).

So far I have been trying expressions like this one:

>>> import re
>>> pat = re.compile(r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?', re.IGNORECASE)
>>> pat.search("123 East Virginia avenue, unit 123, San Ramondo, CA, 94444")

This doesn't work, and it is not easy for me to understand why. Specifically: how do I separate, in my pattern, a group of arbitrary words from one of a set of specific words that should follow, such as a state abbreviation or a street "type" ("st.", "ave.")?

Anyhow, here is an example of what I am hoping to get. Given:

def ex_addr(text):
    # does the re magic
    # returns 1st address (all addresses?) or None if nothing found
    ...

for t in [
    'The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
    'The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
    'Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
    'Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
    'This was written in 1999 in Montreal',
    "Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
    "We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!",
]:
    print(ex_addr(t))

I would like to get:

'22 West Westin st., South Carolina, 12345'
'22 West Westin street, SC, 12345'
'123 S. Vancouver ave. in Ottawa'
'123 S. Vancouver avenue in Ottawa'
None # for 'This was written in 1999 in Montreal',
"420 Funny Lane, Cupertino CA",
"12321 Mammoth Lane, Lexington MA 77777"

Could you please help?

Answer

\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?

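In this regex, you have one space too many: the literal space before the second ( \w+){1,5} group is redundant, because the group itself already begins with a space. As written, the pattern requires a comma followed by two spaces before the city name. With that extra space removed, the regex matches your example.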
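A minimal sketch of that fix, using the sample string from the question. The variable names `original` and `fixed` are only illustrative; the one change between the two patterns is the removed space before the second `( \w+){1,5}` group.

```python
import re

TEXT = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"

# Pattern from the question: ", ( \w+)" needs a comma and TWO spaces before the city.
original = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*), ( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE,
)

# Same pattern with the literal space before the second "( \w+)" group removed.
fixed = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*),( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE,
)

print(original.search(TEXT))  # None

match = fixed.search(TEXT)
print(match.group(0) if match else None)
```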

I don't think you can assume that a "unit 123" or similar will always be there, and there might be several such parts (e.g. "building A, apt 3"). Note also that in your initial regex the . can match a comma, so (.*) may run across several comma-separated fields and produce very long (and unwanted) matches. You should probably accept several such groups, with a limit on their number and length: for example, replace , (.*) with something like (, [^,]{1,20}){0,5}.
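The sketch below compares the two approaches. It assumes the bounded group is combined with the space fix described earlier; the names `greedy` and `bounded` and the sample strings (other than the one from the question) are made up for illustration.

```python
import re

# Space-fixed pattern that still uses the greedy "(.*)" for the unit part.
greedy = re.compile(
    r'\d{1,4}( \w+){1,5}, (.*),( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE,
)

# Assumed variant: ", (.*)" replaced by up to five short, comma-free fields.
bounded = re.compile(
    r'\d{1,4}( \w+){1,5}(, [^,]{1,20}){0,5},( \w+){1,5}, (AZ|CA|CO|NH), [0-9]{5}(-[0-9]{4})?',
    re.IGNORECASE,
)

FIRST = "123 East Virginia avenue, unit 123, San Ramondo, CA, 94444"
TWO_ADDRESSES = FIRST + " and later 5 Main street, suite 9, Other Town, CA, 90000"

samples = [
    "123 East Virginia avenue, San Ramondo, CA, 94444",                     # no unit
    FIRST,                                                                  # one unit
    "123 East Virginia avenue, building A, apt 3, San Ramondo, CA, 94444",  # two units
]
for s in samples:
    m = bounded.search(s)
    print(m.group(0) if m else None)

# With "(.*)" the match runs on to the end of the second address;
# the bounded version stops after the first one.
print(greedy.search(TWO_ADDRESSES).group(0))
print(bounded.search(TWO_ADDRESSES).group(0))
```

The length limit of 20 and the count limit of 5 are the values suggested above; they are arbitrary cut-offs and worth tuning against real data.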

In any case, you will probably never get something 100% accurate that accepts every variation people might throw at it. Do lots of tests! Good luck.
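As one way of "doing lots of tests", here is a rough sketch of the `ex_addr` function from the question, checked against the question's own sample strings. The pattern is an assumption of this sketch, not something given in the answer: it follows rules a) to g) loosely, adds "street", "avenue" and "lane" to the street types because the expected outputs contain them, accepts either a comma or the word "in" before the city, and ignores the optional unit part (rule c) entirely.

```python
import re

# Hypothetical pattern; every choice below is an assumption made for this sketch.
ADDRESS_RE = re.compile(r"""
    \b\d{1,5}\s+                            # a) street number
    (?:[NSEW]\.\s+)?                        # b-2) optional direction prefix
    (?:[A-Z][a-z]+\s+){1,3}                 # b) capitalized street-name words
    (?:[Ss]t\.|[Ss]treet|[Aa]ve\.|[Aa]venue|[Ww]ay|[Ll]ane)  # d) street type
    (?:,\s*|\s+in\s+)                       # comma or "in" before the city
    [A-Z][A-Za-z]*(?:\s+[A-Z][a-z]+)*       # e) city: capitalized words
    (?:,?\s+[A-Z]{2}\b)?                    # f) optional two-letter state
    (?:,?\s+\d{5}\b)?                       # g) optional five-digit zip
""", re.VERBOSE)


def ex_addr(text):
    """Return the first address-like substring, or None if nothing matches."""
    m = ADDRESS_RE.search(text)
    return m.group(0) if m else None


# Sample strings and expected results taken from the question.
CASES = [
    ('The meeting will be held at 22 West Westin st., South Carolina, 12345 on Nov.-18',
     '22 West Westin st., South Carolina, 12345'),
    ('The meeting will be held at 22 West Westin street, SC, 12345 on Nov.-18',
     '22 West Westin street, SC, 12345'),
    ('Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver ave. in Ottawa? \nThanks!!!',
     '123 S. Vancouver ave. in Ottawa'),
    ('Hi there,\n How about meeting tomorr. @10am-sh in Chadds @ 123 S. Vancouver avenue in Ottawa? \nThanks!!!',
     '123 S. Vancouver avenue in Ottawa'),
    ('This was written in 1999 in Montreal', None),
    ("Cool cafe at 420 Funny Lane, Cupertino CA is way too cool",
     "420 Funny Lane, Cupertino CA"),
    ("We're at a party at 12321 Mammoth Lane, Lexington MA 77777; Come have a beer!",
     "12321 Mammoth Lane, Lexington MA 77777"),
]

for text, expected in CASES:
    result = ex_addr(text)
    print("ok  " if result == expected else "FAIL", repr(result))
```

Keeping the cases in a table like this makes it cheap to add each new failing address as it turns up, which is probably the most realistic way to improve such a pattern.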
