How to parse freeform street/postal address out of text, and into components

Question

We do business largely in the United States and are trying to improve user experience by combining all the address fields into a single text area. But there are a few problems:

  • The address the user types may not be correct or in a standard format
  • The address must be separated into parts (street, city, state, etc.) to process credit card payments
  • Users may enter more than just their address (like their name or company with it)
  • Google can do this but the Terms of Service and query limits are prohibitive, especially on a tight budget

Apparently, this is a common question:

Is there a way to isolate an address from the text around it and break it into pieces? Is there a regular expression to parse addresses?

Answer

I saw this question a lot when I worked for an address verification company. I'm posting the answer here to make it more accessible to programmers who are searching around with the same question. The company I was at processed billions of addresses, and we learned a lot in the process.

First, we need to understand a few things about addresses.

Addresses are not regular

This means that regular expressions are out. I've seen it all, from simple regular expressions that match addresses in a very specific format, to this:

/\s+(\d{2,5}\s+)(?![a|p]m\b)(([a-zA-Z|\s+]{1,5}){1,2})?([\s|\,|.]+)?(([a-zA-Z|\s+]{1,30}){1,4})(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)([\s|\,|.|\;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|\,|.]+)?\b(AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|GU|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VA|VI|VT|WA|WI|WV|WY)([\s|\,|.]+)?(\s+\d{5})?([\s|\,|.]+)/i

... to a 900+ line class file that generates a supermassive regular expression on the fly to match even more. I don't recommend either approach; the regex above, for example, makes plenty of mistakes. There isn't an easy magic formula to get this to work. In theory, it's not possible to match all addresses with a regular expression.
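To make that fragility concrete, here is a minimal Python sketch that runs the regex quoted above against the five contrived addresses discussed in the next section, plus one conventionally formatted address. The `pad` and `is_matched` helpers and the Springfield sample are assumptions added for illustration. The padding gives the pattern its best chance, since it requires whitespace before the house number and whitespace or punctuation at the very end.

```python
import re

# The regex quoted above, transcribed as-is; the trailing /i becomes re.IGNORECASE.
ADDRESS_RE = re.compile(
    r"\s+(\d{2,5}\s+)(?![a|p]m\b)(([a-zA-Z|\s+]{1,5}){1,2})?([\s|\,|.]+)?"
    r"(([a-zA-Z|\s+]{1,30}){1,4})"
    r"(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)"
    r"([\s|\,|.|\;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|\,|.]+)?"
    r"\b(AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|GU|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD"
    r"|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX"
    r"|UT|VA|VI|VT|WA|WI|WV|WY)([\s|\,|.]+)?(\s+\d{5})?([\s|\,|.]+)",
    re.IGNORECASE,
)


def pad(text: str) -> str:
    """Hypothetical helper: surround the input with the spaces the pattern demands."""
    return f" {text} "


def is_matched(text: str) -> bool:
    """True if the regex finds an address anywhere in the padded text."""
    return ADDRESS_RE.search(pad(text)) is not None


# The five contrived (but complete) addresses from this answer.
SAMPLES = [
    "102 main street\nAnytown, state",
    "400n 600e #2, 52173",
    "p.o. #104 60203",
    "829 LKSDFJlkjsdflkjsdljf Bkpw 12345",
    "205 1105 14 90210",
]

# A made-up address in the one narrow format the pattern was written for.
CONVENTIONAL = "123 Main St, Springfield, IL 62704"

if __name__ == "__main__":
    for sample in SAMPLES + [CONVENTIONAL]:
        print(repr(sample), "->", "match" if is_matched(sample) else "no match")
```

The conventional address matches, while none of the five complete addresses do: each one lacks something the pattern insists on, such as a known street suffix or a two-letter state code.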

USPS Publication 28 documents the many formats of addresses that are possible, with all their keywords and variations. Worst of all, addresses are often ambiguous. Words can mean more than one thing ("St" can be "Saint" or "Street") and there are words that I'm pretty sure they invented. (Who knew that "Stravenue" was a street suffix?)
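The ambiguity problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AMBIGUOUS_TOKENS` table is a tiny hand-picked subset (the real keyword lists live in Publication 28), and `possible_readings` is a hypothetical helper, not part of any library.

```python
from itertools import product
from typing import List

# Illustrative subset only; a real parser would need the full keyword tables.
AMBIGUOUS_TOKENS = {
    "st": ("Street", "Saint"),
    "ct": ("Court", "Connecticut"),
}


def possible_readings(address: str) -> List[str]:
    """Expand every ambiguous token into each of its meanings.

    Returns one string per combination, so the result grows
    multiplicatively with the number of ambiguous tokens.
    """
    options = []
    for token in address.split():
        key = token.lower().strip(".,")
        # Unknown tokens have exactly one reading: themselves.
        options.append(AMBIGUOUS_TOKENS.get(key, (token,)))
    return [" ".join(choice) for choice in product(*options)]


if __name__ == "__main__":
    for reading in possible_readings("123 St Charles St"):
        print(reading)
```

Two ambiguous tokens already yield four candidate readings, and only knowledge of real street names can tell which one is meant.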

You'd need some code that really understands addresses, and if that code does exist, it's a trade secret. But you could probably roll your own if you're really into that.

Addresses come in unexpected shapes and sizes

Here are some contrived (but complete) addresses:

1)  102 main street
    Anytown, state

2)  400n 600e #2, 52173

3)  p.o. #104 60203

Even these are possibly valid:

4)  829 LKSDFJlkjsdflkjsdljf Bkpw 12345

5)  205 1105 14 90210

Obviously, these are not standardized. Punctuation and line breaks are not guaranteed. Here's what's going on:

  1. Number 1 is complete because it contains a street address and a city and state. With that information, there's enough to identify the address, and it can be considered "deliverable" (with some standardization).

  2. Number 2 is complete because it also contains a street address (with secondary/unit number) and a 5-digit ZIP code, which is enough to identify an address.

  3. Number 3 is a complete post office box format, as it contains a ZIP code.

  4. Number 4 is also complete because the ZIP code is unique, meaning that a private entity or corporation has purchased that address space. A unique ZIP code is for high-volume or concentrated delivery spaces. Anything addressed to ZIP code 12345 goes to General Electric in Schenectady, NY. This example won't reach anyone in particular, but the USPS would still be able to deliver it.

  5. Number 5 is also complete, believe it or not. With just those numbers, the full address can be discovered when parsed against a database of all possible addresses. Filling in the missing directionals, secondary designator, and ZIP+4 code is trivial when you see each number as a component. Here's what it looks like, fully expanded and standardized:

205 N 1105 W Apt 14

Beverly Hills CA 90210-5221
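How a bare string of numbers can resolve to a full address is easier to see with a toy lookup. The Python sketch below is an assumption-laden illustration, not how any real vendor does it: `TOY_DB`, `Record`, and `resolve` are hypothetical, and every record except the first is made up as a decoy. It treats the last number as the ZIP code and the remaining numbers, in order, as the primary number, street name, and unit.

```python
import re
from typing import List, NamedTuple


class Record(NamedTuple):
    primary: str
    predir: str
    street: str
    postdir: str
    unit_type: str
    unit: str
    city: str
    state: str
    zip5: str
    plus4: str

    def standardized(self) -> str:
        """Render the record as two standardized address lines."""
        parts = (self.primary, self.predir, self.street,
                 self.postdir, self.unit_type, self.unit)
        line1 = " ".join(p for p in parts if p)
        return f"{line1}\n{self.city} {self.state} {self.zip5}-{self.plus4}"


# Hypothetical stand-in for a licensed address database.
TOY_DB = [
    Record("205", "N", "1105", "W", "Apt", "14", "Beverly Hills", "CA", "90210", "5221"),
    Record("205", "N", "1105", "W", "Apt", "15", "Beverly Hills", "CA", "90210", "0000"),  # decoy
    Record("310", "S", "400", "E", "", "", "Beverly Hills", "CA", "90210", "0001"),  # decoy
]


def resolve(freeform: str) -> List[Record]:
    """Match the numeric components of the input against the toy database."""
    numbers = re.findall(r"\d+", freeform)
    if len(numbers) < 3:
        return []
    *components, zip5 = numbers
    matches = []
    for rec in TOY_DB:
        if rec.zip5 != zip5:
            continue
        rec_numbers = [rec.primary, rec.street] + ([rec.unit] if rec.unit else [])
        if rec_numbers == components:
            matches.append(rec)
    return matches


if __name__ == "__main__":
    for rec in resolve("205 1105 14 90210"):
        print(rec.standardized())
```

Once a single record matches, the directionals, secondary designator, and ZIP+4 come from the database rather than from the user's input.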

Address data is not your own

In most countries that provide official address data to licensed vendors, the address data itself belongs to the governing agency. In the US, the USPS owns the addresses. The same is true for Canada Post, Royal Mail, and others, though each country enforces or defines ownership a little differently. Knowing this is important, since it usually forbids reverse-engineering the address database. You have to be careful how you acquire, store, and use the data.

Google Maps is a common go-to for quick address fixes, but the TOS is rather prohibitive; for example, you can't use their data or APIs without showing a Google Map, and for non-commercial purposes only (unless you pay), and you can't store the data (except for temporary caching). Makes sense. Google's data is some of the best in the world. However, Google Maps does not verify the address. If an address does not exist, it will still show you where the address would be if it did exist (try it on your own street; use a house number that you know doesn't exist). This is useful sometimes, but be aware of it.

Nominatim's usage policy is similarly limiting, especially for high volume and commercial use, and the data is mostly drawn from free sources, so it isn't as well maintained (such is the nature of open projects) -- however, this may still suit your needs. It is supported by a great community.

The USPS itself has an API, but it goes down a lot and comes with neither guarantees nor support. It might also be hard to use. Some people use it sparingly with no problems. But it's easy to miss that the USPS requires that you use their API only for confirming addresses to ship through them.

People expect addresses to be hard

Unfortunately, we've conditioned our society to expect addresses to be complicated. There are dozens of good UX articles all over the Internet about this, but the fact is, if you have an address form with individual fields, that's what users expect, even though it makes things harder for edge-case addresses that don't fit the format the form is expecting. Maybe the form requires a field it shouldn't, or users don't know where to put a certain part of their address.

I could go on and on about the bad UX of checkout forms these days, but instead I'll just say that combining the address fields into a single field will be a welcome change -- people will be able to type their address how they see fit, rather than trying to figure out your lengthy form. However, this change will be unexpected and users may find it a little jarring at first. Just be aware of that.

Part of this pain can be alleviated by putting the country field out front, before the address. When they fill out the country field first, you know how to make your form appear. Maybe you have a good way to deal with single-field US addresses, so if they select United States, you can reduce your form to a single field, otherwise show the component fields. Just things to think about!
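As a rough sketch of that country-first idea, the Python snippet below picks the form layout from the selected country. The field names, the `SINGLE_FIELD_COUNTRIES` set, and `address_fields_for` are all assumptions made for illustration; adapt them to whatever your form framework expects.

```python
from typing import List

# Countries for which a single freeform field is assumed to be parseable.
SINGLE_FIELD_COUNTRIES = {"US"}

# Hypothetical fallback layout with one field per address component.
COMPONENT_FIELDS = ["street", "unit", "city", "state_or_province", "postal_code"]


def address_fields_for(country_code: str) -> List[str]:
    """Return the address inputs to render once the country is known."""
    if country_code.strip().upper() in SINGLE_FIELD_COUNTRIES:
        return ["address"]
    return list(COMPONENT_FIELDS)


if __name__ == "__main__":
    print(address_fields_for("us"))
    print(address_fields_for("CA"))
```

Keeping the decision in one small function makes it easy to add more countries to the single-field set as your parsing support grows.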

Now we know why it's hard; what can you do about it?

The USPS licenses vendors through a process called CASS™ Certification to provide verified addresses to customers. These vendors have access to the USPS database, updated monthly. Their software must conform to rigorous standards to be certified, and they rarely require agreement to terms as limiting as those discussed above.

There are many CASS-Certified companies that can process lists or have APIs: Melissa Data, Experian QAS, and SmartyStreets to name a few.

(Due to getting flak for "advertising" I've truncated my answer at this point. It's up to you to find a solution that works for you.)

The Truth: Really, folks, I don't work at any of these companies. It's not an advertisement.
