General Address Parser for Freeform Text


Question

We have a program that displays map data (think Google Maps, but with much more interactivity and custom layers for our clients).

We allow navigation via a set of combo boxes that prefill certain fields with data. For example, choose Country: Canada and the Province field is populated; select Ontario and a list of counties/regions is populated; select a county/region and the list of cities is populated, and so on.

While this guarantees accurate addresses, it's a pain for users who don't know where a street address or a city is located (e.g., which county/region is Kitchener in?).

So we are looking at building an address parser around a freeform text field.

The user could enter something like this (similar to Google Maps, Bing Maps, etc.): 22 Main St, Kitchener, On

We could then compartmentalize it into sections, do lookups on the data, and get to the point they are looking for (or suggest alternatives).

The problem is: how do we properly compartmentalize the information? How do we break the input into sections and find possible matches? Obviously, we can't be guaranteed that the user will enter data in the format we expect. A follow-up question is how to present the results if we don't find an exact match, or if we find multiple exact matches (two cities in different counties with the same street name, for example).

We have a ton of data available in the mapping data (mostly in MapInfo TAB format), so we can do quick scans of street names, cities, states, etc. But I'm not sure about the best way to approach this problem. Sure, using Google Maps would be nice, but most of our clients are on closed networks where outside access is not usually allowed, and most aren't willing to rely on Google Maps (since it doesn't contain as much information as they need, such as custom map layers). They could, obviously, go to Google, get the proper location, and then move to our software, but this would be time consuming, and the speed of the process can be quite important.

Recommended Answer

This is essentially a class of the Named Entity Recognition (NER) problem (see NER on Wikipedia).

The best way to approach this is to parse the address using a language transducer to identify the various constructs. The approach is similar to using regular expressions with a finite state machine.

I've had great success with the Java NLP and machine learning framework called GATE; its transducer library is called JAPE. Check out their GUI, and use that to write some Java code for it!

Their built-in examples should get you started with the basics, and you can then extend them as needed. Essentially, it compartmentalizes text into components using the rules and the rule engine, so something like,

Xyz, Blah St,
Foo City, 11110, CA

would be translated to,

Place: Xyz
Street: Blah St
City: Foo
...

And then you can use your database of locations to do matches.
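
To make the compartmentalization step concrete, here is a minimal sketch in plain Python. It is not JAPE and not how GATE does it internally; it only illustrates the idea of ordered pattern rules plus a small amount of positional state. The names RULES, classify and parse_address, the short street suffix list, the 5-digit postal code, and the two-letter region code are all assumptions made for this example.

```python
import re

# Hypothetical street suffixes; a real list would come from your own street data.
STREET_SUFFIXES = r"(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Dr|Drive)"

# Ordered rules: the first pattern that matches a field decides its label.
RULES = [
    ("postal_code", re.compile(r"^\d{5}$")),       # assumes 5-digit codes only
    ("region", re.compile(r"^[A-Za-z]{2}$")),      # assumes two-letter state/province codes
    ("street", re.compile(rf"^(?:\d+\s+)?.+\s{STREET_SUFFIXES}\.?$", re.IGNORECASE)),
]


def classify(field):
    """Return the label of the first rule matching the field, or None."""
    for label, pattern in RULES:
        if pattern.match(field):
            return label
    return None


def parse_address(text):
    """Split freeform text on commas/newlines and label each field."""
    fields = [f.strip() for f in re.split(r"[,\n]+", text) if f.strip()]
    result = {}
    for field in fields:
        label = classify(field)
        if label is None:
            # Positional state: unlabelled text before the street is treated
            # as a place name, unlabelled text after it as a city.
            label = "place" if "street" not in result else "city"
        result.setdefault(label, field)  # keep the first value seen per label
    return result


if __name__ == "__main__":
    print(parse_address("22 Main St, Kitchener, On"))
    print(parse_address("Xyz, Blah St,\nFoo City, 11110, CA"))
```

Note that this sketch keeps "Foo City" as the city value rather than trimming it to "Foo"; normalizing such values, and handling inputs without commas, is where a real rule engine earns its keep.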

JAPE also supports dictionary lookups in addition to rules. So if you already have "Blah St" in your database, and it has two parents (cities Foo and Bar), you can disambiguate by parsing the next line.
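
The dictionary-lookup idea can be sketched the same way. This is a rough illustration in plain Python, not GATE's gazetteer API; GAZETTEER, build_index and resolve are hypothetical names, and the rows stand in for whatever your mapping data provides.

```python
from collections import defaultdict

# Hypothetical (street, city) rows; in practice these come from the mapping data.
GAZETTEER = [
    ("Blah St", "Foo"),
    ("Blah St", "Bar"),
    ("Main St", "Kitchener"),
]


def build_index(rows):
    """Map each lowercased street name to the set of cities that contain it."""
    index = defaultdict(set)
    for street, city in rows:
        index[street.lower()].add(city)
    return index


def resolve(street, city_hint, index):
    """Return sorted candidate cities, narrowed by the city hint when it matches."""
    candidates = index.get(street.lower(), set())
    if city_hint:
        narrowed = {c for c in candidates if c.lower() == city_hint.lower()}
        if narrowed:
            return sorted(narrowed)
    # No usable hint: return every candidate so the caller can offer alternatives.
    return sorted(candidates)


if __name__ == "__main__":
    index = build_index(GAZETTEER)
    print(resolve("Blah St", "Foo", index))
    print(resolve("Blah St", None, index))
```

When more than one candidate comes back, that list is a natural thing to show the user as suggested alternatives; an empty list means the street is not in the data at all.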

Edit: GATE includes a tool called ANNIE, an information extraction system that you can experiment with to identify addresses. It uses some built-in JAPE rules that you can build upon.
