Address Splitting with NLP

Problem Description

I am currently working on a project that should identify each part of an address; for example, from "str. Jack London 121, Corvallis, ARAD, ap. 1603, 973130" the output should be like this:

street name: Jack London;
no: 121;
city: Corvallis;
state: ARAD;
apartment: 1603;
zip code: 973130

The problem is that not all of the input data is in the same format, so some elements may be missing or appear in a different order, but the input is guaranteed to be an address.

I checked some sources on the internet, but a lot of them are adapted to US addresses only - like the Google Places API - and I will be using this for another country.

Regex is not an option, since the addresses can vary too much.

I also thought about using an NLP Named Entity Recognition model, but I'm not sure that would work.

Do you know what could be a good way to start? Could you maybe help me with some tips?

Recommended Answer

There is a similar question on the Data Science Stack Exchange forum with only one answer, which suggests using SpaCy.
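As a quick way to see what an off-the-shelf model gives you, here is a minimal sketch with SpaCy's pretrained English pipeline (assuming the en_core_web_sm model is installed). Out of the box it only tags generic entity types such as GPE or CARDINAL, so recognising street names, apartment numbers and zip codes as separate fields would require training a custom NER component on labelled addresses:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

address = "str. Jack London 121, Corvallis, ARAD, ap. 1603, 973130"
doc = nlp(address)

# Only generic labels (GPE, CARDINAL, PERSON, ...) are available here;
# a custom model trained on labelled addresses is needed for real splitting.
for ent in doc.ents:
    print(ent.text, ent.label_)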

Another question, on detecting addresses using Stanford NLP, details a different approach to detecting addresses and their constituents.
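That question uses the Java-based Stanford tools; as a rough Python equivalent (an assumption on my part, using Stanza, the Python package from the Stanford NLP group, rather than the exact setup in that question), a sketch could look like this:

import stanza

# Assumes: pip install stanza (the download call fetches the English models once)
stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

doc = nlp("str. Jack London 121, Corvallis, ARAD, ap. 1603, 973130")

# As with SpaCy, only generic entity types are returned out of the box.
for ent in doc.ents:
    print(ent.text, ent.type)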

There is a LexNLP library that has a feature to detect and split addresses this way (snippet borrowed from a TowardsDataScience article on the library):

from lexnlp.extract.en.addresses import address_features

# d is a dict mapping filenames to raw document text
for filename, text in d.items():
    print(list(address_features.get_word_features(text)))

There is also a relatively new (2018) and "researchy" codebase, DeepParse (and its documentation), for deep-learning address parsing, accompanying an IEEE article (paywalled; also listed on Semantic Scholar).

For training you will need some large corpus of addresses, or fake addresses generated using, e.g., the Faker library.
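As a sketch of what that could look like, assuming Faker's default en_US locale and a made-up address template modelled on the example in the question (the template and the make_labelled_address helper are illustrative, not part of Faker):

import random
from faker import Faker

# Default locale is en_US; pass e.g. Faker("ro_RO") for another country,
# though not every locale implements every address field.
fake = Faker()

def make_labelled_address():
    """Build one synthetic address string plus the gold labels for each field."""
    street = fake.street_name()
    number = fake.building_number()
    city = fake.city()
    state = fake.state()
    apartment = str(random.randint(1, 2000))
    zip_code = fake.postcode()

    text = f"str. {street} {number}, {city}, {state}, ap. {apartment}, {zip_code}"
    labels = {
        "street name": street,
        "no": number,
        "city": city,
        "state": state,
        "apartment": apartment,
        "zip code": zip_code,
    }
    return text, labels

# Generate a few labelled training examples
for _ in range(3):
    text, labels = make_labelled_address()
    print(text, labels)

Dropping fields or shuffling their order in the template mimics the inconsistent real-world inputs described in the question.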
