How can I extract an address from raw text using NLTK in Python?


Question

I have this text:

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''

How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a Location. How can I solve this?

Recommended answer

Definitely regular expressions :)

Something like:

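import re

txt = ...  # the raw text from the question
# Raw string, so any backslashes added to the pattern later are not treated as string escapes
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
# findall returns every non-overlapping match as a list of strings
address = re.findall(regexp, txt)

# address = ['44 West 22nd Street, New York, NY 12345']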
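For a version that runs as-is, the snippet above can be filled in with the sample text from the question. This is only a sketch: the variable name addresses is an illustrative choice, and the pattern is the one from the answer, unchanged apart from the raw-string prefix.

```python
import re

# Sample text copied from the question.
txt = ("Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, "
       "New York, NY 12345. Can you contact him now? "
       "If you need any help, call me on 12345678")

# Same pattern as in the answer, written as a raw string.
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"

addresses = re.findall(regexp, txt)
print(addresses)  # ['44 West 22nd Street, New York, NY 12345']
```

The trailing phone number 12345678 is not picked up, because the pattern requires a space right after the leading 1 to 3 digits.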

Explanation:

[0-9]{1,3}: 1 to 3 digits, the address number

(space): a space between the number and the street name

.+: street name, one or more occurrences of any character (except a newline)

, (comma and space): a comma and a space before the city

.+: city, one or more occurrences of any character (except a newline)

, (comma and space): a comma and a space before the state

[A-Z]{2}: exactly 2 uppercase letters from A to Z, the state code

(space): a space between the state and the ZIP code

[0-9]{5}: exactly 5 digits, the ZIP code
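The pieces listed above can also be captured individually with named groups. The sketch below is a variation and not the answer's own pattern: the group names (number, street, city, state, zip) and the constant ADDRESS_RE are made up for illustration, and [^,]+ is swapped in for .+ on the assumption that street and city names contain no commas. That swap keeps a field from running across a comma when several addresses share one line.

```python
import re

# Hypothetical variation of the answer's pattern: one named group per part.
# [^,]+ (one or more non-comma characters) replaces the greedy .+.
ADDRESS_RE = re.compile(
    r"(?P<number>[0-9]{1,3}) "
    r"(?P<street>[^,]+), "
    r"(?P<city>[^,]+), "
    r"(?P<state>[A-Z]{2}) "
    r"(?P<zip>[0-9]{5})"
)

# Sample text copied from the question.
txt = ("Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, "
       "New York, NY 12345. Can you contact him now? "
       "If you need any help, call me on 12345678")

# finditer yields match objects; groupdict() maps group names to substrings.
parts = [m.groupdict() for m in ADDRESS_RE.finditer(txt)]
print(parts)
```

For the question's text this yields a single dictionary with number '44', street 'West 22nd Street', city 'New York', state 'NY' and zip '12345'.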

re.findall(expr, string) returns a list of all non-overlapping matches found; if nothing matches, the list is empty.
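To illustrate the return value, here is a small sketch with made-up input. The second address and the labels "Office" and "Warehouse" are invented for the example. Each address sits on its own line, which matters because . does not match a newline by default, so a match cannot spill over into the next line.

```python
import re

regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"

# Made-up sample: two addresses on separate lines, plus a line without one.
txt = ("Office: 44 West 22nd Street, New York, NY 12345\n"
       "Warehouse: 7 Elm Road, Springfield, IL 62704\n"
       "No address on this line.")

# One list entry per match, in the order they appear in the text.
matches = re.findall(regexp, txt)
for address in matches:
    print(address)

# With no match at all, findall returns an empty list rather than None.
no_match = re.findall(regexp, "No address here")
print(no_match)
```

Because the result is always a list, it can be looped over or tested for emptiness without a separate None check.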
