Creating rule-based matching with spaCy and Python for detecting addresses

Question

I started learning Python's spaCy library for NLP a few days ago. I want to create rule-based matching for detecting street addresses. Here are some example street names:

Esplanade 12
Fischerinsel 65
Esplanade 1
62 boulevard d'Alsace
80 avenue Ferdinand de Lesseps
73 avenue de Bouvines
41 Avenue des Pr'es
84 rue du Château
44 rue Sadi Carnot
Bernstrasse 324
Güntzelstrasse 6
80 Rue St Ferréol
75 rue des lieutemants Thomazo
87 cours Franklin Roosevelt
51 rue du Paillle en queue
16 Chemin Des Bateliers
65 rue Reine Elisabeth
91 rue Saint Germain
Grolmanstraße 41
Buelowstrasse 46
Waßmannsdorfer Chaussee 41
Sonnenallee 29
Gotthardstrasse 81
Augsburger Straße 65
Gotzkowskystrasse 41
Holstenwall 69
Leopoldstraße 40

So, street names are formed like this:

First type:

<string ending with 'strasse', 'gasse' or 'platz'> + <number> (a letter can be attached to the number, for example 34a)

Second type:

<number> + <'rue', 'avenue', 'platz', 'boulevard'> + <one or more strings>

Third type:

<titled string> + <number>

But the first two types cover 90% of cases. This is the code:

import spacy
from spacy.matcher import Matcher
from spacy import displacy

nlp = spacy.load("en_core_web_trf")
disable = ['ner']
pattern = ['<i do not know how to write contitions for this>']

matcher = Matcher(nlp.vocab)
matcher.add("STREET", [pattern])

text_testing1 = "I live in Güntzelstrasse 16 in Berlin"
text_testing2 = "Send that to 73 rue de Napoleon 56 in Paris"

doc = nlp(text_testing1)
result = matcher(doc)
print(result)

I do not know how to write a pattern for this kind of recognition, so I need help with that. The phrase needs to contain a number, and one of its strings must either be 'rue', 'avenue', 'platz' or 'boulevard', or end with 'strasse' or 'gasse'.

Recommended answer

Here's a very simple example that matches just things like "*strasse [number]":

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [
        {"TEXT": {"REGEX": ".*strasse$"}}, 
        {"IS_DIGIT": True}
        ]
matcher.add("ADDRESS", [pattern])

doc = nlp("I live in Güntzelstrasse 16 in Berlin")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

The key part is the pattern. By changing the pattern you can make it match more things. For example, to match tokens that end in not just strasse but also platz:

pattern = [
        {"TEXT": {"REGEX": ".*(strasse|platz)$"}}, 
        {"IS_DIGIT": True}
        ]
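
As far as I can tell, the Matcher applies a "REGEX" value to each token's text as a regular-expression search, so the suffix pattern can be tried out on plain strings with Python's re module first. The sketch below is only an illustration under that assumption. It also adds 'straße' and 'gasse' to the alternation, since the question's sample data contains spellings like "Grolmanstraße" that a plain 'strasse' suffix would miss. "Marktplatz" and "Badgasse" are made-up tokens, not taken from the question.

```python
import re

# Token-level suffix check, mirroring the "REGEX" value used in the pattern.
# Assumption: extended with 'straße' and 'gasse'; case is ignored here for
# convenience (in a Matcher pattern, matching on "LOWER" has a similar effect).
SUFFIX_RE = re.compile(r".*(strasse|straße|gasse|platz)$", re.IGNORECASE)

# "Marktplatz" and "Badgasse" are hypothetical examples.
tokens = ["Güntzelstrasse", "Grolmanstraße", "Holstenwall", "Marktplatz", "Badgasse"]

matched = [t for t in tokens if SUFFIX_RE.search(t)]
print(matched)
```

Names such as "Holstenwall" or "Sonnenallee" do not end in any of these suffixes, so they would need either more alternatives in the regex or a separate pattern.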

You can also add multiple patterns with the same label to get very different structures, like for your "rue de Napoleon" example.
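
A minimal sketch of that idea, covering the first two address types from the question under one label. Several things here are my own assumptions rather than part of the original answer: a blank English pipeline is used so that no model download is needed (the patterns rely only on lexical attributes), the list of French particles is a guess, the street name is assumed to consist of title-cased tokens, and `find_addresses` is a hypothetical helper name.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough, because the patterns
# below use only lexical attributes (no tagger, parser or NER needed).
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Type 1: "<...strasse|straße|gasse|platz> <number>", e.g. "Güntzelstrasse 16".
# The number may carry one trailing letter (e.g. "34a"), provided the
# tokenizer keeps it as a single token.
suffix_then_number = [
    {"LOWER": {"REGEX": r"(strasse|straße|gasse|platz)$"}},
    {"TEXT": {"REGEX": r"^\d+[a-zA-Z]?$"}},
]

# Type 2: "<number> <street type> <name>", e.g. "73 rue de Napoleon".
# Heuristic: optional lowercase particles, then one or more title-cased tokens.
number_then_type = [
    {"IS_DIGIT": True},
    {"LOWER": {"IN": ["rue", "avenue", "boulevard", "platz"]}},
    {"LOWER": {"IN": ["de", "du", "des", "la", "le"]}, "OP": "*"},
    {"IS_TITLE": True, "OP": "+"},
]

# Both patterns share the "ADDRESS" label; greedy="LONGEST" keeps only the
# longest of any overlapping matches produced by the "+" operator.
matcher.add("ADDRESS", [suffix_then_number, number_then_type], greedy="LONGEST")


def find_addresses(text):
    """Return the text of every ADDRESS match, in document order."""
    doc = nlp(text)
    matches = sorted(matcher(doc), key=lambda m: m[1])
    return [doc[start:end].text for _, start, end in matches]


print(find_addresses("I live in Güntzelstrasse 16 in Berlin"))
print(find_addresses("Send that to 73 rue de Napoleon 56 in Paris"))
```

The title-case heuristic is deliberately narrow: it stops at the first token that is not capitalized, so names like "rue du Paillle en queue" from the question would only be matched partially and would need a looser pattern.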

The Matcher has a lot of features. I really recommend reading through the docs and trying each of them out at least once.
