How to avoid NLTK's sentence tokenizer splitting on abbreviations?
Question
I'm currently using NLTK for language processing, but I've run into a problem with sentence tokenization.
Here's the problem: assume I have the sentence "Fig. 2 shows a U.S.A. map." When I use the Punkt tokenizer, my code looks like this:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['U.S.A', 'fig']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')
It returns this:
['Fig. 2 shows a U.S.A.', 'map.']
The tokenizer fails to detect the abbreviation "U.S.A.", but it works on "fig". Now when I use the default tokenizer NLTK provides:
import nltk
nltk.tokenize.sent_tokenize('Fig. 2 shows a U.S.A. map.')
This time I get:
['Fig.', '2 shows a U.S.A. map.']
It recognizes the more common "U.S.A." but fails to see "fig"!
How can I combine these two methods? I want to use the default abbreviation choices as well as add my own abbreviations.
Answer
I think using lowercase for u.s.a in the abbreviations list will work fine for you. Try this:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# Abbreviations are matched in lowercase, without the trailing period
abbreviation = ['u.s.a', 'fig']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')
It returns this for me:
['Fig. 2 shows a U.S.A. map.']