Find multi-word terms in a tokenized text in Python


Question

I have a text that I have tokenized (in general, a plain list of words would work as well). For example:

    >>> from nltk.tokenize import word_tokenize
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
    ... two of them.\n\nThanks.'''
    >>> word_tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
    'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

If I have a Python dict that contains single-word as well as multi-word keys, how can I efficiently and correctly check for their presence in the text? The ideal output would be key:location_in_text pairs, or something equally convenient. Thanks in advance!

P.S. To explain "correctly": if I have "lease" in my dict, I do not want "Please" to be marked. Recognizing plurals is also required. I am wondering whether this can be solved elegantly without many if-else clauses.

Recommended answer

If you already have a gazetteer (a list of multi-word expressions), you can use NLTK's MWETokenizer to merge each expression into a single token, e.g.:

    >>> from nltk.tokenize import MWETokenizer
    >>> from nltk import sent_tokenize, word_tokenize

    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
    ... two of them.\n\nThanks.'''

    >>> # Each multi-word expression is a tuple of tokens; matches are joined with the separator.
    >>> mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')

    >>> [mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]
    [['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
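
MWETokenizer merges known expressions but does not report where they occur, and it does not handle plurals. As a rough sketch of the key:location output the question asks for, the pure-Python code below slides a window over the token list and compares whole tokens, so "lease" never matches inside "Please". The names `normalize` and `find_terms` are hypothetical helpers, not part of NLTK, and the plural rule (dropping a trailing "s") is a crude assumption; a real lemmatizer such as NLTK's WordNetLemmatizer would likely be more accurate.

```python
from collections import defaultdict


def normalize(token):
    """Lowercase a token and strip a trailing 's' (a crude, assumed plural rule)."""
    word = token.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def find_terms(tokens, terms):
    """Map each term found in `tokens` to the token indices where it starts."""
    # Group the terms by length, keyed by their normalized token tuples.
    by_length = defaultdict(dict)
    for term in terms:
        key = tuple(normalize(part) for part in term.split())
        by_length[len(key)][key] = term

    normalized = [normalize(tok) for tok in tokens]
    hits = defaultdict(list)
    # For each term length, slide a window of that size over the tokens.
    for n, table in by_length.items():
        for start in range(len(normalized) - n + 1):
            window = tuple(normalized[start:start + n])
            if window in table:
                hits[table[window]].append(start)
    return dict(hits)


tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
          'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
terms = ['New York', 'muffin', 'lease', 'Hong Kong']
print(find_terms(tokens, terms))
```

With the tokens from the question this prints `{'New York': [6], 'muffin': [1]}`: the plural "muffins" is matched to "muffin", while "lease" and "Hong Kong" are absent. Locations are token indices, not character offsets.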
