Find multi-word terms in a tokenized text in Python


Problem description


I have a tokenized text, or, more generally, a list of words would be fine as well. For example:

>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88
... in New York.  Please buy me
... two of them.
...
... Thanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

If I have a Python dict that contains single-word as well as multi-word keys, how can I efficiently and correctly check for their presence in the text? The ideal output would be key:location_in_text pairs, or something similarly convenient. Thanks in advance!

P.S. To explain "correctly": if I have "lease" in my dict, I do not want "Please" to be marked as a match. Recognizing plurals is also required. I am wondering if this can be solved elegantly, without many if-else clauses.

Solution

If you already have a gazetteer list of multi-word expressions, you can use MWETokenizer, e.g.:

>>> from nltk.tokenize import MWETokenizer
>>> from nltk import sent_tokenize, word_tokenize

>>> s = '''Good muffins cost $3.88
... in New York.  Please buy me
... ... two of them.
...
... Thanks.'''

>>> mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')

>>> [mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.'], ['Please', 'buy', 'me', '...', 'two', 'of', 'them', '.'], ['Thanks', '.']]
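To also get the key:location_in_text pairs and the plural handling that the question asks for, one option is to post-process the merged tokens. Below is a minimal sketch that is not part of the original answer: the terms dict is a made-up example, its multi-word keys are assumed to use the same '_' separator the tokenizer was built with, and plurals are folded onto singular keys with NLTK's WordNet lemmatizer (which requires the wordnet corpus, plus punkt for the tokenizers).

from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer   # needs nltk.download('wordnet')
from nltk.tokenize import MWETokenizer

# Hypothetical term dictionary; multi-word keys use the same '_' separator
# that the MWETokenizer below is built with.
terms = {'New_York': 'place', 'muffin': 'food'}

mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
lemmatizer = WordNetLemmatizer()

s = '''Good muffins cost $3.88
in New York.  Please buy me
two of them.

Thanks.'''

locations = {}          # term -> list of token indices in the whole text
offset = 0              # running token index across sentences
for sent in sent_tokenize(s):
    tokens = mwe.tokenize(word_tokenize(sent))
    for i, token in enumerate(tokens):
        # Try the token as-is first (covers merged MWEs like 'New_York'),
        # then its lowercased noun lemma, so that plurals such as
        # 'muffins' map onto the singular key 'muffin'.
        key = token if token in terms else lemmatizer.lemmatize(token.lower())
        if key in terms:
            locations.setdefault(key, []).append(offset + i)
    offset += len(tokens)

print(locations)        # {'muffin': [1], 'New_York': [6]}

Because matching here is done on whole tokens rather than substrings, a dictionary entry like "lease" cannot fire inside "Please".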
