Tokenizing first and last name as one token
Problem description
Is it possible to tokenize a text such that the first and last name are combined into one token? For example, if my text is:
text = "Barack Obama is the President"
then:
text.split()
results in:
['Barack', 'Obama', 'is', 'the', 'President']
How can I recognize the first and last name, so that I get only ['Barack Obama', 'is', 'the', 'President'] as tokens?
Is there a way to achieve this in Python?
Recommended answer
What you are looking for is a named entity recognition (NER) system. I suggest you do not treat this as part of tokenization: tokenize first, then use the recognized entities to decide which tokens belong together.
For Python, you can use https://pypi.python.org/pypi/ner/
Example from the website:
>>> tagger.json_entities("Alice went to the Museum of Natural History.")
'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'