Improving the extraction of human names with nltk

Views: 30 Published: 2022/1/2 17:22:54 Tags: python nlp nltk

Question

I am trying to extract human names from text.

Does anyone have a method that they would recommend?

This is what I tried (code is below): I am using nltk to find everything marked as a person and then generating a list of all the NNP parts of that person. I am skipping persons where there is only one NNP which avoids grabbing a lone surname.

I am getting decent results but was wondering if there are better ways to go about solving this problem.

Code:

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    """Return the multi-token PERSON chunks found by nltk's named-entity chunker."""
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)
    person_list = []
    # NLTK 3 exposes the chunk type as t.label(); NLTK 2 used the t.node attribute
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        person = [leaf[0] for leaf in subtree.leaves()]
        if len(person) > 1:  # avoid grabbing lone surnames
            name = ' '.join(person)
            if name not in person_list:  # skip names already collected
                person_list.append(name)

    return person_list

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print("LAST, FIRST")
for name in names:
    parsed = HumanName(name)  # parse once, then reuse both parts
    print(parsed.last + ', ' + parsed.first)

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic isn't a human name in the context of this article is the hard (maybe impossible) part.
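
One cheap heuristic for false positives like this is to keep only candidates whose first token is a known given name. The following is a minimal sketch under stated assumptions, not a general solution: `filter_by_given_name` and `KNOWN_GIVEN_NAMES` are hypothetical names, the given-name set is a tiny hand-picked stand-in for a real list (for example the `nltk.corpus.names` corpus, which needs its own data download), and the candidate strings are assumed to look like what `get_human_names` returns.

```python
# Hypothetical, deliberately tiny given-name list used only for illustration.
KNOWN_GIVEN_NAMES = {"francois", "richard", "paul", "larry", "nick"}

def filter_by_given_name(candidates, given_names=KNOWN_GIVEN_NAMES):
    """Keep only candidates whose first token is a known given name."""
    kept = []
    for candidate in candidates:
        parts = candidate.split()
        if parts and parts[0].lower() in given_names:
            kept.append(candidate)
    return kept

# Assumed candidate strings, modelled on the output listed above.
candidates = ["Francois R. Velde", "Richard Branson", "Virgin Galactic",
              "Paul Krugman", "Larry Summers", "Nick Colas"]
print(filter_by_given_name(candidates))
```

The obvious trade-off is recall: any real person whose given name is missing from the list is dropped, so the list has to be large, and a company named after a person would still get through.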

Recommended answer

I must agree with the suggestion that "make my code better" isn't well suited for this site, but I can point you to something worth digging into.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v 2.0, but you must download some core files. Here is a script which can do all of that for you.

I wrote this script:

import nltk
# NLTK 3 renamed NERTagger to StanfordNERTagger; on NLTK 2 import NERTagger instead
from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)  # list of (word, tag) pairs
    for tag in tags:
        if tag[1] == 'PERSON':
            print(tag)

and got output that is not so bad:

('Francois', 'PERSON') ('R.', 'PERSON') ('Velde', 'PERSON') ('Richard', 'PERSON') ('Branson', 'PERSON') ('Virgin', 'PERSON') ('Galactic', 'PERSON') ('Bitcoin', 'PERSON') ('Bitcoin', 'PERSON') ('Paul', 'PERSON') ('Krugman', 'PERSON') ('Larry', 'PERSON') ('Summers', 'PERSON') ('Bitcoin', 'PERSON') ('Nick', 'PERSON') ('Colas', 'PERSON')
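
The tagger labels each token separately, so a full name arrives as several tuples. Below is a minimal sketch for joining adjacent PERSON tokens back into names, assuming the tags come back as a list of (word, tag) pairs as in the output above. `group_person_tokens` is a hypothetical helper name, and the sample list is hand-written for illustration rather than real tagger output.

```python
from itertools import groupby

def group_person_tokens(tagged_tokens):
    """Join runs of adjacent PERSON-tagged tokens into full-name strings."""
    names = []
    # groupby splits the sequence into consecutive runs of PERSON / non-PERSON tokens
    for is_person, run in groupby(tagged_tokens, key=lambda pair: pair[1] == 'PERSON'):
        if is_person:
            names.append(' '.join(word for word, _ in run))
    return names

# Hand-written sample in the (word, tag) shape; not real tagger output.
sample = [('Economist', 'O'), ('Paul', 'PERSON'), ('Krugman', 'PERSON'),
          ('and', 'O'), ('Larry', 'PERSON'), ('Summers', 'PERSON'),
          ('disagree', 'O'), ('.', 'O')]
print(group_person_tokens(sample))  # ['Paul Krugman', 'Larry Summers']
```

Note that two different people mentioned back to back with no token between them would be merged into one string, and mislabelled tokens such as Bitcoin would still come through as one-word names.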

Hope this helps.
