使用python和nltk从文本文件中提取候选名称 [英] extract the name of candidate from text file using python and nltk
本文介绍了使用python和nltk从文本文件中提取候选名称的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
import re
import spacy
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet
inputfile = open('inputfile.txt', 'r')
String= inputfile.read()
nlp = spacy.load('en_core_web_sm')
def candidate_name_extractor(input_string, nlp):
input_string = str(input_string)
doc = nlp(input_string)
# Extract entities
doc_entities = doc.ents
# Subset to person type entities
doc_persons = filter(lambda x: x.label_ == 'PERSON', doc_entities)
doc_persons = filter(lambda x: len(x.text.strip().split()) >= 2, doc_persons)
doc_persons = list(map(lambda x: x.text.strip(), doc_persons))
print(doc_persons)
# Assuming that the first Person entity with more than two tokens is the candidate's name
candidate_name = doc_persons[0]
return candidate_name
if __name__ == '__main__':
names = candidate_name_extractor(String, nlp)
print(names)
我想从文本文件中提取候选名称,但是返回错误的值.当我用地图删除列表时,地图也无法正常工作并给出错误
I want to extract the name of candidate from text file, but it returns the wrong value. when i remove list with map then map is also not working and gives the error
推荐答案
import re
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet
String = 'Ravana was killed in a war'
Sentences = nltk.sent_tokenize(String)
Tokens = []
for Sent in Sentences:
Tokens.append(nltk.word_tokenize(Sent))
Words_List = [nltk.pos_tag(Token) for Token in Tokens]
Nouns_List = []
for List in Words_List:
for Word in List:
if re.match('[NN.*]', Word[1]):
Nouns_List.append(Word[0])
Names = []
for Nouns in Nouns_List:
if not wordnet.synsets(Nouns):
Names.append(Nouns)
print (Names)
检查此代码.我得到Ravana
作为输出.
Check this code. I am getting Ravana
as output.
我使用简历中的几句话来创建一个文本文件,并将其作为程序输入.下面仅显示代码的更改部分:
I used a few sentences from my resume to create a text file, and gave it as input to my program. Only the changed portion of the code is shown below:
import io
File = io.open("Documents\\Temp.txt", 'r', encoding = 'utf-8')
String = File.read()
String = re.sub('[/|.|@|%|\d+]', '', String)
它将返回不在wordnet
语料库中的所有姓名,例如我的名字,我的房屋名称,地点,大学名称和地点.
And it is returning all the names that are not in the wordnet
corpus, like my name, my house name, place, college name and place.
这篇关于使用python和nltk从文本文件中提取候选名称的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文