How to use spaCy to do named entity recognition on a CSV file


Question

I have tried many things to do named entity recognition on a column in my CSV file. I tried nltk.ne_chunk, but I am unable to get the result of ne_chunk into columns like so:

ID  STORY                                       PERSON  NE   NP  NN VB  GE
1   Washington, a police officer James...        1      0    0   0   0   1

After running this code:

news = pd.read_csv("news.csv")
news['tokenize'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['tokenize']), axis=1)
news['entityrecog'] = news.apply(lambda row: nltk.ne_chunk(row['pos_tags']), axis=1)
tag_count_df = pd.DataFrame(news['entityrecognition'].map(lambda x: Counter(tag[1] for tag in x)).to_list())
news = pd.concat([news, tag_count_df], axis=1).fillna(0).drop(['entityrecognition'], axis=1)
news.to_csv("news.csv")

I got this error:

IndexError : list index out of range

So I am wondering if I could do this using spaCy, which is another thing I have no clue about. Can anyone help?

Recommended answer

It seems that you are checking the chunks incorrectly; that's why you get a key error. I'm guessing a little about what you want to do, but this creates new columns for each NER type returned by NLTK. It would be a little cleaner to predefine and zero out a column for each NER type, since this approach gives you NaN for NER types that don't occur.

import nltk
import pandas as pd
from collections import Counter

def extract_ner_count(tagged):
    entities = {}
    chunks = nltk.ne_chunk(tagged)
    for chunk in chunks:
        if type(chunk) is nltk.Tree:
            # if you don't need the entity strings, just count the label directly
            t = ''.join(c[0] for c in chunk.leaves())
            entities[t] = chunk.label()
    return Counter(entities.values())

news=pd.read_csv("news.csv")
news['tokenize'] = news.apply(lambda row: nltk.word_tokenize(row['STORY']), axis=1)
news['pos_tags'] = news.apply(lambda row: nltk.pos_tag(row['tokenize']), axis=1)
news['entityrecognition']=news.apply(lambda row: extract_ner_count(row['pos_tags']), axis=1)
news = pd.concat([news, pd.DataFrame(list(news["entityrecognition"]))], axis=1)

print(news.head())
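The key step in the answer above is expanding a column of per-row `Counter` objects into one numeric column per NER type. Here is an isolated sketch of just that step with hand-made Counters (the values are made up, not real `nltk.ne_chunk` output), so the mechanics are visible without any NLTK models:

```python
from collections import Counter

import pandas as pd

# Hand-made Counters standing in for per-row NER counts.
news = pd.DataFrame({
    "ID": [1, 2],
    "entityrecognition": [
        Counter({"PERSON": 2, "GPE": 1}),
        Counter({"ORGANIZATION": 1}),
    ],
})

# Counter is a dict subclass, so pandas expands a list of Counters
# into columns keyed by entity type; missing types become NaN.
tag_count_df = pd.DataFrame(list(news["entityrecognition"])).fillna(0).astype(int)
news = pd.concat([news, tag_count_df], axis=1)
print(news)
```

Row 1 ends up with PERSON=2, GPE=1, ORGANIZATION=0 and row 2 the opposite, which is exactly the wide count layout the question's table sketches.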

If all you want is the counts, the following is more performant and doesn't produce NaNs:

import nltk
import pandas as pd
from collections import Counter

tagger = nltk.PerceptronTagger()
# _MULTICLASS_NE_CHUNKER is NLTK's (private) path to its built-in multiclass NE chunker
chunker = nltk.data.load(nltk.chunk._MULTICLASS_NE_CHUNKER)
NE_Types = {'GPE', 'ORGANIZATION', 'LOCATION', 'GSP', 'O', 'FACILITY', 'PERSON'}

def extract_ner_count(text):
    c = Counter()
    chunks = chunker.parse(tagger.tag(nltk.word_tokenize(text,preserve_line=True)))
    for chunk in chunks:
        if type(chunk) is nltk.Tree:
            c.update([chunk.label()])
    return c

news=pd.read_csv("news.csv")
for NE_Type in NE_Types:
    news[NE_Type] = 0
news.update(list(news["STORY"].apply(extract_ner_count)))

print(news.head())
