Repeated entity labels when replacing entities with their labels using spaCy

Question

Code:

import spacy
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

#substitute entities with their TAGS
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# write to file
with open("./out_try.txt","w") as f:
    f.write("\n".join(out))

Contents of the input file:

Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion

Contents of the output file:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY

I need to avoid this repetition in the sentences above. For example, in sentence 2, 'PERSON PERSON PERSON' should become a single PERSON. Thanks.

Recommended answer

Let's try:

import spacy
# spaCy v2 import; in spaCy v3 the equivalent is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

docs = nlp.pipe(texts)
out_text = ""
for doc in docs:
    # character offsets and label of every entity span in the doc
    offsets = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    # one BILUO tag per token, e.g. "B-PERSON", "I-PERSON", "L-PERSON", "U-GPE", "O"
    tags = biluo_tags_from_offsets(doc, offsets)
    out = []
    for tok, tag in zip(doc, tags):
        prefix = tag.split("-")[0]
        if prefix == "O":
            # not part of an entity: keep the token text
            out.append(tok.text + tok.whitespace_)
        elif prefix in ("U", "L"):
            # single-token entity or last token of an entity: emit the label once
            out.append(tok.ent_type_ + tok.whitespace_)
        # "B" and "I" tokens are skipped so a multi-token entity yields one label
    out_text += "".join(out) + "\n"

with open("out_try.txt","w") as f:
    f.write(out_text)

Contents of the output file:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON is here with PERSON and PERSON.
ORG is looking at buying GPE startup for MONEY
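
As a possible alternative to working with BILUO tags, the same result can be sketched by splicing labels into the original string using the character offsets of each entity span. The helper name `replace_spans_with_labels` is hypothetical and not part of spaCy. The spans below are located with `str.index` as a stand-in for model output; with spaCy they could presumably be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`, the same triples the answer collects. The sketch assumes the spans do not overlap, which holds for `doc.ents`.

```python
def replace_spans_with_labels(text, spans):
    """Replace each (start_char, end_char, label) span in text with its label.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(label)               # one label for the whole span
        cursor = end
    pieces.append(text[cursor:])           # text after the last entity
    return "".join(pieces)


sentence = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."

# Stand-in for model output: locate the names by hand instead of running spaCy.
names = ["Nicolas J. Smith", "Bart Simpon", "Fred"]
spans = [(sentence.index(n), sentence.index(n) + len(n), "PERSON") for n in names]

result = replace_spans_with_labels(sentence, spans)
print(result)
```

Because each entity span is replaced as a whole, a multi-token entity yields a single label, and the whitespace between entities is preserved by slicing the original text rather than re-joining tokens.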
