How to identify abbreviations/acronyms and expand them in spaCy?

Question

I have a large (~50k) term list, and many of these key phrases/terms have corresponding acronyms/abbreviations. I need a fast way of finding either the abbreviation or its expanded form (e.g. MS -> Microsoft) and then replacing it with the expanded form plus the abbreviation (e.g. Microsoft -> Microsoft (MS) or MS -> Microsoft (MS)).

I am very new to spaCy, so my naive approach was going to be to use spacy_lookup with both the abbreviation and the expanded form as keywords, and then use some kind of pipeline extension to go through the matches and replace each one with the expanded form plus the abbreviation.

Is there a better way of tagging and resolving acronyms/abbreviations in spaCy?

Recommended answer

Check out scispacy on GitHub, which implements the acronym identification heuristic described in this paper (see also here). The heuristic works if acronyms are "introduced" in the text with a pattern like:

StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!
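
To make the "introduced" pattern concrete, here is a minimal pure-Python sketch of that kind of heuristic: it looks for a short form in parentheses and matches its characters right to left against the words just before it, in the spirit of the Schwartz and Hearst algorithm. It is a simplified illustration, not scispacy's actual implementation; the function names, the regex, and the word-window rule are assumptions made for this example.

```python
import re

def find_long_form(short, preceding):
    """Match the short form's characters right to left inside the preceding text."""
    l_idx = len(preceding) - 1
    for s_idx in range(len(short) - 1, -1, -1):
        ch = short[s_idx].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while l_idx >= 0 and (
            preceding[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and preceding[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None  # ran out of text: no plausible long form
        l_idx -= 1
    # Extend the long form back to the start of the word where matching ended.
    start = preceding.rfind(" ", 0, l_idx + 1) + 1
    return preceding[start:]

def find_definitions(text):
    """Return {short form: long form} for every 'Long form (SF)' introduction found."""
    definitions = {}
    for match in re.finditer(r"\(([A-Za-z][\w.-]{1,9})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        # Assumed window: only look at a few words before the parenthesis.
        max_words = min(len(short) + 5, len(short) * 2)
        long_form = find_long_form(short, " ".join(words[-max_words:]))
        if long_form:
            definitions[short] = long_form
    return definitions

print(find_definitions("StackOverflow (SO) is a question and answer site. SO rocks!"))
```

A later bare "SO" is not found by this sketch itself; once the definition is known, the remaining occurrences of the short form can be located with an ordinary search.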

A working way to replace all acronyms in a piece of text with their long form could then be:

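import spacy
# Importing this module registers the "abbreviation_detector" factory with spaCy.
from scispacy.abbreviation import AbbreviationDetector  # noqa: F401

nlp = spacy.load("en_core_web_sm")

# spaCy v3+: add the component by its registered name.
# On spaCy v2 this was: nlp.add_pipe(AbbreviationDetector(nlp))
nlp.add_pipe("abbreviation_detector")

text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"

def replace_acronyms(text):
    doc = nlp(text)
    # Keep each token's trailing whitespace so the original spacing survives.
    altered_tok = [tok.text_with_ws for tok in doc]
    for abrv in doc._.abbreviations:
        # Write the long form into the first token of the short form and blank
        # out any remaining tokens, so multi-token short forms are handled too.
        altered_tok[abrv.start] = str(abrv._.long_form) + doc[abrv.end - 1].whitespace_
        for i in range(abrv.start + 1, abrv.end):
            altered_tok[i] = ""
    return "".join(altered_tok)

# Note: the short form inside the introducing parentheses is replaced as well.
print(replace_acronyms(text))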
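
If the short and long forms are already known from a term list, as in the question, the detection step may not be needed at all. The following is a minimal standard-library sketch, assuming the list can be loaded as a short form -> long form mapping; `ABBREVIATIONS` and `build_expander` are hypothetical names, and matching is case-sensitive. It rewrites either form to "Long (SHORT)" and leaves text that is already in that form untouched.

```python
import re

# Hypothetical sample of the term list: short form -> long form.
ABBREVIATIONS = {"MS": "Microsoft", "NLP": "natural language processing"}

def build_expander(abbreviations):
    """Return a function that rewrites short or long forms to 'Long (SHORT)'."""
    replacements = {}
    for short, long_form in abbreviations.items():
        combined = f"{long_form} ({short})"
        replacements[short] = combined
        replacements[long_form] = combined
        # Map the combined form to itself so it is never expanded twice.
        replacements[combined] = combined
    # Longest alternatives first, so "Microsoft (MS)" wins over "Microsoft".
    alternatives = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)"
    )

    def expand(text):
        return pattern.sub(lambda m: replacements[m.group(0)], text)

    return expand

expand = build_expander(ABBREVIATIONS)
print(expand("MS and Microsoft (MS) both appear; NLP is fun."))
```

The pattern is compiled once and reused for every document. For a list of around 50k terms a single large alternation may become slow, in which case spaCy's PhraseMatcher, built from the same mapping, is an alternative worth trying.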
