How can I best determine the correct capitalization for a word?

Question

I have a database of sentences written entirely in capital letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what a user would expect. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?

Answer

One way could be to infer capitalization from POS-tagging, for example using the Python Natural Language Toolkit (NLTK):

import re

import nltk

# NLTK's tokenizer and POS-tagger models must be installed (see nltk.download()).

def truecase(text):
    truecased_sents = []  # list of truecased sentences
    for sent in nltk.sent_tokenize(text):
        # apply POS-tagging to the lowercased tokens of this sentence
        tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(sent)])
        # infer capitalization from POS-tags: capitalize tokens tagged as nouns
        normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
        # capitalize first word in sentence
        normalized_sent[0] = normalized_sent[0].capitalize()
        # use regular expression to remove the space before punctuation
        truecased_sents.append(re.sub(r" (?=[.,'!?:;])", "", " ".join(normalized_sent)))
    return " ".join(truecased_sents)

This will not be perfect, especially because I don't know exactly what your data looks like, but maybe you can get the idea:

>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."
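
To see the tag-based rule in isolation, without installing any tagger models, here is a minimal sketch that applies it to tokens that are already tagged. The helper name apply_case_rule, its capitalize_tags parameter, and the hand-written tags are assumptions made for illustration; they are not part of NLTK, and a real tagger may assign different tags to the same words.

```python
import re

# Tags whose tokens get capitalized; ("NN", "NNS") mirrors the rule used above.
NOUN_TAGS = ("NN", "NNS")


def apply_case_rule(tagged_tokens, capitalize_tags=NOUN_TAGS):
    """Rebuild one sentence from (word, tag) pairs using the tag-based rule."""
    words = [w.capitalize() if t in capitalize_tags else w.lower()
             for w, t in tagged_tokens]
    if words:
        # The first word of a sentence is always capitalized.
        words[0] = words[0].capitalize()
    # Drop the space that joining leaves in front of punctuation.
    return re.sub(r" (?=[.,'!?:;])", "", " ".join(words))


# Hand-written tags for illustration only; a real tagger may assign others.
tagged = [("it", "PRP"), ("is", "VBZ"), ("the", "DT"), ("generic", "JJ"),
          ("equivalent", "NN"), ("of", "IN"), ("klonopin", "NN"), (".", ".")]

print(apply_case_rule(tagged))
# It is the generic Equivalent of Klonopin.
print(apply_case_rule(tagged, capitalize_tags=("NNP", "NNPS")))
# It is the generic equivalent of klonopin.
```

The first call reproduces the kind of error visible in the output above: a common noun such as "equivalent" is capitalized whenever it is tagged NN. The second call restricts the rule to the Penn Treebank proper-noun tags, which only helps if the tagger actually recognizes names in lowercased input. Note also that str.capitalize() lowercases everything after the first character, so an acronym such as "DNA" would come out as "Dna".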
