How can I best determine the correct capitalization for a word?


Question

I have a database of sentences written entirely in uppercase letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what a user would expect. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?

Recommended answer

One approach is to infer capitalization from part-of-speech (POS) tags, for example using the Python Natural Language Toolkit (NLTK):

import re
import nltk  # needs NLTK's tokenizer and POS-tagger data (see nltk.download)

def truecase(text):
    truecased_sents = []  # list of truecased sentences
    for sent in nltk.sent_tokenize(text):
        # apply POS-tagging to the lowercased tokens of this sentence
        tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(sent)])
        if not tagged_sent:
            continue
        # infer capitalization from POS-tags: capitalize tokens tagged NN or NNS
        normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
        # capitalize first word in sentence
        normalized_sent[0] = normalized_sent[0].capitalize()
        # use regular expression to remove the space the join puts before punctuation
        truecased_sents.append(re.sub(r" (?=[.,'!?:;])", "", " ".join(normalized_sent)))
    return " ".join(truecased_sents)

This will not be perfect, especially because I don't know exactly what your data looks like, but it should give you the idea:

>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."
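Since the question also asks about datasets: if you can get hold of any correctly cased text from the same domain, a complementary approach is to learn the most common casing of each word from it and apply that as a lookup. The following is only a minimal sketch using the standard library. The function names `build_case_lexicon` and `truecase_with_lexicon` and the tiny reference text are made up for illustration, and real data would need a much larger reference corpus and better tokenization.

```python
import re
from collections import Counter, defaultdict

WORD_RE = re.compile(r"[A-Za-z]+")  # simplistic word pattern (assumption)

def build_case_lexicon(reference_text):
    """Map each lowercased word to its most frequent casing in the reference text."""
    counts = defaultdict(Counter)
    for sentence in re.split(r"(?<=[.!?])\s+", reference_text):
        words = WORD_RE.findall(sentence)
        # Skip the sentence-initial word: its capital says nothing about the word itself.
        for word in words[1:]:
            counts[word.lower()][word] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}

def truecase_with_lexicon(text, lexicon):
    """Restore known casings, lowercase unknown words, capitalize sentence starts."""
    def restore(match):
        lower = match.group(0).lower()
        return lexicon.get(lower, lower)
    result = WORD_RE.sub(restore, text)
    # Capitalize a lowercase letter at the start of the text or after sentence-ending punctuation.
    return re.sub(r"(^|[.!?]\s+)([a-z])", lambda m: m.group(1) + m.group(2).upper(), result)

# Hypothetical reference text standing in for a real, correctly cased corpus.
reference = ("The patient was given Klonopin. Tablets made by Roche are common. "
             "Doctors prescribe Klonopin and other drugs from Roche.")
lexicon = build_case_lexicon(reference)
print(truecase_with_lexicon("KLONOPIN IS MADE BY ROCHE. IT IS SOLD IN 0.5MG TABLETS.", lexicon))
```

Words that never appear in the reference text simply fall back to lowercase, so the two approaches can be combined: use the lexicon where it has an entry and the POS-based guess elsewhere.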
