How can I best determine the correct capitalization for a word?
Problem description
I have a database containing sentences that contain only capital letters. The database is technical and contains medical terms, and I want to normalize it so that the capitalization is (close to) what the user expects. What is the best way to achieve this? Is there a freely available dataset I can use to help with the process?
One way could be to infer capitalization from POS-tagging, for example using the Python Natural Language Toolkit (NLTK):
import re
import nltk  # needs NLTK's tokenizer and tagger data (install with nltk.download)

def truecase(text):
    truecased_sents = []  # list of truecased sentences
    for sent in nltk.sent_tokenize(text):
        # apply POS-tagging to the lowercased tokens of this sentence
        tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(sent)])
        # infer capitalization from POS-tags: capitalize tokens tagged as nouns
        normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
        # capitalize first word in sentence
        normalized_sent[0] = normalized_sent[0].capitalize()
        # use regular expression to remove the space before punctuation
        truecased_sents.append(re.sub(r" (?=[.,'!?:;])", "", " ".join(normalized_sent)))
    return " ".join(truecased_sents)
This will not be perfect, especially because I don't know exactly what your data looks like, but maybe you can get the idea:
>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."