Python - How to intuit word from abbreviated text using NLP?


Question

I was recently working on a data set that used abbreviations for various words. For example,

wtrbtl = water bottle
bwlingbl = bowling ball
bsktball = basketball

There did not seem to be any consistency in terms of the convention used, i.e. sometimes they used vowels, sometimes not. I am trying to build a mapping object like the one above for abbreviations and their corresponding words, without a complete corpus or comprehensive list of terms (i.e. abbreviations could be introduced that are not explicitly known). For simplicity's sake, say it is restricted to stuff you would find in a gym, but it could be anything.

Basically, if you only look at the left-hand side of the examples, what kind of model could do the same processing as our brain in terms of relating each abbreviation to the corresponding full-text label?

My ideas have stopped at taking the first and last letter and finding those in a dictionary, then assigning a priori probabilities based on context. But since there are a large number of morphemes without a marker that indicates the end of a word, I don't see how it's possible to split them.
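
For illustration, a minimal sketch of that first/last-letter lookup might look like the following (the vocabulary list is made up; a real dictionary or word list would stand in for it):

# Minimal sketch of the first/last-letter lookup idea.
# The vocabulary is a made-up stand-in for a real word list.
vocabulary = ["water", "bottle", "bowling", "ball", "basketball", "barbell"]

def candidates_by_ends(abbrev, vocab):
    """Return vocabulary words sharing the abbreviation's first and last letter."""
    return [w for w in vocab if w[0] == abbrev[0] and w[-1] == abbrev[-1]]

print(candidates_by_ends("bsktball", vocabulary))
# -> ['ball', 'basketball', 'barbell'] -- context is then needed to rank these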

UPDATED:

I also had the idea to combine a couple of string metric algorithms, like a Match Rating Algorithm, to determine a set of related terms and then calculate the Levenshtein distance between each word in the set and the target abbreviation. However, I am still in the dark when it comes to abbreviations for words not in a master dictionary. Basically, inferring word construction - maybe a Naive Bayes model could help, but I am concerned that any error in precision caused by the algorithms above would invalidate any model training process.
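
As a rough sketch of that two-stage idea, the jellyfish library provides both a Match Rating Approach comparison and Levenshtein distance (the candidate list below is illustrative):

# Rough sketch: filter candidates with the Match Rating Approach, then rank
# survivors by Levenshtein distance. Requires: pip install jellyfish
import jellyfish

candidates = ["bottle", "bowling", "basketball", "barbell"]

def rank_by_similarity(abbrev, terms):
    # match_rating_comparison returns None (falsy) when the two strings'
    # lengths differ too much to compare, so those drop out here as well.
    related = [t for t in terms if jellyfish.match_rating_comparison(abbrev, t)]
    return sorted(related, key=lambda t: jellyfish.levenshtein_distance(abbrev, t))

print(rank_by_similarity("bsktball", candidates))  # e.g. ['basketball']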

Any help is appreciated, as I am really stuck on this one.

Answer

If you cannot find an exhaustive dictionary, you could build (or download) a probabilistic language model to generate and evaluate sentence candidates for you. It could be a character n-gram model or a neural network.
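
A character n-gram model of the kind meant here can be sketched in a few lines (an illustrative toy, not the answer's actual code; the 27-symbol alphabet and add-alpha smoothing are assumptions):

# Toy character n-gram language model: n-gram counts from a training text
# give log-probabilities for scoring candidate phrases.
import math
from collections import Counter, defaultdict

def train_char_lm(text, order=5):
    """Count (context -> next character) occurrences in the training text."""
    padded = " " * (order - 1) + text
    counts = defaultdict(Counter)
    for i in range(order - 1, len(padded)):
        counts[padded[i - order + 1:i]][padded[i]] += 1
    return counts

def log_prob(phrase, counts, order=5, alpha=0.1):
    """Log-probability of a phrase under the model, with add-alpha smoothing
    (27 assumes a lowercase a-z plus space alphabet)."""
    padded = " " * (order - 1) + phrase
    total = 0.0
    for i in range(order - 1, len(padded)):
        seen = counts[padded[i - order + 1:i]]
        total += math.log((seen[padded[i]] + alpha) / (sum(seen.values()) + 27 * alpha))
    return total

# counts = train_char_lm(corpus_text)   # corpus_text: any large plain text
# log_prob("basket ball", counts)       # higher (less negative) = more fluent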

For your abbreviations, you can build a "noise model" which predicts the probability of character omissions. It can learn from a corpus (which you have to label manually or semi-manually) that consonants are omitted less frequently than vowels.
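
A sketch of how such a noise model could be estimated from labelled pairs (the two training pairs are illustrative, and the in-order subsequence alignment is an assumption about how the abbreviations are formed):

# Estimate per-character omission probabilities from (full, abbreviated)
# pairs, assuming an abbreviation keeps its characters in order.
from collections import Counter

def omission_probs(pairs):
    kept, dropped = Counter(), Counter()
    for full, abbrev in pairs:
        j = 0
        for ch in full:
            if j < len(abbrev) and abbrev[j] == ch:
                kept[ch] += 1     # character survived into the abbreviation
                j += 1
            else:
                dropped[ch] += 1  # character was omitted
    return {c: dropped[c] / (dropped[c] + kept[c]) for c in kept | dropped}

probs = omission_probs([("waterbottle", "wtrbtl"), ("basketball", "bsktball")])
# Even on two pairs, vowels come out far more droppable than consonants.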

Having a complex language model and a simple noise model, you can combine them using the noisy channel approach (see e.g. the article by Jurafsky for more details) to suggest candidate sentences.
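
In log space that combination is just a sum. A sketch of the channel side, reusing log_prob() and omission_probs() from the sketches above (the independent keep/drop assumption, the greedy matching, and the clamping are simplifications):

# Noisy channel in log space:
#   score(candidate) = log P(candidate) + log P(abbreviation | candidate)
import math

def channel_log_prob(abbrev, candidate, probs, default=0.3):
    """log P(abbrev | candidate), assuming each character is independently
    kept or dropped with the probabilities learned by omission_probs()."""
    j, total = 0, 0.0
    for ch in candidate:
        p_drop = min(max(probs.get(ch, default), 0.01), 0.99)  # avoid log(0)
        if j < len(abbrev) and abbrev[j] == ch:
            total += math.log(1.0 - p_drop)  # character kept
            j += 1
        else:
            total += math.log(p_drop)        # character omitted
    # the candidate must account for every abbreviation character
    return total if j == len(abbrev) else float("-inf")

def noisy_channel_score(abbrev, candidate, counts, probs):
    return log_prob(candidate, counts) + channel_log_prob(abbrev, candidate, probs)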

Update. I got enthusiastic about this problem and implemented this algorithm:

  • a language model (character 5-grams, trained on the text of The Lord of the Rings)
  • a noise model (the probability of each symbol being abbreviated)
  • a beam search algorithm for candidate phrase suggestion (a minimal sketch follows this list)
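
A minimal beam-search sketch (not the notebook's exact code; score_fn is assumed to score partial phrases, e.g. the language-model score plus a channel score that tolerates unconsumed abbreviation characters):

# Grow candidate phrases one character at a time, keeping only the
# beam_width best-scoring partial phrases at each step.
import heapq
import string

ALPHABET = string.ascii_lowercase + " "

def beam_search(score_fn, beam_width=10, max_len=20):
    """score_fn(prefix) -> log-score of a partial candidate phrase."""
    beam = [""]
    best = ("", float("-inf"))
    for _ in range(max_len):
        expanded = (prefix + ch for prefix in beam for ch in ALPHABET)
        beam = heapq.nlargest(beam_width, expanded, key=score_fn)
        if score_fn(beam[0]) > best[1]:
            best = (beam[0], score_fn(beam[0]))
    return best[0]

# e.g. beam_search(lambda p: partial_score("bsktball", p))  # hypothetical scorer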

My solution is implemented in this Python notebook. With trained models, it has an interface like noisy_channel('bsktball', language_model, error_model), which, by the way, returns {'basket ball': 33.5, 'basket bally': 36.0}. The dictionary values are scores of the suggestions (the lower, the better).

With other examples it works worse: for 'wtrbtl' it returns

{'water but all': 23.7, 
 'water but ill': 24.5,
 'water but lay': 24.8,
 'water but let': 26.0,
 'water but lie': 25.9,
 'water but look': 26.6}

对于"bwlingbl",它给出

For 'bwlingbl' it gives

{'bwling belia': 32.3,
 'bwling bell': 33.6,
 'bwling below': 32.1,
 'bwling belt': 32.5,
 'bwling black': 31.4,
 'bwling bling': 32.9,
 'bwling blow': 32.7,
 'bwling blue': 30.7}

However, when trained on an appropriate corpus (e.g. sports magazines and blogs, maybe with oversampling of nouns), and maybe with a more generous beam search width, this model will provide more relevant suggestions.
