How to handle slang words and short forms in tweets, like luv, kool and brb?

Question

I am preprocessing tweets using Python. However, many of the words used are short forms of other words, like luv, kool, etc. There are also abbreviations like brb, ttyl, etc.

Right now, I can only think of having a huge hashmap with the short forms as keys and the actual words or expansions as values. Is there a better way to approach this using NLP?

NOTE: I know the question seems too vague, but please don't report it. I have asked it so that amateurs can benefit from this knowledge.

PS: Is there a nicely formatted text list that I can download and use? The links posted are good, but when I copy and paste them, they are not in an easily parsable format.

Recommended answer

The only way to decipher abbreviations is to use external resources. That is why there are so many dictionaries of abbreviations for humans. Humans can sometimes guess a meaning from common-sense knowledge and abbreviations they already know, but even they do it badly, so there is little hope for NLP doing it unaided at this time.

Sometimes the definition of an abbreviation can be found in the same text, but that is not the case for Twitter or for slang.

So, yes, you have to store a mapping from acronyms to their expansions. To obtain them, search for acronym dictionaries, e.g. this slang dictionary, or that, or that, or that (the last seems to be the easiest to parse).
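The mapping approach can be sketched roughly as follows. This is only a minimal illustration: the `SLANG` table is a hand-written sample rather than data from any of the dictionaries mentioned, and the names `expand_slang` and `TOKEN_RE` are made up for this example. It assumes simple alphabetic tokens and a case-insensitive lookup.

```python
import re

# Hypothetical, hand-written sample; a real mapping would be loaded from
# a downloaded slang/acronym dictionary.
SLANG = {
    "brb": "be right back",
    "ttyl": "talk to you later",
    "luv": "love",
}

# Assumption: a "word" is a run of ASCII letters and apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def expand_slang(text, mapping=SLANG):
    """Replace each word found in `mapping` (case-insensitively) with its expansion."""
    def replace(match):
        word = match.group(0)
        # Unknown words are returned unchanged.
        return mapping.get(word.lower(), word)

    return TOKEN_RE.sub(replace, text)


print(expand_slang("brb, luv this song! TTYL"))
# be right back, love this song! talk to you later
```

Substituting per token, rather than calling `str.replace` on the whole tweet, avoids rewriting letters that happen to sit inside longer words.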

As for other slang like 'kool', you can try spell correction algorithms; see the related question.
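As one possible illustration of spell correction, here is a small sketch in the style of a frequency-based, single-edit corrector. It is not necessarily the method from the related question. The `WORD_COUNTS` table is invented for the example (a real corrector would count words in a large corpus), and `edits1` and `correct` are illustrative names.

```python
import string

# Hypothetical word counts, invented for this example.
WORD_COUNTS = {"cool": 50, "love": 80, "tool": 10, "look": 40, "live": 30}


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts=WORD_COUNTS):
    """Return `word` if known, else the most frequent known word one edit away."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing close enough; leave the token unchanged
    return max(candidates, key=counts.get)


print(correct("kool"))  # cool
print(correct("luv"))   # luv
```

Here 'kool' is one substitution away from both 'cool' and 'tool', and the more frequent 'cool' wins. 'luv' is two edits away from 'love', so this sketch leaves it unchanged; forms like that are better handled by an explicit mapping.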
