Are there any classes in NLTK for text normalizing and canonizing?


Question

Most of the NLTK documentation and examples are devoted to lemmatization and stemming, but coverage is very sparse on normalization matters such as:

  • converting all letters to lower or upper case
  • removing punctuation
  • converting numbers into words
  • removing accent marks and other diacritics
  • expanding abbreviations
  • removing stopwords or "too common" words
  • text canonicalization (tumor = tumour, it's = it is)

Please point me to where in NLTK to dig. Any NLTK equivalents (Java or any other) for the aforementioned purposes are welcome. Thanks.

UPD. I have written a Python library for text normalization for text-to-speech purposes: https://github.com/soshial/text-normalization. It might suit you as well.

Answer

In practice, a lot of these (sub-)tasks are solved with plain Python rather than dedicated NLTK classes.

a) converting all letters to lower or upper case

text = 'aiUOd'
print(text.lower())
>> 'aiuod'
print(text.upper())
>> 'AIUOD'

b) removing punctuation

text = 'She? Hm, why not!'
puncts = '.,?!'  # punctuation characters to strip
for sym in puncts:
    text = text.replace(sym, ' ')
print(text)
>> 'She  Hm  why not '
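
If you want to cover all ASCII punctuation at once rather than a hand-picked set, a minimal sketch using the standard library's string.punctuation and str.translate (not part of the original answer) could look like this:

import string

text = 'She? Hm, why not!'
# map every ASCII punctuation character to a space
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
print(text.translate(table))
>> 'She  Hm  why not '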

c) converting numbers into words

Here it would not be that easy to write a few-liner, but there are a lot of existing solutions if you google it: code snippets, libraries, etc.
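
For example, the third-party num2words package (an assumption that an extra dependency outside NLTK is acceptable) does exactly this:

# pip install num2words
from num2words import num2words

print(num2words(42))
>> 'forty-two'
print(num2words(42, to='ordinal'))
>> 'forty-second'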

d) removing accent marks and other diacritics

Look up point b); just create the list with diacritics as puncts.
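
Another option (a sketch of a different approach, not taken from the original answer) is to decompose the text with the standard library's unicodedata module and drop the combining marks:

import unicodedata

def strip_diacritics(text):
    # NFD splits 'é' into 'e' plus a combining accent; combining() detects the accents
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics('café naïve'))
>> 'cafe naive'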

e) expanding abbreviations

Create a dictionary with abbreviations:

text = 'USA and GB are ...'
abbrevs = {'USA': 'United States', 'GB': 'Great Britain'}
for abbrev, expansion in abbrevs.items():
    text = text.replace(abbrev, expansion)
print(text)
>> 'United States and Great Britain are ...'

f) removing stopwords or "too common" words

Create a list with stopwords:

text = 'Mary had a little lamb'
temp_corpus = text.split(' ')
stops = ['a', 'the', 'had']  # words to filter out
corpus = [token for token in temp_corpus if token not in stops]
print(corpus)
>> ['Mary', 'little', 'lamb']
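
Since the question asks about NLTK specifically: NLTK ships ready-made stopword lists in nltk.corpus.stopwords. The sketch below assumes the corpus has been fetched once with nltk.download('stopwords'):

from nltk.corpus import stopwords

# nltk.download('stopwords')  # one-time download of the stopword corpus
stops = set(stopwords.words('english'))
text = 'Mary had a little lamb'
print([token for token in text.split() if token.lower() not in stops])
>> ['Mary', 'little', 'lamb']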

g) text canonicalization (tumor = tumour, it's = it is)

For tumor -> tumour, use a regex.
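
A minimal sketch along those lines, with a hand-picked (and purely illustrative) set of rules:

import re

# illustrative canonicalization rules: variant spelling -> canonical form, contraction -> full form
rules = [
    (r"\btumour\b", 'tumor'),
    (r"\bit's\b", 'it is'),
]
text = "The tumour grew, and it's spreading."
for pattern, replacement in rules:
    text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
print(text)
>> 'The tumor grew, and it is spreading.'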

Last but not least, please note that all of the examples above usually need calibration on real texts; I wrote them as a direction to go.

