NLTK tokenize - faster way?

Question

I have a method that takes in a String parameter, and uses NLTK to break the String down to sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dictionary of the frequency of each word.

import nltk
from collections import Counter

def freq(string):
    f = Counter()
    sentence_list = nltk.tokenize.sent_tokenize(string)
    for sentence in sentence_list:
        words = nltk.word_tokenize(sentence)
        words = [word.lower() for word in words]
        for word in words:
            f[word] += 1
    return f

I'm supposed to optimize the above code further to result in faster preprocessing time, and am unsure how to do so. The return value should obviously be exactly the same as the above, so I'm expected to use nltk though not explicitly required to do so.

Any way to speed up the above code? Thanks.

Answer

If you just want a flat list of tokens, note that word_tokenize calls sent_tokenize implicitly; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L98

_treebank_word_tokenize = TreebankWordTokenizer().tokenize
def word_tokenize(text, language='english'):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus
    """
    return [token for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]
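
So if the goal is just the frequency dictionary, the asker's freq() can drop its explicit sentence loop. A minimal sketch (my adaptation, not from the answer; it should produce the same Counter, since word_tokenize() performs the sentence split itself):

import nltk
from collections import Counter

def freq(string):
    # word_tokenize() already calls sent_tokenize() internally,
    # so one generator expression replaces the nested loops.
    return Counter(word.lower() for word in nltk.word_tokenize(string))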

Taking the Brown corpus as an example, with Counter(word_tokenize(string_corpus)):

>>> import time
>>> from collections import Counter
>>> from nltk.corpus import brown
>>> from nltk import sent_tokenize, word_tokenize
>>> string_corpus = brown.raw() # Plaintext, str type.
>>> start = time.time(); fdist = Counter(word_tokenize(string_corpus)); end = time.time() - start
>>> end
12.662328958511353
>>> fdist.most_common(5)
[(u',', 116672), (u'/', 89031), (u'the/at', 62288), (u'.', 60646), (u'./', 48812)]
>>> sum(fdist.values())
1423314

~1.4 million words took 12 secs (without saving the tokenized corpus) on my machine with the specs below. (Tokens like u'the/at' appear because brown.raw() returns the text of the tagged corpus, POS annotations included.)

alvas@ubi:~$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 69
model name  : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
stepping    : 1
microcode   : 0x17
cpu MHz     : 1600.027
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2

$ cat /proc/meminfo
MemTotal:       12004468 kB

Saving the tokenized corpus first with tokenized_corpus = [word_tokenize(sent) for sent in sent_tokenize(string_corpus)], then using Counter(chain(*tokenized_corpus)):

>>> from itertools import chain
>>> start = time.time(); tokenized_corpus = [word_tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start
>>> end
16.421464920043945
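
As an aside (my suggestion, not part of the original answer), a slightly more idiomatic and memory-friendlier spelling of the counting step is chain.from_iterable(), which iterates over the nested token lists without unpacking them all as call arguments:

>>> from itertools import chain
>>> fdist = Counter(chain.from_iterable(tokenized_corpus))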

Using ToktokTokenizer():

>>> from collections import Counter
>>> import time
>>> from itertools import chain
>>> from nltk.corpus import brown
>>> from nltk import sent_tokenize, word_tokenize
>>> from nltk.tokenize import ToktokTokenizer
>>> toktok = ToktokTokenizer()
>>> string_corpus = brown.raw()

>>> start = time.time(); tokenized_corpus = [toktok.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
10.00472116470337
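
To reproduce the asker's exact return value (lowercased counts) with the faster tokenizer, here is a minimal sketch of freq() rebuilt around ToktokTokenizer (my adaptation, not from the original answer):

from collections import Counter
from nltk import sent_tokenize
from nltk.tokenize import ToktokTokenizer

toktok = ToktokTokenizer()

def freq(string):
    # Same pipeline as the original: split sentences, tokenize, lowercase, count.
    return Counter(word.lower()
                   for sent in sent_tokenize(string)
                   for word in toktok.tokenize(sent))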

Using MosesTokenizer():

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer()
>>> start = time.time(); tokenized_corpus = [moses.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
30.783339023590088
>>> start = time.time(); tokenized_corpus = [moses.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
30.559681177139282

Why use MosesTokenizer?

It is implemented in such a way that the tokens can be reversed back into a string, i.e. it can "detokenize".

>>> from nltk.tokenize.moses import MosesTokenizer, MosesDetokenizer
>>> t, d = MosesTokenizer(), MosesDetokenizer()
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = [u'This', u'ain', u'&apos;t', u'funny.', u'It', u'&apos;s', u'actually', u'hillarious', u',', u'yet', u'double', u'Ls.', u'&#124;', u'&#91;', u'&#93;', u'&lt;', u'&gt;', u'&#91;', u'&#93;', u'&amp;', u'You', u'&apos;re', u'gonna', u'shake', u'it', u'off', u'?', u'Don', u'&apos;t', u'?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> tokens = t.tokenize(sent)
>>> tokens == expected_tokens
True
>>> detokens = d.detokenize(tokens)
>>> " ".join(detokens) == expected_detokens
True
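
Note that nltk.tokenize.moses has since been removed from NLTK; the same tokenizer now lives in the standalone sacremoses package. A rough present-day equivalent (a sketch, assuming sacremoses is installed):

from sacremoses import MosesTokenizer, MosesDetokenizer

t, d = MosesTokenizer(), MosesDetokenizer()
tokens = t.tokenize("This ain't funny. It's actually hilarious.")
text = d.detokenize(tokens)  # joins the tokens back into a single string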

Using ReppTokenizer():

>>> from nltk.tokenize.repp import ReppTokenizer
>>> repp = ReppTokenizer('/home/alvas/repp')
>>> start = time.time(); sentences = sent_tokenize(string_corpus); tokenized_corpus = repp.tokenize_sents(sentences); fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start
>>> end
76.44129395484924

Why use ReppTokenizer?

It returns the offsets of the tokens in the original string.

>>> sents = ['Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve.' ,
... 'But rule-based tokenizers are hard to maintain and their rules language specific.' ,
... 'We evaluated our method on three languages and obtained error rates of 0.27% (English), 0.35% (Dutch) and 0.76% (Italian) for our best models.'
... ]
>>> tokenizer = ReppTokenizer('/home/alvas/repp/') # doctest: +SKIP
>>> for sent in sents:                             # doctest: +SKIP
...     tokenizer.tokenize(sent)                   # doctest: +SKIP
... 
(u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.')
(u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.')
(u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents): 
...     print sent                               
... 
(u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.')
(u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.')
(u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents, keep_token_positions=True): 
...     print sent
... 
[(u'Tokenization', 0, 12), (u'is', 13, 15), (u'widely', 16, 22), (u'regarded', 23, 31), (u'as', 32, 34), (u'a', 35, 36), (u'solved', 37, 43), (u'problem', 44, 51), (u'due', 52, 55), (u'to', 56, 58), (u'the', 59, 62), (u'high', 63, 67), (u'accuracy', 68, 76), (u'that', 77, 81), (u'rulebased', 82, 91), (u'tokenizers', 92, 102), (u'achieve', 103, 110), (u'.', 110, 111)]
[(u'But', 0, 3), (u'rule-based', 4, 14), (u'tokenizers', 15, 25), (u'are', 26, 29), (u'hard', 30, 34), (u'to', 35, 37), (u'maintain', 38, 46), (u'and', 47, 50), (u'their', 51, 56), (u'rules', 57, 62), (u'language', 63, 71), (u'specific', 72, 80), (u'.', 80, 81)]
[(u'We', 0, 2), (u'evaluated', 3, 12), (u'our', 13, 16), (u'method', 17, 23), (u'on', 24, 26), (u'three', 27, 32), (u'languages', 33, 42), (u'and', 43, 46), (u'obtained', 47, 55), (u'error', 56, 61), (u'rates', 62, 67), (u'of', 68, 70), (u'0.27', 71, 75), (u'%', 75, 76), (u'(', 77, 78), (u'English', 78, 85), (u')', 85, 86), (u',', 86, 87), (u'0.35', 88, 92), (u'%', 92, 93), (u'(', 94, 95), (u'Dutch', 95, 100), (u')', 100, 101), (u'and', 102, 105), (u'0.76', 106, 110), (u'%', 110, 111), (u'(', 112, 113), (u'Italian', 113, 120), (u')', 120, 121), (u'for', 122, 125), (u'our', 126, 129), (u'best', 130, 134), (u'models', 135, 141), (u'.', 141, 142)]


TL;DR

Advantages of the different tokenizers:

  • word_tokenize() implicitly calls sent_tokenize()
  • ToktokTokenizer() is fastest
  • MosesTokenizer() is able to detokenize text
  • ReppTokenizer() is able to provide token offsets

Q: Is there a fast tokenizer in NLTK that can detokenize, also provides token offsets, and also does sentence tokenization?

A: I don't think so, try gensim or spacy.
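
For example, a minimal spaCy sketch (my illustration, not from the answer; spacy.blank() loads just the rule-based tokenizer and the "sentencizer" pipe adds rule-based sentence splitting, so no trained model download is needed):

import spacy
from collections import Counter

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries

doc = nlp("This ain't funny. It's actually hilarious.")
sentences = list(doc.sents)                            # sentence tokenization
fdist = Counter(tok.lower_ for tok in doc)             # lowercased frequencies
offsets = [(tok.text, tok.idx, tok.idx + len(tok.text)) for tok in doc]  # char offsets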
