Does the lemmatization mechanism reduce the size of the corpus?

Problem description

Dear community members,

During the pre-processing of data, after splitting the raw_data into tokens, I have used the popular WordNet Lemmatizer to generate the stems. I am performing experiments on a dataset that has 18953 tokens.

My question is, does the lemmatization process reduce the size of the corpus? I am confused, kindly help in this regard. Any help is appreciated!

Recommended answer

Lemmatization converts each token (aka form) in the sentence into its lemma form (aka type):

>>> from nltk import word_tokenize
>>> from pywsd.utils import lemmatize_sentence

>>> text = ['This is a corpus with multiple sentences.', 'This was the second sentence running.', 'For some reasons, there is a need to second foo bar ran.']

>>> lemmatize_sentence(text[0]) # Lemmatized sentence example.
['this', 'be', 'a', 'corpus', 'with', 'multiple', 'sentence', '.']
>>> word_tokenize(text[0]) # Tokenized sentence example. 
['This', 'is', 'a', 'corpus', 'with', 'multiple', 'sentences', '.']
>>> word_tokenize(text[0].lower()) # Lowercased and tokenized sentence example.
['this', 'is', 'a', 'corpus', 'with', 'multiple', 'sentences', '.']

If we lemmatize the sentence, each token should receive the corresponding lemma form, so the number of "words" remains the same whether we count forms or types:

>>> num_tokens = sum([len(word_tokenize(sent.lower())) for sent in text])
>>> num_lemmas = sum([len(lemmatize_sentence(sent)) for sent in text])
>>> num_tokens, num_lemmas
(29, 29)


>>> [lemmatize_sentence(sent) for sent in text] # lemmatized sentences
[['this', 'be', 'a', 'corpus', 'with', 'multiple', 'sentence', '.'], ['this', 'be', 'the', 'second', 'sentence', 'running', '.'], ['for', 'some', 'reason', ',', 'there', 'be', 'a', 'need', 'to', 'second', 'foo', 'bar', 'ran', '.']]

>>> [word_tokenize(sent.lower()) for sent in text] # tokenized sentences
[['this', 'is', 'a', 'corpus', 'with', 'multiple', 'sentences', '.'], ['this', 'was', 'the', 'second', 'sentence', 'running', '.'], ['for', 'some', 'reasons', ',', 'there', 'is', 'a', 'need', 'to', 'second', 'foo', 'bar', 'ran', '.']]

压缩"本身是指在对句子进行了词形化之后,在整个语料库中表示的唯一标记的数量.

The "compression" per-se would refer to the number of unique tokens represented in the whole corpus after you've lemmatized the sentences, e.g.

>>> from itertools import chain
>>> lemma_vocab = set(chain(*[lemmatize_sentence(sent) for sent in text]))
>>> token_vocab = set(chain(*[word_tokenize(sent.lower()) for sent in text]))
>>> len(lemma_vocab), len(token_vocab)
(21, 23)

>>> lemma_vocab
{'the', 'this', 'to', 'reason', 'for', 'second', 'a', 'running', 'some', 'sentence', 'be', 'foo', 'ran', 'with', '.', 'need', 'multiple', 'bar', 'corpus', 'there', ','}
>>> token_vocab
{'the', 'this', 'to', 'for', 'sentences', 'a', 'second', 'running', 'some', 'is', 'sentence', 'foo', 'reasons', 'with', 'ran', '.', 'need', 'multiple', 'bar', 'corpus', 'there', 'was', ','}
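
To run the same check on your own 18953-token dataset with NLTK's WordNetLemmatizer (the lemmatizer mentioned in the question), a minimal sketch could look like the following; `raw_data` here is just a placeholder for your own list of sentence strings, and note that `lemmatize()` defaults to the noun POS, so its output differs slightly from the POS-aware `lemmatize_sentence` used above:

from itertools import chain

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Placeholder: replace with your own list of raw sentence strings.
raw_data = ['This is a corpus with multiple sentences.',
            'This was the second sentence running.']

wnl = WordNetLemmatizer()

tokens = [word_tokenize(sent.lower()) for sent in raw_data]
# lemmatize() defaults to pos='n', so verbs are not always reduced to their base form.
lemmas = [[wnl.lemmatize(tok) for tok in sent] for sent in tokens]

# The running-token count is unchanged: each token maps to exactly one lemma.
print(sum(map(len, tokens)), sum(map(len, lemmas)))

# Only the vocabulary (number of unique types) can shrink.
print(len(set(chain(*tokens))), len(set(chain(*lemmas))))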


Note: Lemmatization is a pre-processing step, but it should not overwrite your original corpus with the lemmatized forms.
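
For instance, a minimal sketch (not from the original answer) of keeping the lemmatized view alongside the raw corpus instead of overwriting it, using the same `text` list and pywsd's `lemmatize_sentence` as above:

from nltk import word_tokenize
from pywsd.utils import lemmatize_sentence

text = ['This is a corpus with multiple sentences.',
        'This was the second sentence running.',
        'For some reasons, there is a need to second foo bar ran.']

corpus = [
    {
        'raw': sent,                            # original sentence, untouched
        'tokens': word_tokenize(sent.lower()),  # surface forms
        'lemmas': lemmatize_sentence(sent),     # lemmatized view, used only for counting/matching
    }
    for sent in text
]

# corpus[0]['tokens'] -> ['this', 'is', 'a', 'corpus', 'with', 'multiple', 'sentences', '.']
# corpus[0]['lemmas'] -> ['this', 'be', 'a', 'corpus', 'with', 'multiple', 'sentence', '.']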
