Get bigrams and trigrams in word2vec Gensim


Question

I am currently using unigrams in my word2vec model, as follows.

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words.
    #
    # Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())

    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences

However, this way I miss important bigrams and trigrams in my dataset.

E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"

Hence, I want to capture the important bigrams, trigrams, etc. in my dataset and feed them into my word2vec model.

I am new to word2vec and am struggling with how to do this. Please help me.

Recommended answer

First of all, you should use gensim's Phrases class to get bigrams. It works as shown in the docs:

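>>> from gensim.models.phrases import Phrases, Phraser
>>> # sentence_stream: your corpus as an iterable of token lists
>>> phrases = Phrases(sentence_stream)
>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']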

To get trigrams and longer phrases, take the output of the bigram model you already have and apply Phrases to it again, and so on. Example:

trigram_model = Phrases(bigram_sentences)
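To make the "apply it again" idea concrete, here is a pure-Python sketch of what each pass does. It is not gensim's code: `learn_counts` and `merge_phrases` are hypothetical helpers, the toy corpus is made up, and the score is a simplified version of the default collocation score described in gensim's docs, (pair_count - min_count) * vocab_size / (count_a * count_b). The second pass treats "new_york" as a single token, which is why it can then be joined with "city".

```python
from collections import Counter


def learn_counts(sentences):
    """Count single tokens and adjacent token pairs over tokenized sentences."""
    unigrams, pairs = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        pairs.update(zip(sent, sent[1:]))
    return unigrams, pairs


def merge_phrases(sentences, min_count=1, threshold=1.0, delimiter="_"):
    """Join adjacent tokens whose collocation score exceeds `threshold`."""
    unigrams, pairs = learn_counts(sentences)
    # Assumption: vocabulary size counts both single tokens and pairs.
    vocab_size = len(unigrams) + len(pairs)

    def score(a, b):
        return (pairs[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + delimiter + sent[i + 1])
                i += 2  # skip the token that was just merged
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


# Toy corpus (made up for illustration).
unigram_sentences = [
    ["the", "mayor", "of", "new", "york", "city", "was", "there"],
    ["new", "york", "city", "is", "big"],
    ["she", "moved", "to", "new", "york", "city", "last", "year"],
]

bigram_sentences = merge_phrases(unigram_sentences)   # pass 1 joins "new" + "york"
trigram_sentences = merge_phrases(bigram_sentences)   # pass 2 joins "new_york" + "city"

print(bigram_sentences[1])   # ['new_york', 'city', 'is', 'big']
print(trigram_sentences[1])  # ['new_york_city', 'is', 'big']
```

Pairs seen only once score zero here, so they are left alone; only the repeated pair is merged.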

There is also a good notebook and a video that explain how to use it: the notebook, the video.

The most important part is how to use it on real sentences, which goes as follows:

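from gensim.models.phrases import Phrases

# unigram_sentences: your tokenized corpus (a list of token lists)
# Train the bigram model
bigram_model = Phrases(unigram_sentences)

# Apply the trained model to every sentence
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(bigram_model[unigram_sentence])

# Train a trigram model on the bigram-transformed sentences
trigram_model = Phrases(bigram_sentences)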

Hope this helps, but next time please give us more information about what you are using.

P.S.: Now that you have edited the question: your code only splits the text into words and does nothing to produce bigrams. You have to use Phrases to get word pairs like New York as bigrams.
