What embedding-layer output_dim is really needed for a dictionary of just 10000 words?


Problem Description


I'm training up an RNN with a very reduced set of word features, around 10,000. I was planning on starting with an embedding layer before adding RNNs, but it is very unclear to me what dimensionality is really needed. I know that I can try out different values (32, 64, etc.), but I'd rather have some intuition going into it first. For example, if I use a 32-dimensional embedding vector, then only 3 different values are needed per dimension to fully describe the space (32**3>>10000).


Alternatively, for a space with this small number of words, does one even really need to use an embedding layer or does it make more sense to just go from an input layer right to the RNN?
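To make the two options concrete, here is a minimal sketch (not from the original post; written against tf.keras, with illustrative layer sizes and sequence length) of feeding one-hot vectors straight into the RNN versus passing integer token ids through a trainable embedding layer first:

import tensorflow as tf

vocab_size, seq_len = 10000, 50

# Option A: one-hot vectors go straight into the RNN; every timestep is a 10000-dimensional input
one_hot_input = tf.keras.Input(shape=(seq_len, vocab_size))
rnn_a = tf.keras.layers.LSTM(64)(one_hot_input)

# Option B: integer token ids are first mapped to dense 32-dimensional vectors by an embedding layer
token_input = tf.keras.Input(shape=(seq_len,), dtype="int32")
embedded = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=32)(token_input)
rnn_b = tf.keras.layers.LSTM(64)(embedded)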

Recommended Answer


This is a good question that does not have a good answer. You should definitely use an embedding layer and not just go straight into an LSTM/GRU. However, the latent dimension of the embedding layer should be "as large as possible while maintaining peak validation performance". For a dictionary of roughly your size, 128 or 256 should be a reasonable choice. I doubt you will see drastically different performance.
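As a rough illustration of that recommendation (a hedged sketch, not the answer author's code; the model shape, sequence length and tf.keras usage are assumptions), a 10,000-word vocabulary with output_dim=128 gives an embedding table of 10,000 x 128 = 1,280,000 trainable weights, which dominates the parameter count of a small recurrent model:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),  # 10,000 x 128 = 1,280,000 parameters
    layers.LSTM(64),                                    # 4 * (64 * (128 + 64) + 64) = 49,408 parameters
    layers.Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, 50))  # 50 is an arbitrary example sequence length
model.summary()                      # the embedding table accounts for most of the weights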


However, something that will really hurt your results on a small data set is not using pre-trained word embeddings. Without them, your embeddings will brutally overfit to your training data. I recommend using GloVe word embeddings. After downloading the GloVe data, you can use it to initialize the weights of your embedding layer, and the embedding layer will then fine-tune those weights to your use case. Here is some code I use to load GloVe embeddings with Keras. It lets you load the different sizes and also caches the matrix so that a second run is fast.

from enum import Enum
import os

import numpy as np


class GloVeSize(Enum):

    tiny = 50
    small = 100
    medium = 200
    large = 300


__DEFAULT_SIZE = GloVeSize.small


def get_pretrained_embedding_matrix(word_to_index,
                                    vocab_size=10000,
                                    glove_dir="./bin/GloVe",
                                    use_cache_if_present=True,
                                    cache_if_computed=True,
                                    cache_dir='./bin/cache',
                                    size=__DEFAULT_SIZE,
                                    verbose=1):

    """
    get pre-trained word embeddings from GloVe: https://github.com/stanfordnlp/GloVe
    :param word_to_index: a word to index map of the corpus
    :param vocab_size: the vocab size
    :param glove_dir: the dir of glove
    :param use_cache_if_present: whether to use a cached weight file if present
    :param cache_if_computed: whether to cache the result if re-computed
    :param cache_dir: the directory of the project's cache
    :param size: an enumerated choice of GloVeSize
    :param verbose: the verbosity level of logging
    :return: a matrix of the embeddings
    """
    def vprint(*args, with_arrow=True):
        if verbose > 0:
            if with_arrow:
                print(">>", *args)
            else:
                print(*args)

    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)

    cache_path = os.path.join(cache_dir, 'glove_%d_embedding_matrix.npy' % size.value)
    if use_cache_if_present and os.path.isfile(cache_path):
        return np.load(cache_path)
    else:
        vprint('computing embeddings', with_arrow=True)
        embeddings_index = {}
        size_value = size.value
        f = open(os.path.join(glove_dir, 'glove.6B.' + str(size_value) + 'd.txt'),
                 encoding="utf-8", errors='ignore')  # the GloVe text files are UTF-8 encoded

        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

        f.close()
        vprint('Found', len(embeddings_index), 'word vectors.')

        embedding_matrix = np.random.normal(size=(vocab_size, size.value))

        non = 0
        for word, index in word_to_index.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[index] = embedding_vector
            else:
                non += 1

        vprint(non, "words did not have mappings")
        vprint(with_arrow=False)

        if cache_if_computed:
            np.save(cache_path, embedding_matrix)

        return embedding_matrix


then instantiate your embedding layer with that weight matrix:

# Embedding is assumed to come from keras.layers (or tensorflow.keras.layers);
# `data`, `vocabulary_size`, `input_length` and `input_layer` are defined elsewhere in the model code.
embedding_size = GloVeSize.small
embedding_matrix = get_pretrained_embedding_matrix(data.word_to_index,
                                                   size=embedding_size)

embedding = Embedding(
    output_dim=embedding_size.value,
    input_dim=vocabulary_size + 1,  # +1 because index 0 is reserved for padding
    input_length=input_length,
    mask_zero=True,
    weights=[np.vstack((np.zeros((1, embedding_size.value)),  # all-zero vector for the padding index
                        embedding_matrix))],
    name='embedding'
)(input_layer)
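Note the all-zero row stacked on top of the GloVe matrix: with mask_zero=True, Keras reserves index 0 for padding, which is also why input_dim is set to vocabulary_size + 1. The pre-trained rows still adapt during training because the layer's trainable flag defaults to True, which gives the fine-tuning behaviour described above.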

