How to initialize a second GloVe model with the solution from the first?

Question

I am trying to implement one of the solutions to the question "How to align two GloVe models in text2vec?". I don't understand what the proper input values are for GlobalVectors$new(..., init = list(w_i, w_j)). How do I ensure that the values for w_i and w_j are correct?

Here is a minimal reproducible example. First, prepare some corpora to compare, taken from the quanteda tutorial. I am using dfm_match(all_words) to try to ensure that all words are present in each set, but this doesn't seem to have the desired effect.

library(quanteda)

# from https://quanteda.io/articles/pkgdown/replication/text2vec.html

# get a list of all words in all documents
all_words <-
  data_corpus_inaugural %>% 
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) %>% 
  types()

# we should expect this many features in each set
length(all_words)

# these are our three sets that we want to compare, we want to project the
# change in a few key words on a fixed background of other words
corpus_1 <- data_corpus_inaugural[1:19]
corpus_2 <- data_corpus_inaugural[20:39]
corpus_3 <- data_corpus_inaugural[40:58]

my_tokens1 <- texts(corpus_1) %>%
  char_tolower() %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) 

my_tokens2 <- texts(corpus_2) %>%
  char_tolower() %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) 

my_tokens3 <- texts(corpus_3) %>%
  char_tolower() %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE) 

my_feats1 <- 
  dfm(my_tokens1, verbose = TRUE) %>%
  dfm_trim(min_termfreq = 5) %>% 
  dfm_match(all_words) %>% 
  featnames()

my_feats2 <- 
  dfm(my_tokens2, verbose = TRUE) %>%
  dfm_trim(min_termfreq = 5) %>%
  dfm_match(all_words) %>% 
  featnames()

my_feats3 <- 
  dfm(my_tokens3, verbose = TRUE) %>%
  dfm_trim(min_termfreq = 5) %>%
  dfm_match(all_words) %>% 
  featnames()

# leave the pads so that non-adjacent words will not become adjacent
my_toks1_2 <- tokens_select(my_tokens1, my_feats1, padding = TRUE)
my_toks2_2 <- tokens_select(my_tokens2, my_feats2, padding = TRUE)
my_toks3_2 <- tokens_select(my_tokens3, my_feats3, padding = TRUE)

# Construct the feature co-occurrence matrix
my_fcm1 <- fcm(my_toks1_2, context = "window", tri = TRUE)
my_fcm2 <- fcm(my_toks2_2, context = "window", tri = TRUE)
my_fcm3 <- fcm(my_toks3_2, context = "window", tri = TRUE)

Somewhere in the above steps, I believe I need to ensure that the fcm for each set contains all the words of all sets so that the matrix dimensions are the same, but I'm not sure how to accomplish that.
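
One possible way to bring co-occurrence matrices onto a shared vocabulary is sketched below on toy data rather than on the corpora above. The helper name pad_cooccurrence is hypothetical, and the sketch assumes the matrix can be coerced to the Matrix package's triplet format and that its row and column names are the words. Whether the fitting routine accepts rows with no co-occurrences has not been verified here.

```r
library(Matrix)

# Hypothetical helper: expand a named co-occurrence matrix to a shared
# vocabulary, filling words that are absent from `m` with zeros.
pad_cooccurrence <- function(m, vocab) {
  trip <- as(m, "TsparseMatrix")            # triplet form with 0-based @i / @j
  row_idx <- match(rownames(trip), vocab)   # position of each row word in vocab
  col_idx <- match(colnames(trip), vocab)   # position of each column word in vocab
  stopifnot(!anyNA(row_idx), !anyNA(col_idx))
  sparseMatrix(
    i = row_idx[trip@i + 1L],
    j = col_idx[trip@j + 1L],
    x = trip@x,
    dims = c(length(vocab), length(vocab)),
    dimnames = list(vocab, vocab)
  )
}

# Toy matrices over different word sets (illustrative data only)
toy1 <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(2, 3), dims = c(2, 2),
                     dimnames = list(c("a", "b"), c("a", "b")))
toy2 <- sparseMatrix(i = 1, j = 2, x = 5, dims = c(2, 2),
                     dimnames = list(c("b", "c"), c("b", "c")))

vocab <- c("a", "b", "c")
padded1 <- pad_cooccurrence(toy1, vocab)
padded2 <- pad_cooccurrence(toy2, vocab)

# Both matrices now have the same dimensions and the same word order
identical(dim(padded1), dim(padded2))
```

Entries keep their values, but if the input stores only one triangle (as with tri = TRUE), they may land on either side of the diagonal once the words are reordered.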

Now fit the word embedding model for the first set:


library("text2vec")

glove1 <- GlobalVectors$new(rank = 50, 
                            x_max = 10)

my_main1 <- glove1$fit_transform(my_fcm1, 
                               n_iter = 10,
                               convergence_tol = 0.01, 
                               n_threads = 8)

my_context1 <- glove1$components
word_vectors1 <- my_main1 + t(my_context1)

This is where I get stuck. I want to initialise the second model with the first, so that the coordinate system is comparable between the two models. I read that w_i and w_j are the main and context word vectors, and that b_i and b_j are the biases. I have found the output for those in my first model object, but I get an error:

glove2 <- GlobalVectors$new(rank = 50, 
                            x_max = 10,
                            init = list(w_i = glove1$.__enclos_env__$private$w_i, 
                                        b_i = glove1$.__enclos_env__$private$b_i, 
                                        w_j = glove1$.__enclos_env__$private$w_j, 
                                        b_j = glove1$.__enclos_env__$private$b_j))

my_main2 <- glove2$fit_transform(my_fcm2, 
                                 n_iter = 10,
                                 convergence_tol = 0.01, 
                                 n_threads = 8)

The error is: Error in glove2$fit_transform(my_fcm2, n_iter = 10, convergence_tol = 0.01, : init values provided in the constructor don't match expected dimensions from the input matrix

Assuming I have identified w_i, etc., correctly in the first model, how can I ensure they are the correct size?

Here is my session info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] text2vec_0.6   quanteda_2.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4            pillar_1.4.3          compiler_3.6.0        tools_3.6.0           stopwords_1.0        
 [6] digest_0.6.25         packrat_0.5.0         lifecycle_0.2.0       tibble_3.0.0          gtable_0.3.0         
[11] lattice_0.20-40       pkgconfig_2.0.3       rlang_0.4.5           Matrix_1.2-18         fastmatch_1.1-0      
[16] cli_2.0.2             rstudioapi_0.11       mlapi_0.1.0           parallel_3.6.0        RhpcBLASctl_0.20-17  
[21] dplyr_0.8.5           vctrs_0.2.4           grid_3.6.0            tidyselect_1.0.0.9000 glue_1.3.2           
[26] data.table_1.12.8     R6_2.4.1              fansi_0.4.1           lgr_0.3.4             ggplot2_3.3.0        
[31] purrr_0.3.3           magrittr_1.5          scales_1.1.0          ellipsis_0.3.0        assertthat_0.2.1     
[36] float_0.2-3           rsparse_0.4.0         colorspace_1.4-1      stringi_1.4.6         RcppParallel_5.0.0   
[41] munsell_0.5.0         crayon_1.3.4.9000 

Answer

Here is a working example. See the ?rsparse::GloVe documentation for details.

library(rsparse)
data("movielens100k")
x = crossprod(sign(movielens100k))

model = GloVe$new(rank = 10, x_max = 5)

w_i = model$fit_transform(x = x, n_iter = 5, n_threads = 1)
w_j = model$components
# name every element so that the first model's bias_i is passed as b_i
init = list(w_i = t(w_i), b_i = model$bias_i, w_j = w_j, b_j = model$bias_j)

model2 = GloVe$new(rank = 10, x_max = 10, init = init)
w_i2 = model2$fit_transform(x)
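
The error in the question is raised when the shapes of the init values do not agree with the input matrix. Judging from the example above, where the output of fit_transform is transposed before being passed as w_i, the word-vector matrices appear to be laid out as rank x number of terms. The sketch below checks that layout before a model is constructed. The helper name init_matches is hypothetical, and the expectation that each bias vector has one entry per term is an assumption, not something taken from the package documentation.

```r
# Hypothetical helper: report which init elements have the assumed shapes.
# Assumed layout: w_i and w_j are rank x n_terms matrices,
# b_i and b_j are numeric vectors of length n_terms.
init_matches <- function(init, n_terms, rank) {
  expected <- c(as.integer(rank), as.integer(n_terms))
  ok_matrix <- function(w) is.matrix(w) && identical(dim(w), expected)
  ok_bias <- function(b) is.numeric(b) && length(b) == n_terms
  c(w_i = ok_matrix(init$w_i), b_i = ok_bias(init$b_i),
    w_j = ok_matrix(init$w_j), b_j = ok_bias(init$b_j))
}

# Toy values: rank 2, three terms (illustrative data only)
good <- list(w_i = matrix(0, 2, 3), b_i = numeric(3),
             w_j = matrix(0, 2, 3), b_j = numeric(3))
# Same values, but w_i is left in terms x rank orientation
transposed <- good
transposed$w_i <- t(good$w_i)

init_matches(good, n_terms = 3, rank = 2)
init_matches(transposed, n_terms = 3, rank = 2)
```

For the question's data, n_terms would be ncol(my_fcm2). Since my_fcm1 and my_fcm2 are built from different feature sets, vectors taken from the first model will generally fail such a check unless both matrices share one vocabulary.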
