Quanteda: Fastest way to replace tokens with lemmas from a dictionary?
Question
Is there a much faster alternative to R's quanteda::tokens_lookup()?
I use tokens() from the 'quanteda' R package to tokenize a data frame with 2000 documents. Each document is 50-600 words. This takes a couple of seconds on my PC (Microsoft R Open 3.4.1, Intel MKL, using 2 cores).
I have a dictionary object built from a data frame of nearly 600,000 words (TERM) and their corresponding lemmas (PARENT). There are 80,000 distinct lemmas.
I use tokens_lookup() to replace the elements in the token list with the lemmas found in the dictionary, but this takes at least 1.5 hours. That is far too slow for my problem. Is there a quicker way that still returns a tokens list?
I want to transform the token list directly, so that I can build ngrams AFTER applying the dictionary. If I only wanted unigrams, I could easily do this by joining the document-feature matrix with the dictionary.
How can I do this faster? Convert the token list to a data frame, join it with the dictionary, then convert it back to an ordered token list?
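The join-and-rebuild idea described above can be sketched in base R alone (no quanteda needed): flatten the per-document token vectors, do a single vectorized match() against the TERM/LEMMA table, and split the result back into documents. `lemmatize` and the tiny `toks` list are hypothetical names for illustration, not part of the question's code.

```r
# Minimal sketch of the "flatten, join, rebuild" approach, base R only.
toks <- list(text1 = c("the", "man", "runs", "home"),
             text2 = c("our", "men", "ran", "to", "work"))
term  <- c("man", "men", "woman", "women", "run", "runs", "ran")
lemma <- c("human", "human", "human", "humen", "run", "run", "run")

lemmatize <- function(tok_list, from, to) {
  lens <- lengths(tok_list)                 # remember document boundaries
  flat <- unlist(tok_list, use.names = FALSE)
  hit  <- match(flat, from)                 # one vectorized dictionary lookup
  flat[!is.na(hit)] <- to[hit[!is.na(hit)]] # replace only matched tokens
  stats::setNames(split(flat, rep(seq_along(tok_list), lens)),
                  names(tok_list))          # rebuild the per-document lists
}

lemmatize(toks, term, lemma)
# $text1: "the" "human" "run" "home"
# $text2: "our" "human" "run" "to" "work"
```

Note this matches every token, not just every unique type, so on a large corpus it does more work than the type-level replacement shown in the answer below.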
Here is example code:
library(quanteda)
myText <- c("the man runs home", "our men ran to work")
myDF <- data.frame(myText)
myDF$myText <- as.character(myDF$myText)
tokens <- tokens(myDF$myText, what = "word",
                 remove_numbers = TRUE, remove_punct = TRUE,
                 remove_symbols = TRUE, remove_hyphens = TRUE)
tokens
# tokens from 2 documents.
# text1 :
# [1] "the" "man" "runs" "home"
#
# text2 :
# [1] "our" "men" "ran" "to" "work"
term <- c("man", "men", "woman", "women", "run", "runs", "ran")
lemma <- c("human", "human", "human", "humen", "run", "run", "run")
dict_df <- data.frame(TERM=term, LEMMA=lemma)
dict_df
# TERM LEMMA
# 1 man human
# 2 men human
# 3 woman human
# 4 women humen
# 5 run run
# 6 runs run
# 7 ran run
dict_list <- list( "human" = c("man", "men", "woman", "women") , "run" = c("run", "runs", "ran"))
dict <- quanteda::dictionary(dict_list)
dict
# Dictionary object with 2 key entries.
# - human:
# - man, men, woman, women
# - run:
# - run, runs, ran
tokens_lemma <- tokens_lookup(tokens, dictionary=dict, exclusive = FALSE, capkeys = FALSE)
tokens_lemma
# tokens from 2 documents.
# text1 :
# [1] "the" "human" "run" "home"
#
# text2 :
# [1] "our" "human" "run" "to" "work"
tokens_ngrams <- tokens_ngrams(tokens_lemma, n = 1:2)
tokens_ngrams
# tokens from 2 documents.
# text1 :
# [1] "the" "human" "run" "home" "the_human" "human_run" "run_home"
#
# text2 :
# [1] "our" "human" "run" "to" "work" "our_human" "human_run" "run_to" "to_work"
Answer
I don't have a lemma list to benchmark against myself, but this is the fastest way to convert token types. Please try it and let me know how long it takes (it should finish in a few seconds).
tokens_convert <- function(x, from, to) {
  # The types attribute holds the unique vocabulary of the tokens object
  type <- attr(x, 'types')
  # One vectorized lookup over the unique types, not over every token
  type_new <- to[match(type, from)]
  # Keep the original type wherever no lemma was found
  type_new <- ifelse(is.na(type_new), type, type_new)
  attr(x, 'types') <- type_new
  # tokens_recompile() merges types that became duplicates; it is an
  # unexported internal, so ':::' access may break in future quanteda versions
  quanteda:::tokens_recompile(x)
}

# Coerce to character in case the data.frame columns are factors
tokens_convert(tokens, as.character(dict_df$TERM), as.character(dict_df$LEMMA))