Lemmatization using a txt file with lemmas in R
Question
I would like to use an external txt file with Polish lemmas structured as follows (a source of lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/):
Abadan Abadanem
Abadan Abadanie
Abadan Abadanowi
Abadan Abadanu
abadańczyk abadańczycy
abadańczyk abadańczyka
abadańczyk abadańczykach
abadańczyk abadańczykami
abadańczyk abadańczyki
abadańczyk abadańczykiem
abadańczyk abadańczykom
abadańczyk abadańczyków
abadańczyk abadańczykowi
abadańczyk abadańczyku
abadanka abadance
abadanka abadanek
abadanka abadanką
abadanka abadankach
abadanka abadankami
What packages, and with what syntax, would allow me to use such a txt database to lemmatize my bag of words? I realize that for English there is Wordnet, but there is no such luck for those who would like to use this functionality for rarer languages.
If not, can this database be converted to be useful with any package that provides lemmatization, perhaps by converting it to a wide form? For instance, the form used by the free AntConc concordancer (http://www.laurenceanthony.net/software/antconc/):
Abadan -> Abadanem, Abadanie, Abadanowi, Abadanu
abadańczyk -> abadańczycy, abadańczyka, abadańczykach
etc.
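As a side note, the long-to-wide conversion itself can be done in base R. A minimal sketch, assuming a small three-row sample of the file (the variable names `dt` and `wide` are illustrative):

```r
# Collapse the two-column lemma list into AntConc-style wide lines.
dt <- read.table(text = "Abadan Abadanem
Abadan Abadanie
Abadan Abadanowi",
                 col.names = c("lemma", "word"),
                 stringsAsFactors = FALSE)

# One row per lemma, with all inflected forms joined by commas.
wide <- aggregate(word ~ lemma, data = dt, FUN = paste, collapse = ", ")
cat(paste0(wide$lemma, " -> ", wide$word), sep = "\n")
# Abadan -> Abadanem, Abadanie, Abadanowi
```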
In brief: how can lemmatization with lemmas from a txt file be done in any of the known CRAN R text-mining packages? And if so, how should such a txt file be formatted?
UPDATE: Dear @DmitriySelivanov, I got rid of all diacritical marks; now I would like to apply it to the tm corpus "docs":
docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm"))
I tried to use it as a tokenizer:
LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")
docsTDM <- DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25),
                                                   tokenize = LemmaTokenizer))
It throws an error at me:
Error in lemma_hashmap[[tokens]] :
attempt to select more than one element in vectorIndex
The function works like a charm on a vector of texts, though.
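[Editor's note] A likely cause of the error above, judging from the code shown (not confirmed in the thread): `lemma_hashmap = "lemma_hm"` passes the hashmap's *name as a string* rather than the hashmap object itself, so `lemma_hashmap[[tokens]]` ends up indexing a length-1 character vector with a multi-element subscript. Base R reproduces the same message:

```r
# Indexing a plain character string with a multi-element [[ subscript
# raises exactly the error reported above.
res <- tryCatch(
  "lemma_hm"[[c("Abadanowi", "abadanczykach")]],
  error = function(e) conditionMessage(e)
)
print(res)

# The fix would then be to pass the object, not its name:
# LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap = lemma_hm)
```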
Answer
My guess is that text-mining packages have nothing to do with this task. You just need to replace each word in the second column with the word in the first column. You can do this by creating a hashmap (for example, https://github.com/nathan-russell/hashmap).
Below is an example of how you can create a "lemmatizing" tokenizer, which you can easily use in text2vec (and, I guess, in quanteda as well).
Contributions toward creating such a "lemmatizing" package are very welcome; it would be very useful.
library(hashmap)
library(data.table)
txt =
"Abadan Abadanem
Abadan Abadanie
Abadan Abadanowi
Abadan Abadanu
abadańczyk abadańczycy
abadańczyk abadańczykach
abadańczyk abadańczykami
"
dt = fread(txt, header = F, col.names = c("lemma", "word"))
lemma_hm = hashmap(dt$word, dt$lemma)
lemma_hm[["Abadanu"]]
#"Abadan"
lemma_tokenizer = function(x, lemma_hashmap,
                           tokenizer = text2vec::word_tokenizer) {
  tokens_list = tokenizer(x)
  for (i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    # vectorized lookup; returns NA for out-of-vocabulary tokens
    replacements = lemma_hashmap[[tokens]]
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }
  tokens_list
}
texts = c("Abadanowi abadańczykach OutOfVocabulary",
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)
#[[1]]
#[1] "Abadan" "abadańczyk" "OutOfVocabulary"
#[[2]]
#[1] "abadańczyk" "Abadan" "OutOfVocabulary"
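If the hashmap package is not available (it may no longer install from CRAN), the same idea works with only base R: a named character vector as the lookup table. This is a sketch under that assumption; `lemma_tokenizer2`, `lemma_vec`, and the whitespace tokenizer are illustrative names, not part of the original answer:

```r
# Lookup table: names are inflected forms, values are lemmas
# (diacritics stripped here, as in the UPDATE above).
lemma_vec <- c(Abadanem = "Abadan", Abadanie = "Abadan",
               abadanczycy = "abadanczyk")

lemma_tokenizer2 <- function(x, lemma_lookup,
                             tokenizer = function(s) strsplit(s, "\\s+")) {
  lapply(tokenizer(x), function(tokens) {
    replacements <- lemma_lookup[tokens]  # vectorized named lookup, NA if missing
    unname(ifelse(is.na(replacements), tokens, replacements))
  })
}

out <- lemma_tokenizer2("Abadanem abadanczycy OutOfVocabulary", lemma_vec)
# out[[1]] is c("Abadan", "abadanczyk", "OutOfVocabulary")
```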