Lemmatization using txt file with lemmas in R

This article discusses how to lemmatize text in R using a txt file of lemmas; the answer below may serve as a useful reference.

Problem description

I would like to use an external txt file with Polish lemmas, structured as follows (a source of lemma files for many other languages: http://www.lexiconista.com/datasets/lemmatization/):

Abadan  Abadanem
Abadan  Abadanie
Abadan  Abadanowi
Abadan  Abadanu
abadańczyk  abadańczycy
abadańczyk  abadańczyka
abadańczyk  abadańczykach
abadańczyk  abadańczykami
abadańczyk  abadańczyki
abadańczyk  abadańczykiem
abadańczyk  abadańczykom
abadańczyk  abadańczyków
abadańczyk  abadańczykowi
abadańczyk  abadańczyku
abadanka    abadance
abadanka    abadanek
abadanka    abadanką
abadanka    abadankach
abadanka    abadankami

Which packages, and with what syntax, would let me use such a txt database to lemmatize my bag of words? I realize that for English there is Wordnet, but there is no such luck for those who would like to use this functionality with rarer languages.

If not, can this database be converted into a form useful to any package that provides lemmatization? Perhaps by converting it to a wide format, for instance the one used by the free AntConc concordancer (http://www.laurenceanthony.net/software/antconc/):

Abadan -> Abadanem, Abadanie, Abadanowi, Abadanu
abadańczyk -> abadańczycy, abadańczyka, abadańczykach 
etc.
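That long-to-wide conversion can be done in base R. The following is my own sketch, not part of the question or answer; the column names and the ASCII-folded sample rows are assumptions for illustration:

```r
# Assumed layout: two whitespace-separated columns, (lemma, inflected form).
dt <- read.table(text = "Abadan Abadanem
Abadan Abadanie
Abadan Abadanowi
abadanczyk abadanczycy
abadanczyk abadanczyka",
                 col.names = c("lemma", "word"),
                 stringsAsFactors = FALSE)

# Collapse all inflected forms of each lemma into one comma-separated row,
# giving the AntConc-style wide format shown above.
wide <- aggregate(word ~ lemma, data = dt, FUN = paste, collapse = ", ")
wide$line <- paste(wide$lemma, "->", wide$word)
cat(wide$line, sep = "\n")
```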

In brief: how can lemmatization with lemmas from a txt file be done in any of the known CRAN R text-mining packages? And if it can, how should such a txt file be formatted?

UPDATE: Dear @DmitriySelivanov, I got rid of all the diacritical marks; now I would like to apply your tokenizer to the tm corpus docs:

docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")) 

I tried to use it as a tokenizer:

LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")

docsTDM <-
  DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=LemmaTokenizer)) 

It throws this error:

 Error in lemma_hashmap[[tokens]] : 
  attempt to select more than one element in vectorIndex 

The function works like a charm on a plain vector of texts, though.
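One likely cause of that error (my diagnosis, not confirmed in the thread): `lemma_hashmap="lemma_hm"` passes the hashmap's *name* as a string, so `lemma_hashmap[[tokens]]` ends up indexing a length-one character vector with a whole vector of tokens, which is exactly what produces "attempt to select more than one element". Passing the hashmap object itself avoids this. A minimal base-R reproduction, with a named vector standing in for the hashmap object:

```r
tokens <- c("Abadanowi", "Abadanu")

# Indexing a plain string with a multi-token vector reproduces the error:
res <- tryCatch("lemma_hm"[[tokens]], error = function(e) conditionMessage(e))

# Passing a real lookup object (a named vector here, standing in for the
# hashmap) works as intended:
lemma_lookup <- c(Abadanowi = "Abadan", Abadanu = "Abadan")
fixed <- unname(lemma_lookup[tokens])
```

So the call should be `lemma_tokenizer(x, lemma_hashmap = lemma_hm)` with no quotes around `lemma_hm`. Note also that `tm_map` generally expects transformations wrapped in `content_transformer()`, which may be a separate issue here.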

Answer

My guess is that text-mining packages are not needed for this task: you just need to replace each word in the second column with the word from the first column. You can do that by building a hashmap (for example with https://github.com/nathan-russell/hashmap).

Below is an example of how you can create a "lemmatizing" tokenizer, which you can easily use in text2vec (and, I guess, in quanteda as well).

Contributions towards creating such a "lemmatizing" package are very welcome; it would be very useful.

library(hashmap)      # key-value store backed by C++ (github.com/nathan-russell/hashmap)
library(data.table)

# Two-column lemma list: first column is the lemma, second the inflected form
txt = 
  "Abadan  Abadanem
  Abadan  Abadanie
  Abadan  Abadanowi
  Abadan  Abadanu
  abadańczyk  abadańczycy
  abadańczyk  abadańczykach
  abadańczyk  abadańczykami
  "
dt = fread(txt, header = FALSE, col.names = c("lemma", "word"))
# map each inflected form (key) to its lemma (value)
lemma_hm = hashmap(dt$word, dt$lemma)

lemma_hm[["Abadanu"]]
#"Abadan"


lemma_tokenizer = function(x, lemma_hashmap, 
                           tokenizer = text2vec::word_tokenizer) {
  # tokenize each document, then swap every token that has an entry in the
  # hashmap for its lemma; tokens without an entry are kept as-is
  tokens_list = tokenizer(x)
  for(i in seq_along(tokens_list)) {
    tokens = tokens_list[[i]]
    replacements = lemma_hashmap[[tokens]]   # vectorized lookup; NA if absent
    ind = !is.na(replacements)
    tokens_list[[i]][ind] = replacements[ind]
  }
  tokens_list
}
texts = c("Abadanowi abadańczykach OutOfVocabulary", 
          "abadańczyk Abadan OutOfVocabulary")
lemma_tokenizer(texts, lemma_hm)

#[[1]]
#[1] "Abadan"          "abadańczyk"      "OutOfVocabulary"
#[[2]]
#[1] "abadańczyk"      "Abadan"          "OutOfVocabulary"
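If installing the hashmap package is a problem (it is an Rcpp-based package), a plain named character vector gives equivalent behavior for moderately sized lemma lists. This is my own base-R variant, not part of the original answer; the whitespace tokenizer is a stand-in for `text2vec::word_tokenizer`, and the sample rows are ASCII-folded for illustration:

```r
# Build the lookup as a named vector: names are inflected forms,
# values are lemmas.
dt <- read.table(text = "Abadan Abadanem
Abadan Abadanu
abadanczyk abadanczyka",
                 col.names = c("lemma", "word"),
                 stringsAsFactors = FALSE)
lemma_vec <- setNames(dt$lemma, dt$word)

lemma_tokenizer2 <- function(x, lemma_vec) {
  tokens_list <- strsplit(x, "\\s+")  # simple whitespace tokenizer stand-in
  lapply(tokens_list, function(tokens) {
    replacements <- unname(lemma_vec[tokens])  # NA where no lemma is known
    ifelse(is.na(replacements), tokens, replacements)
  })
}

out <- lemma_tokenizer2("Abadanu abadanczyka OutOfVocabulary", lemma_vec)
out
```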
