Issue in DocumentTermMatrix with corpus in German

Views: 152. Posted: 2021/5/4 19:16:26. Tags: r, encoding, utf-8, tm

Problem description

I created a corpus in R using the tm package, specifying language and encoding as follows:

de_DE.corpus <- Corpus(VectorSource(de_DE.sample),
                       readerControl = list(language = "de_DE", encoding = "UTF_8"))
de_DE.corpus[36]$content
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus,
                                control = list(encoding = 'UTF-8'))
inspect(de_DE.dtm[, grepl("grÃ", de_DE.dtm$dimnames$Terms)])
inspect(de_DE.dtm[36, ])

When I look at document 36 via de_DE.corpus[36]$content, which contains 'ü', the text is displayed correctly, e.g. " ...Single ist so die Begründung der Behörde Eine... "

But when I create the DocumentTermMatrix (I tried multiple encoding and language options), I get terms like "begrÃ" where the word should be "Begründung". See the result of inspect(de_DE.dtm[36, ]):

<<DocumentTermMatrix (documents: 1, terms: 21744)>>
Non-/sparse entries: 102/21642
Sparsity           : 100%
Maximal term length: 43
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs begrÃ das dem der die eine einen jobcenter und zum
  36     3   4   2   4   8    2     2         4   3   3

I would appreciate it if someone could tell me how to fix this. Thanks in advance :)

Recommended answer

Can you check your input data? Your code works for me, so I think the problem already occurs when you load the text into de_DE.sample.

doc <- c("Single ist so die Begründung der Behörde Eine", "Single Begründung Behörde ")

de_DE.corpus <- Corpus(VectorSource(doc),
                       readerControl = list(language = "de_DE", encoding = "UTF_8"))
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus,
                                control = list(encoding = 'UTF-8'))

inspect(de_DE.dtm[1, ])
<<DocumentTermMatrix (documents: 1, terms: 7)>>
Non-/sparse entries: 7/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs begründung behörde der die eine ist single
   1          1       1   1   1    1   1      1
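
One way to check the input, sketched below in base R only, is to look at how R has the strings marked before they reach VectorSource(). This assumes the cause is UTF-8 bytes that were declared (or read) as latin1, which would be consistent with the "begrÃ" term but is not confirmed by the question. The temporary file stands in for whatever file de_DE.sample was actually loaded from, and the variable names good, bad, mojibake and fixed are only illustrative.

```r
# "Begründung" written with an escape so the example does not depend on
# how this script itself is saved. Strings with \u escapes are marked UTF-8.
good <- "Begr\u00fcndung"
Encoding(good)

# Simulate a bad import (assumption): same bytes, but declared as latin1.
bad <- good
Encoding(bad) <- "latin1"

# Converting the mis-declared string turns the two UTF-8 bytes of "ü"
# into two separate characters, "Ã" and "¼".
mojibake <- enc2utf8(bad)
nchar(good)      # 10 characters
nchar(mojibake)  # 11 characters

# Repair option 1: re-declare the existing bytes as UTF-8.
fixed <- bad
Encoding(fixed) <- "UTF-8"
identical(fixed, good)

# Repair option 2: state the encoding when reading the file.
# tmp is a stand-in for the real input file.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(good, con, useBytes = TRUE)  # write the raw UTF-8 bytes
close(con)
de_DE.sample <- readLines(tmp, encoding = "UTF-8")
Encoding(de_DE.sample)
unlink(tmp)
```

If Encoding() on the real de_DE.sample reports something other than "UTF-8" for lines that contain umlauts, re-reading the file with an explicit encoding, as in the last step, is a reasonable first thing to try before building the corpus.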
