Keeping Turkish characters with the text mining package for R

Views: 107 | Published: 2020/10/29 6:44:18 | Tags: r, encoding, utf-8, tm

Question

Let me start by saying that I'm still pretty much a beginner with R. Currently I am trying out basic text mining techniques for Turkish texts, using the tm package. I have, however, encountered a problem with the display of Turkish characters in R.

This is what I did:

docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")

My thinking was that setting the language to Turkish and the encoding to UTF-8 (which is the original encoding of the text files) should make it possible to display the Turkish characters İ, ı, ğ, Ğ, ş and Ş. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in ANSI encoding, which cannot display these characters.

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))

also saves the file in ANSI encoding, without the Turkish characters.
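A side note on the call above, offered as a minimal sketch rather than a diagnosis of the character loss itself: according to the base R documentation, the positional arguments of writeLines are text, con, sep and useBytes, and Encoding() only reports the encoding mark of a string. The third positional argument is therefore treated as the line separator, not as an output encoding. The temporary file and the sample lines "a" and "b" are placeholders chosen for illustration.

```r
# Encoding() returns the encoding mark of a string; it does not set
# an output encoding. A pure-ASCII string such as "UTF-8" has no mark.
enc_mark <- Encoding("UTF-8")
print(enc_mark)

# writeLines(text, con, sep, useBytes): the third positional argument
# is 'sep', so the value returned by Encoding("UTF-8") ends up being
# written after every line in place of a newline.
tmp <- tempfile(fileext = ".txt")   # placeholder output file
writeLines(c("a", "b"), con = tmp, Encoding("UTF-8"))

# Read the file back as one raw string to see what was written.
raw_text <- readChar(tmp, nchars = file.size(tmp), useBytes = TRUE)
print(raw_text)
```

So this call changes the separator between lines and leaves the encoding of the written text untouched.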

This seems to not only be an issue with the output file.

writeLines(as.character(docs[[1]]))

for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi"

After reading "UTF-8 file output in R", I also tried the following code:

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)

This did not change the result.
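For reference, here is a minimal sketch of one common way to write UTF-8 text from R without passing through the native (ANSI) code page: open the file connection in binary mode and write the UTF-8 bytes with useBytes = TRUE. It covers the output step only and assumes the strings held in R are still valid UTF-8; it does not show how tm stores or converts the documents. The sample line is the one quoted in the question, and out_path is a hypothetical output location.

```r
# Build the sample line from Unicode escapes so the script itself is ASCII.
# \u00e7 = c-cedilla, \u0131 = dotless i, \u015f = s-cedilla.
line <- "Okul ve cami a\u00e7\u0131l\u0131\u015flar\u0131 umutlar\u0131 art\u0131rd\u0131"

out_path <- tempfile(fileext = ".txt")   # hypothetical output location

# Binary mode plus useBytes = TRUE writes the UTF-8 bytes as they are,
# so no re-encoding to the native code page takes place on output.
con <- file(out_path, open = "wb")
writeLines(enc2utf8(line), con, useBytes = TRUE)
close(con)

# Read the file back, declaring its content as UTF-8, and compare.
back <- readLines(out_path, encoding = "UTF-8")
round_trip_ok <- length(back) == 1 && back == line
print(round_trip_ok)
```

If the round trip succeeds but the console still shows plain i, s and g, the display step rather than the file is the place to look, which would be consistent with the observation that the problem is not limited to the output file.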

All of this is on Windows 7 with the most recent versions of both R and RStudio.

Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.
