Keeping Turkish characters with the text mining package for R

Problem description

Let me start by saying that I'm still pretty much a beginner with R. Currently I am trying out basic text mining techniques for Turkish texts, using the tm package. I have, however, encountered a problem with the display of Turkish characters in R.

This is what I did:

docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")

My thinking was that setting the language to Turkish and the encoding to UTF-8 (which is the original encoding of the text files) should make the display of the Turkish characters İ, ı, ğ, Ğ, ş and Ş possible. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in ANSI encoding, which cannot display these characters.

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))

also saves the file without the characters in ANSI encoding.

This seems to not only be an issue with the output file.

writeLines(as.character(docs[[1]]))

for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi"

After reading this: "UTF-8 file output in R", I also tried the following code:

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)

This did not change the result.

All of this is on Windows 7 with both the most recent version of R and RStudio.

Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.

Recommended answer

Here is how I keep the Turkish characters intact:


  1. Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
  2. Copy and Paste your text containing Turkish characters.
  3. Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding.. -> UTF-8)
  4. yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
  5. yourdocument <- paste(yourdocument, collapse = " ")
  6. After this step you can create your corpus,
  7. e.g. starting from VectorSource() in the tm package.
  8. Turkish characters will appear as they should.
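
Steps 4 to 8 above can be sketched roughly as follows. This is only an illustration, assuming the tm package is installed. The temporary file and the sample sentence (taken from the question) are placeholders for your own UTF-8 encoded .Rmd file. The final part, which writes the text back to disk with enc2utf8() and useBytes = TRUE, is not part of the answer above; it is a commonly used way to avoid re-encoding to the native code page on Windows, and it may behave differently depending on your R version and locale.

```r
library(tm)  # assumption: the tm package is installed

# Placeholder input standing in for the saved .Rmd file. The \u escapes
# keep this script independent of the encoding of the script file itself.
sample_text <- c("Okul ve cami a\u00e7\u0131l\u0131\u015flar\u0131",
                 "umutlar\u0131 art\u0131rd\u0131")
path <- tempfile(fileext = ".Rmd")
con <- file(path, open = "wb")
writeLines(enc2utf8(sample_text), con, useBytes = TRUE)
close(con)

# Step 4: read the file and mark the strings as UTF-8
yourdocument <- readLines(path, encoding = "UTF-8")

# Step 5: collapse all lines into a single string
yourdocument <- paste(yourdocument, collapse = " ")

# Steps 6-7: create the corpus, starting from VectorSource()
docs <- VCorpus(VectorSource(yourdocument))

# Step 8: the Turkish characters should still be present
print(as.character(docs[[1]]))

# Not from the answer: write the text back out as UTF-8 bytes, without
# letting R translate it to the native encoding first.
out <- tempfile(fileext = ".txt")
con <- file(out, open = "wb")
writeLines(enc2utf8(as.character(docs[[1]])), con, useBytes = TRUE)
close(con)
```

With your own data, replace path with the path to your saved .Rmd file and drop the lines that create the sample file.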
