Keeping Turkish characters with the text mining package for R
Problem description
Let me start by saying that I'm still pretty much a beginner with R. Currently I am trying out basic text mining techniques for Turkish texts, using the tm package. I have, however, encountered a problem with the display of Turkish characters in R.
Here is what I did:
docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")
My thinking was that setting the language to Turkish and the encoding to UTF-8 (which is the original encoding of the text files) should make it possible to display the Turkish characters İ, ı, ğ, Ğ, ş and Ş. Instead, the output converts these characters to I, i, g, G, s and S respectively and saves the file in an ANSI encoding, which cannot represent these characters.
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))
also saves the file in ANSI encoding, without the Turkish characters.
This does not seem to be only an issue with the output file.
writeLines(as.character(docs[[1]]))
for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi".
After reading "UTF-8 file output in R", I also tried the following code:
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)
which did not change the results.
All of this is on Windows 7, with the most recent versions of both R and RStudio.
Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.
Recommended answer
Here is how I keep the Turkish characters intact:
- Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
- Copy and paste your text containing Turkish characters.
- Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding... -> UTF-8)
- yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
- yourdocument <- paste(yourdocument, collapse = " ")
- After this step you can create your corpus,
- e.g. starting from VectorSource() in the tm package.
- Turkish characters will appear as they should.
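The steps above can be sketched as a short script. This is only a minimal illustration, not a verified fix for every Windows setup. It makes three assumptions that are not in the answer: a temporary file stands in for the UTF-8-saved yourdocument.Rmd, the sample sentence is written with \u escapes so that the script itself does not depend on the editor's encoding, and the tm part is skipped when the package is not installed.

```r
# Sample text from the question, written with \u escapes
# (c-cedilla = \u00e7, dotless i = \u0131, s-cedilla = \u015f).
sample_text <- "Okul ve cami a\u00e7\u0131l\u0131\u015flar\u0131 umutlar\u0131 art\u0131rd\u0131"

# Assumption: a temporary file stands in for the UTF-8-saved "yourdocument.Rmd".
path <- tempfile(fileext = ".Rmd")
con <- file(path, open = "w")
# useBytes = TRUE writes the UTF-8 bytes as they are, without
# re-encoding them to the native (e.g. ANSI) encoding.
writeLines(enc2utf8(sample_text), con, useBytes = TRUE)
close(con)

# Steps from the answer: read the lines as UTF-8 and collapse them.
yourdocument <- readLines(path, encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")

# Build the corpus from the character vector (only if tm is installed).
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(yourdocument))
  writeLines(as.character(corpus[[1]]))
}
```

Building the corpus from a character vector that is already marked as UTF-8 avoids relying on the encoding handling of DirSource(). Whether the console then prints the characters correctly still depends on the locale and the R version.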