UTF-8 Character Encoding with TermDocumentMatrix


Question

I'm trying to learn R, and I've been trying to solve this problem for hours. I've searched and tried lots of things to fix it, but no luck so far. So here we go: I'm downloading some random tweets from Twitter (via twitteR). When I check my data frame, I can see all the special characters (such as üğıİşçÇöÖ). I then remove some things (whitespace and so on). After all the removing and manipulating, my corpus still looks fine. The character encoding problem starts when I try to create a TermDocumentMatrix: after that, "tdm" and "df" contain some weird symbols and may have lost some characters. Here is the code:

tweetsg.df <- twListToDF(tweets)
#looks good. no encoding problems.
wordCorpus <- Corpus(VectorSource(tweetsg.df$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus, control = list(tokenize = "scan",
                                                     wordLengths = c(3, Inf),
                                                     language = "Turkish"))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)

At this point both tdm and df have weird symbols and missing characters.

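What I've tried so far:

  • Tried different tokenizers, including a custom one.
  • Changed the locale to my own language with Sys.setlocale.
  • Used enc2utf8.
  • Changed my system (Windows 10) display language to my own language.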
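Before changing locales or tokenizers, it can help to see what R itself reports about the text. The following is a minimal diagnostic sketch in base R, not part of the original question: `tweet_text` is a hypothetical stand-in for `tweetsg.df$text`, and `validUTF8()` assumes R 3.3.0 or later.

```r
# Hypothetical stand-in for tweetsg.df$text (written with \u escapes so the
# script itself stays ASCII-only).
tweet_text <- c(
  "de\u011fi\u015ftirdik",
  "plain ascii",
  "\u00fc\u011f\u0131\u0130\u015f\u00e7\u00c7\u00f6\u00d6"
)

# What the session thinks the native character set is.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info()[["UTF-8"]])

# Per-string report: declared encoding, byte-level validity, character count.
encoding_report <- data.frame(
  declared   = Encoding(tweet_text),
  valid_utf8 = validUTF8(tweet_text),
  chars      = nchar(tweet_text, type = "chars"),
  stringsAsFactors = FALSE
)
print(encoding_report)
```

If a string is valid UTF-8 but its declared encoding is "unknown" on a non-UTF-8 system, later steps may interpret its bytes in the native code page, which is one plausible source of the garbled terms described above.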

Still no luck, though! Any kind of help or pointers would be welcome :) PS: I'm a non-native English speaker and an R newbie. Also, if we can solve this, I think I have a problem with emojis too. I would like to remove them or, even better, use them :)

Recommended answer

I've managed to reproduce your issue and made changes to get Turkish output. Try changing the line

wordCorpus <- Corpus(VectorSource(tweetsg.df$text))

to

wordCorpus <- Corpus(DataframeSource(data.frame(tweetsg.df$text)))

and add a line similar to this one:

Encoding(tweetsg.df$text)  <- "UTF-8"
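To illustrate what that assignment does, here is a small base R sketch. It assumes the tweet text arrives as UTF-8 bytes without an encoding mark, which the original post does not confirm; the variable names are made up for the illustration. The point is that `Encoding<-` only declares how the existing bytes should be read and does not convert them.

```r
# UTF-8 bytes for "degistirdik" with Turkish g-breve and s-cedilla.
utf8_bytes <- charToRaw(enc2utf8("de\u011fi\u015ftirdik"))

# rawToChar() returns a string with no encoding mark, imitating text that
# reached R without a declared encoding.
unmarked <- rawToChar(utf8_bytes)
print(Encoding(unmarked))

# Declare the encoding. The bytes are left untouched; only the mark changes.
marked <- unmarked
Encoding(marked) <- "UTF-8"
print(Encoding(marked))
print(identical(charToRaw(marked), utf8_bytes))

# With the mark in place, R counts characters rather than bytes.
print(nchar(marked, type = "chars"))
print(nchar(marked, type = "bytes"))
```

By contrast, `enc2utf8()` converts from the native encoding, so it only helps when the bytes really are in the native encoding to begin with.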

The code I got working is:

library(tm)
sampleTurkish <- "değiştirdik değiştirdik değiştirdik"
Encoding(sampleTurkish)  <- "UTF-8"
#looks good. no encoding problems.
wordCorpus <- Corpus(DataframeSource(data.frame(sampleTurkish)))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)

print(findFreqTerms(tdm, lowfreq=2))

This only worked when the script was run with a source command from the console; clicking the Run or Source button in RStudio didn't work. I also made sure to save the file with "Save with Encoding" set to "UTF-8" (although this is probably only necessary because the file contains Turkish text).

> source("Turkish.R")
[1] "değiştirdik"

It was the second answer to "R tm package: utf-8 text" that was useful in the end.
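One caveat when trying the answer's code today: newer releases of tm, as far as the package's change notes indicate, expect the data frame passed to `DataframeSource()` to have a `doc_id` column and a `text` column. The sketch below adapts the example under that assumption. It is not the answerer's code: the `docs` data frame, the `tweet1` id, and the `_sketch` names are hypothetical, and `VCorpus` is swapped in for `Corpus` to keep the corpus type explicit.

```r
# Sketch only: assumes the tm package is installed and is a newer release
# whose DataframeSource() expects "doc_id" and "text" columns.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # Hypothetical one-row stand-in for the tweet data frame.
  docs <- data.frame(
    doc_id = "tweet1",
    text = enc2utf8("de\u011fi\u015ftirdik de\u011fi\u015ftirdik de\u011fi\u015ftirdik"),
    stringsAsFactors = FALSE
  )

  corpus_sketch <- VCorpus(DataframeSource(docs))
  corpus_sketch <- tm_map(corpus_sketch, removePunctuation)
  corpus_sketch <- tm_map(corpus_sketch, content_transformer(tolower))

  # Build the matrix and list the terms it ended up with.
  tdm_sketch <- TermDocumentMatrix(corpus_sketch)
  print(Terms(tdm_sketch))
}
```

Whether the printed terms keep their Turkish characters still depends on the session's locale and on how the script is run, as described in the answer.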
