Encoding issue using the twitteR R package for Cyrillic tweets


Question

I want to parse Cyrillic tweets with the twitteR package.

I run this simple code to fetch the timeline and print the 5 most recent tweets:

> library("twitteR")
> tweets=userTimeline(user="ru_mts",n=100)
> tweets[1:5]

The output is below. What should I do to make it usable? This is clearly an encoding problem. Thanks.

[[1]]
[1] "ru_mts: @potemkink \037@8 ?@52KH5=88 ;8<8B>2 B@0D8:0 459AB2CNB >3@0=8G5=8O A:>@>AB8. \025ABL CA;C38, =0 :>B>@KE ;8<8B 1>;LH5: http://t.co/EgbYhwfx. #\034\"!"

[[2]]
[1] "ru_mts: @step_42, C40;5=85 8=D-O > ?@52KH5=88 ;8<8B0 \021\030\" ?@>872>48BAO G5@57 *111*219# 2K7>2, 8;8 A<A A B5:AB>< stop =0 5340. \0215A?;0B=>. ^\030\020 #\034\"!"

[[3]]
[1] "ru_mts: @d_kosmos, 2 A;CG05 5A;8 C \0220A =5 ?>;CG05BAO 2>A?>;L7>20BLAO CA;C3>9 \03353:89 ?;0B56, @5:><5=4C5< 2>A?>;L7>20BLAO ?>765. ^\030\020 #\034\"!"

[[4]]
[1] "ru_mts: @d_kosmos, ?@54>AB02LB5 ?>60;C9AB0 \0220H \026B5; \034\"!, =8: 2 B28, =0 blogs@mts.ru \037@>25@8< 8=D>@<0F8N ?> B0@8DC, CA;C30< 8 1>=CA0<. ^\030\020 #\034\"!"

[[5]]
[1] "ru_mts: @katmirabo \034>6=> CB>G=8BL ?@8G8=C A?8A0=89 87 45B0;870F88 2 \030=B5@=5B-\037><>I=8:5: http://t.co/3ydhKfPL 8;8 ?>72>=82 ?> \0260890. ^\030\020 #\034\"!"

Here is the sessionInfo() output:

R version 2.14.0 (2011-10-31)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Unicode_0.1-3  twitteR_0.99.9 RJSONIO_0.95-0 RCurl_1.6-10.1 bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.14.0

Recommended answer

The issue actually lies with RJSONIO::fromJSON and RCurl::getURL, which are (or were) stripping out the 'UTF-8' encoding.
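The garbled output in the question is consistent with each Cyrillic code point losing its high byte: the word "При" is U+041F U+0440 U+0438, and the first tweet's text begins with `\037@8`, which is the byte sequence 0x1F 0x40 0x38. The base-R sketch below reproduces that pattern. It only illustrates the symptom; whether the truncation happens exactly this way inside RJSONIO or RCurl is an assumption, not something verified here.

```r
# Code points of the Cyrillic word that opens the first tweet
# (U+041F, U+0440, U+0438), written numerically so the script stays ASCII.
code_points <- c(0x041F, 0x0440, 0x0438)
original <- intToUtf8(code_points)

# Assumption for illustration: the high byte (0x04) of every code point is
# dropped, so only the low byte survives.
low_bytes <- code_points %% 256
garbled <- rawToChar(as.raw(low_bytes))

# Prints the same escape-and-ASCII mix seen in the question's output.
print(garbled)
```

Because ordinary ASCII characters share those low-byte values, text damaged this way cannot be reliably repaired afterwards; the fix has to happen where the response is decoded.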

RJSONIO::fromJSON() didn't use to preserve the encoding, but it does if you update to RJSONIO 0.96-0.

Duncan is currently looking into the encoding issue in RCurl::getURL (it uses the correct encoding to create the character vector element, but then something odd happens).
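To see what a lost encoding mark looks like in R, the base-R sketch below builds a string from raw UTF-8 bytes, inspects its mark with `Encoding()`, and then re-marks it. It is a generic illustration and an assumption about the kind of symptom involved; it does not call RCurl, and it only helps when the underlying bytes are still intact UTF-8.

```r
# UTF-8 bytes for three Cyrillic letters (U+041F, U+0440, U+0438).
utf8_bytes <- as.raw(c(0xd0, 0x9f, 0xd1, 0x80, 0xd0, 0xb8))

# rawToChar() returns a string with no encoding mark, so R interprets the
# bytes in the native locale (CP1252 on the Windows setup in the question).
unmarked <- rawToChar(utf8_bytes)
print(Encoding(unmarked))

# Declaring the encoding changes no bytes; it only tells R how to read them.
marked <- unmarked
Encoding(marked) <- "UTF-8"
print(Encoding(marked))

# Character count versus byte count once the mark is in place.
print(nchar(marked, type = "chars"))
print(nchar(marked, type = "bytes"))
```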

The short answer is to update RJSONIO to 0.96-0 now, and then update RCurl once the next version is released with a fix.
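A quick way to check whether an installed version is older than the one named above is to compare version strings with `package_version()`. The helper name `needs_update` is hypothetical and exists only for this sketch; the threshold 0.96-0 comes from the answer, and 0.95-0 is the version shown in the question's sessionInfo().

```r
# Hypothetical helper: TRUE when `installed` is older than `minimum`.
needs_update <- function(installed, minimum = "0.96-0") {
  package_version(installed) < package_version(minimum)
}

# Version reported in the question's sessionInfo().
print(needs_update("0.95-0"))

# In a live session the installed version can be read and acted on like this
# (left commented out because it needs the package and network access):
# if (needs_update(as.character(packageVersion("RJSONIO")))) {
#   install.packages("RJSONIO")
# }
```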
