Turn Unicode into Umlaut in R on Mac (Facebook Data)


Problem description


I did a lot of research on this and I still can't find a solution.

I have extracted data from a German Facebook group that looks like

from_ID         from_name           message                                        created_time
12334543        Max Muster          Dies war auch eine sehr sch<U+00F6>ne Bucht    2016-01-08T19:00:54+0000

I understand that <U+00F6> stands for the German umlaut ö. There are many other examples of Unicode codes replacing German umlauts or other language-specific characters (no matter which language).

Whether I want to do a sentiment analysis or just produce a word cloud, this sometimes causes problems. For sentiment analysis, the training data does not contain these Unicode codes, so the prediction/classification goes wrong. For other text-based procedures, cleaning steps like stopword removal break, because stopword lists are also "clean" and do not contain these codes.

Is there an easy way to get rid of this and make R display the corresponding character instead of the code?

I tried a lot. My last resort would be a gsub routine. However, my data frame includes more than 1 million comments. In addition, gsub would be very painful, as there seem to be too many Unicode codes (if we think of more languages than German).

If I understand correctly, it also matters what kind of computer I am using: it is a MacBook Pro.

Any help here is really really appreciated!!

Thank you a lot for your time and help!

Solution

It's a bit mystifying, but this will do it:

message <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht", 
             "Schlo<U+00DF> Sch<U+00F6>nbrunn.")

# convert the <U+00xx> format to R's \u00xx escape format (written "\\u" inside an R string literal)
message2 <- stringi::stri_replace_all_fixed(message, c("<U+", ">"), c("\\u", ""), vectorize_all = FALSE)

# convert to native through parsing and coercing
as.character(parse(text = shQuote(message2)))
## [1] "Dies war auch eine sehr schöne Bucht" "Schloß Schönbrunn." 
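
One caveat with the parse-based approach: it evaluates every message as R source, so as far as I can tell a message containing an apostrophe or a backslash can make parse() fail for the whole vector. With a million Facebook comments that seems likely to happen. Below is a minimal base-R sketch that decodes the tags directly with a single regular expression, so there is no need for one gsub call per character. The function name decode_u_tags is made up for illustration, and the sketch assumes the tags are literally present in the strings as <U+XXXX> with 4 to 6 hex digits and that the input is a character vector (not a factor).

```r
# Hypothetical helper (not from the original answer): replace literal
# <U+XXXX> tags with the characters they stand for.
# Assumes x is a character vector.
decode_u_tags <- function(x) {
  pattern <- "<U\\+[0-9A-Fa-f]{4,6}>"

  # Only touch elements that are non-NA and actually contain a tag.
  hit <- !is.na(x) & grepl(pattern, x)
  if (!any(hit)) return(x)

  y <- x[hit]
  m <- gregexpr(pattern, y)

  # For every matched tag, strip "<U+" and ">", read the hex digits as a
  # code point, and convert that code point to a UTF-8 character.
  regmatches(y, m) <- lapply(regmatches(y, m), function(tags) {
    hex <- substr(tags, 4L, nchar(tags) - 1L)
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })

  x[hit] <- y
  x
}

message <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht",
             "Schlo<U+00DF> Sch<U+00F6>nbrunn.")
decode_u_tags(message)
```

On a data frame this would be applied to the text column, e.g. df$message <- decode_u_tags(df$message), where df is a placeholder for your own data frame. Because only the tags themselves are rewritten, apostrophes, quotes and backslashes elsewhere in a message are left alone.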
