Turn Unicode into Umlaut in R on Mac (Facebook Data)

Views: 164 Published: 2017/10/15 21:29:14 Tags: r facebook text unicode tm


Problem description

I did a lot of research on this and still can't find a solution.

I have extracted data from a German Facebook group that looks like

from_ID         from_name           message                                        created_time
12334543        Max Muster          Dies war auch eine sehr sch<U+00F6>ne Bucht    2016-01-08T19:00:54+0000

I understand that <U+00F6> stands for the German umlaut ö. There are many other examples of such Unicode codes replacing German umlauts or other language-specific characters (no matter which language).

Whether I want to do a sentiment analysis or just produce a word cloud, this sometimes causes problems. For sentiment analysis, the issue is that the training data does not contain these Unicode codes, so the prediction/classification goes wrong. For other text-based procedures, text cleaning such as stopword removal is a problem, because stopword lists are also "clean" and do not contain these codes.

Is there any easy way to get rid of this and to make R display the corresponding sign instead of the code?

I tried a lot. My last resort would be a gsub routine. However, my data frame includes more than 1 million comments. In addition, gsub would be very painful, as there seem to be too many Unicode codes (if we think of more languages than German).

If I understand correctly, it also matters what kind of computer I am using: it is a MacBook Pro.

Any help here is really really appreciated!!

Thank you a lot for your time and help!

Solution

It's a bit mystifying, but this will do it:

message <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht", 
             "Schlo<U+00DF> Sch<U+00F6>nbrunn.")

# convert the <U+00xx> format to R's \\u00xx format for escaped Unicode
message2 <- stringi::stri_replace_all_fixed(message, c("<U+", ">"), c("\\u", ""), vectorize_all = FALSE)

# convert to native through parsing and coercing
as.character(parse(text = shQuote(message2)))
## [1] "Dies war auch eine sehr schöne Bucht" "Schloß Schönbrunn."
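
The parse() route treats each message as R source code, and shQuote() wraps it in single quotes, so a comment that itself contains an apostrophe or a backslash may fail to parse or come out altered. As a hedged alternative, here is a minimal base-R sketch that replaces each <U+XXXX> marker directly with the character it encodes. The helper name decode_u_markers and the assumption that every marker carries 4 to 8 hex digits are mine, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): decode <U+XXXX> markers
# without parsing the text as R code. Assumes 4 to 8 hex digits per marker.
decode_u_markers <- function(x) {
  pattern <- "<U\\+[0-9A-Fa-f]{4,8}>"
  hits <- gregexpr(pattern, x)
  regmatches(x, hits) <- lapply(regmatches(x, hits), function(tokens) {
    # strip the "<U+" prefix and ">" suffix, leaving only the hex digits
    hex <- gsub("^<U\\+|>$", "", tokens)
    # hex digits -> integer code point -> UTF-8 character
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

msgs <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht",
          "Schlo<U+00DF> Sch<U+00F6>nbrunn.")
decode_u_markers(msgs)
## [1] "Dies war auch eine sehr schöne Bucht" "Schloß Schönbrunn."
```

Because the function is vectorised, it could be applied to a whole column in one call, for example df$message <- decode_u_markers(df$message), where df is a hypothetical name for the extracted data frame and the column is assumed to be of type character.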
