Turn Unicode into Umlaut in R on Mac (Facebook Data)
I have done a lot of research on this and still can't find a solution.
I have extracted data from a German Facebook group that looks like this:
from_ID from_name message created_time
12334543 Max Muster Dies war auch eine sehr sch<U+00F6>ne Bucht 2016-01-08T19:00:54+0000
I understand that <U+00F6> stands for the German umlaut ö. There are many other examples of such Unicode codes replacing German umlauts or other language-specific characters (no matter which language).
Whether I want to do a sentiment analysis or just produce a wordcloud, I sometimes run into issues with this. For sentiment analysis, the problem is that the training data does not contain these Unicode codes, so the prediction/classification goes wrong. For other text-based procedures, text cleaning such as stopword removal is a problem, because stopword lists are also "clean" and do not contain these codes.
Is there an easy way to get rid of these codes and make R display the corresponding character instead?
I have tried a lot. My last resort would be a gsub routine. However, my data frame includes more than 1 million comments. In addition, gsub would be very painful, as there seem to be too many Unicode codes to cover (if we think of more languages than German).
If I understand correctly, the kind of computer I am using also matters: it is a MacBook Pro.
Any help here is really really appreciated!!
Thank you a lot for your time and help!
It's a bit mystifying, but this will do it:
message <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht",
"Schlo<U+00DF> Sch<U+00F6>nbrunn.")
# convert the <U+00xx> placeholders to R's \u00xx escape syntax (written "\\u" inside a string literal)
message2 <- stringi::stri_replace_all_fixed(message, c("<U+", ">"), c("\\u", ""), vectorize_all = FALSE)
# parse each quoted string as R code so the \u escapes are interpreted, then coerce back to character
as.character(parse(text = shQuote(message2)))
## [1] "Dies war auch eine sehr schöne Bucht" "Schloß Schönbrunn."
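The parse-based approach above may fail on messages that contain apostrophes or backslashes, because each message is run through R's parser as a quoted string. As an alternative, here is a minimal base R sketch that decodes the placeholders directly. The helper name decode_angle_codes and the comments data frame are hypothetical and not part of the original answer. It assumes every placeholder has the form <U+XXXX> with 4 to 8 hex digits, which as far as I know is how R prints such characters.

```r
# Hypothetical helper (not from the original answer): replace each <U+XXXX>
# placeholder with the character it names, using base R only.
decode_angle_codes <- function(x) {
  # Assumption: placeholders carry 4 hex digits, or 8 for characters
  # outside the Basic Multilingual Plane (e.g. emoji).
  hits <- gregexpr("<U\\+[0-9A-Fa-f]{4,8}>", x, perl = TRUE)
  regmatches(x, hits) <- lapply(regmatches(x, hits), function(codes) {
    hex <- substr(codes, 4L, nchar(codes) - 1L)   # strip "<U+" and ">"
    intToUtf8(strtoi(hex, 16L), multiple = TRUE)  # code point -> character
  })
  x
}

message <- c("Dies war auch eine sehr sch<U+00F6>ne Bucht",
             "Schlo<U+00DF> Sch<U+00F6>nbrunn.")
decoded <- decode_angle_codes(message)
print(decoded)

# Assumed data frame layout, mirroring the columns shown in the question.
comments <- data.frame(from_name = "Max Muster",
                       message = message[1],
                       stringsAsFactors = FALSE)
comments$message <- decode_angle_codes(comments$message)
```

Because the function is vectorized, it can be applied to the whole message column in one call rather than looping over 1 million comments. Text that is not a placeholder is left untouched, so existing quotes and backslashes in the comments should pass through unchanged.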