Reading in Unicode Emoji correctly into R
Problem description
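For example, I have a .csv file (UTF-8 encoded) whose message column contains a row like this:
"IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"
I then read it into R as follows: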
library(tidyverse)
library(janitor)
raw.fb.comments <- read_csv("data.csv",
                            locale = locale(encoding = "UTF-8"))
fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>%
  select(c(message, messagetype, sentiment)) %>%
  mutate(type = "Facebook")
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a
"
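Now, from what I understand from other sources, I need to convert this UTF-8 text to ASCII, which I can then use to link it with other emoji resources (such as the wonderful emojidictionary). For the join to work, I need to convert the emoji into R's byte encoding, like this: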
<e2><9d><a4><ef><b8><8f>
However, adding the usual step (using iconv) does not get me there:
fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>%
  select(c(message, messagetype, sentiment)) %>%
  mutate(type = "Facebook") %>%
  mutate(message = iconv(message, from = "UTF-8", to = "ascii", sub = "byte"))
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups<f0><9f><92><9a><f0><9f><92><9a><f0><9f><92><9a>
"
Can anyone tell me what I am missing, or do I need to find a different emoji mapping resource? Thanks!
Recommended answer
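The goal isn't entirely clear, but I suspect that giving up on representing the emoji properly and reducing them to bytes is not the best approach. For example, if you want to convert the emoji to their descriptions, you could do something like this: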
library(stringi)   # stri_* functions
library(magrittr)  # %>% pipe

x <- "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"

## read emoji info and get rid of documentation lines
readLines("https://unicode.org/Public/emoji/5.0/emoji-test.txt",
          encoding = "UTF-8") %>%
  stri_subset_regex(pattern = "^[^#]") %>%
  stri_subset_regex(pattern = ".+") -> emoji

## get the emoji characters and clean them up
emoji %>%
  stri_extract_all_regex(pattern = "# *.{1,2} *") %>%
  stri_replace_all_fixed(pattern = c("*", "#"),
                         replacement = "",
                         vectorize_all = FALSE) %>%
  stri_trim_both() -> emoji.chars

## get the emoji character descriptions
emoji %>%
  stri_extract_all_regex(pattern = "#.*$") %>%
  stri_replace_all_regex(pattern = "# *.{1,2} *",
                         replacement = "") %>%
  stri_trim_both() -> emoji.descriptions

## replace emoji characters with their descriptions.
stri_replace_all_regex(x,
                       pattern = emoji.chars,
                       replacement = emoji.descriptions,
                       vectorize_all = FALSE)
## [1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cupsgreen heartgreen heartgreen heart"
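As a side note on the byte output shown in the question, the sequence <f0><9f><92><9a> produced by iconv appears to be simply the UTF-8 encoding of the green heart (U+1F49A), so the file was probably read correctly and only the representation differs. The following is a minimal sketch in base R that checks this. The variable names are illustrative and do not come from the original post, and the emoji is written as an escape so the script does not depend on the encoding of the source file.

```r
# Base R only. "\U0001F49A" is the green heart, written as an escape.
heart <- "\U0001F49A"

# Unicode code point of the character, as an integer and as U+XXXX text.
code_point <- utf8ToInt(heart)
code_point_hex <- sprintf("U+%X", code_point)

# Raw UTF-8 bytes, formatted the way iconv(..., sub = "byte") prints them.
heart_bytes <- charToRaw(enc2utf8(heart))
byte_form <- paste0("<", as.character(heart_bytes), ">", collapse = "")

# Going the other way: rebuild the character from the "<xx>" byte form.
hex <- regmatches(byte_form, gregexpr("[0-9a-f]{2}", byte_form))[[1]]
restored <- rawToChar(as.raw(strtoi(hex, 16L)))
Encoding(restored) <- "UTF-8"

print(code_point_hex)  # "U+1F49A"
print(byte_form)       # "<f0><9f><92><9a>"
```

Because the byte form can be turned back into the original character, a lookup table keyed on either the character itself or its byte form should work for a join, as long as both sides use the same representation.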