Emoji in R [UTF-8 encoding]

Question

I'm trying to do an emoji analysis in R. I have stored some tweets that contain emojis.

Here is one of the tweets that I want to analyze:

> tweetn2
[1] "Programme du week-end: \xed\xa0\xbd\xed\xb2\x83\xed\xa0\xbc \xed\xbe\xb6\xed\xa0\xbc 
    \xed\xbd\xbb\xed\xa0\xbc\xed\xbd\xbb\xed\xa0\xbc \xed\xbd\xbb\xed\xa0\xbc\xed\xbd\xbb"

I made sure the string is marked as UTF-8:

> Encoding(tweetn2)
[1] "UTF-8"

Now when I try to match some of the characters, it doesn't work:

> grepl("\\xed",tweetn2)
[1] FALSE

> grepl("xed",tweetn2)
[1] FALSE

But it seems that the emoji bytes "\xed\xa0\xbd" are not valid UTF-8, because I get an error message when I run:

> str(tweetn2)
Error in str.default(tweetn2) : invalid multibyte string, element 1

I found a kind of solution that uses the iconv() function and "ASCII" encoding here:
http://www.r-bloggers.com/emoticons-decoder-for-social-media-sentiment-analysis-in-r/

But I want to keep using "UTF-8" for my analysis because it works well with French accented letters (à, é, è, ê, ë, û, etc.).

Do you have any idea how I can get around this?

Thanks.

Recommended answer

The string is invalid UTF-8, as the error indicates. What you have there is UTF-16 that has then been encoded as UTF-8: each UTF-16 surrogate was written out as its own three-byte sequence. So \xED\xA0\xBD is the high surrogate U+D83D, and \xED\xB2\x83 is the low surrogate U+DC83.
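To see where those two values come from, here is a minimal sketch in R that unpacks a three-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx into the 16-bit value it carries. The helper name `decode3` is made up for this illustration; it is not part of base R.

```r
# Hypothetical helper: extract the 16-bit value stored in one
# three-byte UTF-8-style sequence (1110xxxx 10xxxxxx 10xxxxxx).
decode3 <- function(bytes) {
  b <- as.integer(bytes)
  bitwAnd(b[1], 0x0FL) * 4096L +  # low 4 bits of the lead byte
    bitwAnd(b[2], 0x3FL) * 64L +  # low 6 bits of the 2nd byte
    bitwAnd(b[3], 0x3FL)          # low 6 bits of the 3rd byte
}

hi <- decode3(as.raw(c(0xED, 0xA0, 0xBD)))
lo <- decode3(as.raw(c(0xED, 0xB2, 0x83)))

sprintf("U+%04X", hi)  # "U+D83D" (high surrogate range D800..DBFF)
sprintf("U+%04X", lo)  # "U+DC83" (low surrogate range DC00..DFFF)
```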

If you apply the usual (high, low) -> code point formula for surrogate pairs, you end up with the actual code point:

(0xD83D - 0xD800) * 0x400 + 0xDC83 - 0xDC00 + 0x10000 = 0x1F483
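The same arithmetic can be checked directly in R. This is only a sketch of the formula above; the variable names are chosen here for illustration.

```r
high <- 0xD83D  # high surrogate from the first three bytes
low  <- 0xDC83  # low surrogate from the next three bytes

# Surrogate pair -> code point, exactly as in the formula above
cp <- (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

sprintf("U+%X", cp)     # "U+1F483"
emoji <- intToUtf8(cp)  # the character itself, as a UTF-8 string
```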

You'll see this is the dancer emoji. Unfortunately I don't have a concrete suggestion for you, as I'm not that familiar with R. But I can say you'd certainly want to get yourself into a position where this data is not double encoded. Hope that helps nudge you in the right direction.
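For what it's worth, one way to undo the double encoding after the fact might look like the sketch below. It is not from the original answer: `fix_surrogates` is a hypothetical helper that walks the raw bytes, replaces each encoded high/low surrogate pair with the proper four-byte UTF-8 sequence, and copies every other byte through unchanged (so accented letters such as é are untouched). It assumes the pairs are intact and adjacent; unpaired surrogates are left as they are.

```r
# Hypothetical repair helper, a sketch under the assumptions stated above.
fix_surrogates <- function(x) {
  b <- as.integer(charToRaw(x))
  n <- length(b)
  out <- integer(0)
  i <- 1L
  while (i <= n) {
    # ED A0..AF xx = high surrogate, ED B0..BF xx = low surrogate
    is_pair <- i + 5L <= n &&
      b[i] == 0xED && b[i + 1L] >= 0xA0 && b[i + 1L] <= 0xAF &&
      b[i + 3L] == 0xED && b[i + 4L] >= 0xB0 && b[i + 4L] <= 0xBF
    if (is_pair) {
      hi <- 0xD000L + bitwAnd(b[i + 1L], 0x3FL) * 64L + bitwAnd(b[i + 2L], 0x3FL)
      lo <- 0xD000L + bitwAnd(b[i + 4L], 0x3FL) * 64L + bitwAnd(b[i + 5L], 0x3FL)
      cp <- (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
      out <- c(out, as.integer(charToRaw(intToUtf8(cp))))
      i <- i + 6L
    } else {
      out <- c(out, b[i])  # ordinary byte: copy through unchanged
      i <- i + 1L
    }
  }
  res <- rawToChar(as.raw(out))
  Encoding(res) <- "UTF-8"
  res
}

# Sample input built from raw bytes: text followed by one encoded pair
bad <- rawToChar(c(charToRaw("Programme du week-end: "),
                   as.raw(c(0xED, 0xA0, 0xBD, 0xED, 0xB2, 0x83))))
fixed <- fix_surrogates(bad)
validUTF8(fixed)  # TRUE
```

After the repair, functions such as `grepl()` and `str()` should be able to treat the emoji as an ordinary UTF-8 character.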
