Twitter emoji encoding problems with twitteR and R
Question
I'm trying to build a way to find emojis in tweets and relate them to the Unicode table published at unicode.org, but I'm finding them hard to identify because of what I think are encoding problems, or simply my own misunderstanding of the topic. In short, I built a "library" of emojis from the table at http://www.unicode.org/emoji/charts/full-emoji-list.html, which contains the title and the code point (code) of each emoji. I scraped it in R with the rvest library.
The problem comes when I grab the information from Twitter with the twitteR API in R: the codes for the emojis do not look at all like the ones in this table.
Take the emoji of the red 100 (hundred points) icon as an example. It is number 1468 in the table linked above, and its code point is:
U+1F4AF
Now, when I grab it from Twitter, it is first shown like this in the status class that the API provides for working with tweets:
\xed��\xed��
Then I convert it to a data frame, also with a built-in function from the twitteR API. For example:
tweet$toDataFrame()
The emoji then becomes:
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
I tried to convert it with the iconv function in R, using the following code:
iconv(tweet$text, from = "UTF-8", to = "ASCII", sub = "byte")
and I only managed to make it look like this:
<ed><a0><bd><ed><b2><af>
So, to wrap up, at the end of my tests I had the following results:
<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��
None of these looks like the code point specified by the table:
U+1F4AF
Is it possible to transform between the two strings? What am I missing? Why is Twitter returning this information for emojis?
Recommended answer
I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding of emoji works, but I stumbled upon the same problem and solved it.
You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way could be to scrape a dictionary online and use a key, such as the Unicode code point, to replace it. In this case the key would be U+1F4AF.
The conversions you show are not different encodings but different notations for the same encoded emoji:
as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
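As a small illustration of how one string can be shown in several notations, the sketch below starts from an emoji written directly by code point rather than from a tweet. The variable name x is only illustrative, and because this string is ordinary four-byte UTF-8, the bytes it prints are the <f0>... form rather than the <ed>...<ed>... form seen in the tweet.

```r
# Assumption: the emoji is typed by code point, not read from twitteR.
x <- "\U0001F4AF"

utf8ToInt(x)   # the code point as an integer
charToRaw(x)   # the raw bytes R stores for the string

# The same bytes, written in the <xx> notation used above.
iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
```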
So using the Unicode code point directly isn't feasible. Another way could be to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like the one here: emoji list. Voilà! However, that list is incomplete because it comes from a dictionary that contains fewer emoji.
The fast solution is simply to scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text. I have already done that and posted it here.
It bugged me, though, that nobody else had posted a list in this encoding. In fact, most dictionaries I found were UTF-8 encoded using not the <ed>...<ed>... representation but rather <f0>... . It turns out that both byte sequences stand for the same code point U+1F4AF; the bytes are just produced differently. The <f0>... form is the standard four-byte UTF-8 encoding, while the <ed>...<ed>... form encodes each half of the UTF-16 surrogate pair as its own three-byte sequence, which strict UTF-8 decoders reject.
Long answer. The tweet is read as UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When the conversion works on one 16-bit unit (a pair of bytes) at a time, the result is <ed>...<ed>...; when it works on the whole four-byte surrogate pair at once, the result is the UTF-8 <f0>... . (Why one path is taken rather than the other, I don't fully understand; it appears to depend on the software doing the conversion.)
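The two outcomes can be reproduced by hand with the UTF-8 bit layouts. This is a minimal sketch, not what twitteR does internally: encode_3byte, encode_4byte and to_hex are hypothetical helper names, and the arithmetic simply fills the 1110xxxx 10xxxxxx 10xxxxxx and 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx patterns.

```r
# Three-byte layout, for values from 0x0800 to 0xFFFF (hypothetical helper).
encode_3byte <- function(u) {
  c(0xE0 + u %/% 0x1000,
    0x80 + (u %/% 0x40) %% 0x40,
    0x80 + u %% 0x40)
}

# Four-byte layout, for code points from 0x10000 to 0x10FFFF (hypothetical helper).
encode_4byte <- function(cp) {
  c(0xF0 + cp %/% 0x40000,
    0x80 + (cp %/% 0x1000) %% 0x40,
    0x80 + (cp %/% 0x40) %% 0x40,
    0x80 + cp %% 0x40)
}

# Format a vector of byte values as space-separated hex.
to_hex <- function(bytes) paste(sprintf("%02x", as.integer(bytes)), collapse = " ")

cp <- 0x1F4AF
hi <- 0xD800 + (cp - 0x10000) %/% 0x400   # high surrogate, 0xD83D
lo <- 0xDC00 + (cp - 0x10000) %% 0x400    # low surrogate, 0xDCAF

whole_pair  <- to_hex(encode_4byte(cp))                        # "f0 9f 92 af"
unit_by_unit <- to_hex(c(encode_3byte(hi), encode_3byte(lo)))  # "ed a0 bd ed b2 af"
```

Encoding the code point as a whole gives the <f0>... bytes, while encoding each surrogate separately gives the <ed>...<ed>... bytes seen in the tweet.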
So a slower (but more deliberate) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 one 16-bit unit at a time; you will end up with two <ed>... sequences. These two <ed>... sequences are the high-low surrogate pair representation of the Unicode code point U+xxxxx.
For example:
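unicode <- 0x1F4AF
# Multibyte version: the code point encoded directly
intToUtf8(unicode)
# Byte-pair version: the surrogate pair from unicode2hilo() (defined in PS1 below),
# with each half encoded on its own
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)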
returns:
[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"
Which, again using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:
[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
PS1: The function unicode2hilo is a simple linear transformation from a Unicode code point to its high-low surrogate pair; hilo2unicode is the inverse.
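# Split a code point above U+FFFF into its UTF-16 surrogate pair (as hex strings).
unicode2hilo <- function(unicode){
  hi = floor((unicode - 0x10000)/0x400) + 0xd800          # high surrogate
  lo = (unicode - 0x10000) + 0xdc00 - (hi-0xd800)*0x400   # low surrogate
  hilo = paste('0x', as.hexmode(c(hi,lo)), sep = '')
  return(hilo)
}
# Inverse: combine a high and a low surrogate into the code point (as a hex string).
hilo2unicode <- function(hi,lo){
  unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
  unicode = paste('0x', as.hexmode(unicode), sep = '')
  return(unicode)
}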
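Going the other way, from the <ed>...<ed>... notation back to the code point in the unicode.org table, could be sketched as follows. The helper names parse_byte_tags, decode_3byte and surrogate_bytes_to_codepoint are hypothetical, and the sketch assumes the input is exactly one surrogate pair written as six <xx> tags.

```r
# Pull the byte values out of a string such as "<ed><a0><bd>" (hypothetical helper).
parse_byte_tags <- function(s) {
  tags <- regmatches(s, gregexpr("<[0-9a-f]{2}>", s))[[1]]
  strtoi(substr(tags, 2, 3), base = 16L)
}

# Undo the three-byte layout 1110xxxx 10xxxxxx 10xxxxxx.
decode_3byte <- function(b) {
  (b[1] - 0xE0) * 0x1000 + (b[2] - 0x80) * 0x40 + (b[3] - 0x80)
}

# Turn six surrogate-style bytes into a "U+XXXXX" label.
surrogate_bytes_to_codepoint <- function(bytes) {
  bytes <- as.integer(bytes)
  stopifnot(length(bytes) == 6)
  hi <- decode_3byte(bytes[1:3])
  lo <- decode_3byte(bytes[4:6])
  # Assumption: the input really is a high surrogate followed by a low one.
  stopifnot(hi >= 0xD800, hi <= 0xDBFF, lo >= 0xDC00, lo <= 0xDFFF)
  cp <- (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
  sprintf("U+%X", as.integer(cp))
}

label <- surrogate_bytes_to_codepoint(parse_byte_tags("<ed><a0><bd><ed><b2><af>"))
label  # "U+1F4AF"
```

The resulting label can then be used as the lookup key in a table scraped from unicode.org.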
PS2: I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
PS3: To replace the emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS in a graph of emojis, because some emojis are encoded as the concatenation of other, simpler code points. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while taken independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> renders as nothing (it is the zero-width joiner), <e2><99><82> is the male sign, and <ef><b8><8f> renders as nothing (it is a variation selector). Although "man cartwheeling" and "person cartwheeling male sign" are obviously semantically related, I prefer the more faithful translation.
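A simpler alternative to a full DFS, shown here only as a sketch, is to try the longest sequences first so that a composite emoji is replaced before any of its parts. The three-entry emoji_dict and the replace_emoji function are hypothetical; a real dictionary would be scraped as described above.

```r
# Assumption: a tiny hand-written dictionary keyed by the <xx> byte notation.
emoji_dict <- c(
  "<f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f>" = "man cartwheeling",
  "<f0><9f><a4><b8>" = "person cartwheeling",
  "<e2><99><82>" = "male sign"
)

replace_emoji <- function(text, dict) {
  # Longest keys first, so composite sequences win over their components.
  keys <- names(dict)[order(nchar(names(dict)), decreasing = TRUE)]
  for (key in keys) {
    text <- gsub(key, paste0("[", dict[[key]], "]"), text, fixed = TRUE)
  }
  text
}

tweet <- "nice <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> and <f0><9f><a4><b8>"
result <- replace_emoji(tweet, emoji_dict)
result  # "nice [man cartwheeling] and [person cartwheeling]"
```

With shortest-first ordering the first emoji would instead come out as person cartwheeling followed by leftover joiner bytes and a male sign, which is the less faithful translation mentioned above.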