Twitter emoji encoding problems with twitteR and R

Question

I'm trying to build a way to find emojis in tweets and relate them to the Unicode table published on unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of the topic. In short, I built a "library" of emojis from the table at http://www.unicode.org/emoji/charts/full-emoji-list.html, which contains the title and the code point (code) of each emoji. I scraped it in R with the rvest library.

The problem comes when I grab the information from Twitter with the twitteR API in R: the codes for the emojis do not look at all like the ones in this table.

Take the emoji of the red 100 (hundred points) icon as an example. It is number 1468 in the table linked above, and its code point is:

U+1F4AF

Now, when I grab it from Twitter, it is first shown like this in the status class that the API provides for working with tweets:

\xed\xa0\xbd\xed\xb2\xaf

Then I convert it to a data frame, also with a built-in function from the twitteR API. For example:

tweet$toDataFrame()

The emoji then looks like this:

<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>

I tried to convert it with the iconv function in R, using the following code:

iconv(tweet$text, from = "UTF-8", to = "ASCII", sub = "byte")

and I only managed to make it look like this:

<ed><a0><bd><ed><b2><af>

So, to wrap up, at the end of my tests I had the following results:

<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed\xa0\xbd\xed\xb2\xaf

None of these looks like the code point given in the table:

U+1F4AF

Is there any way to transform between the two strings? What am I missing? Why is Twitter returning this information for emojis?

Recommended answer

I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding of emoji works, but I stumbled upon the same problem and solved it.

You want to map \xed\xa0\xbd\xed\xb2\xaf to its decoded name: hundred points. A sensible way would be to scrape a dictionary online and use a key, such as the Unicode code point, for the replacement. In this case it would be U+1F4AF. The conversions you show are not different encodings but different notations for the same encoded emoji:

  1. as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
  2. iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
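
To make the point about notation concrete, here is a minimal sketch that rewrites the data-frame notation into the iconv byte notation and checks that the two strings then match. The helper name normalize_byte_tags is hypothetical (it is not part of twitteR), and the sketch assumes the text has already been converted to these ASCII tag forms.

```r
# Hypothetical helper (not part of twitteR): rewrite the data-frame
# notation <U+00A0> as the byte notation <a0> so both forms compare equal.
normalize_byte_tags <- function(x) {
  # \\L lowercases the captured hex digits, \\E ends the case conversion
  # (both are only available with perl = TRUE).
  gsub("<U\\+00([0-9A-Fa-f]{2})>", "<\\L\\1\\E>", x, perl = TRUE)
}

df_form    <- "<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>"
iconv_form <- "<ed><a0><bd><ed><b2><af>"

# Compare the normalized data-frame form with the iconv form
identical(normalize_byte_tags(df_form), iconv_form)
```

Normalizing to one notation first means a single dictionary keyed on <ed>... byte tags can serve both outputs.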

So using Unicode directly isn't feasible. Another way is to use a dictionary that already encodes emoji in the <ed>...<ed>... form, like the one here: emoji list. Voilà! The only catch is that her list is incomplete, because it comes from a dictionary that contains fewer emoji.

The fast solution is simply to scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text. I have already done that and posted it here.

It still bugged me that nobody else had posted a list with the proper encoding. In fact, most dictionaries I found were UTF-8 encoded using not the <ed>...<ed>... representation but rather <f0>.... It turns out both are encodings of the same code point, U+1F4AF; the bytes are just read differently. (Strictly speaking, only the <f0>... form is well-formed UTF-8; the <ed>...<ed>... form encodes each UTF-16 surrogate separately, a scheme usually called CESU-8.)

Long answer. The tweet is read as UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When the text is read in pairs of bytes, the result is the UTF-8-style <ed>...<ed>...; when it is read in chunks of four bytes, the result is the UTF-8 <f0>.... (Why is this? I don't fully understand, but I suspect it has something to do with the architecture of your processor.)
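
The two results can be reproduced with plain arithmetic. The sketch below is an illustration under stated assumptions, not the conversion that twitteR or R performs internally: it encodes U+1F4AF once as a single four-byte sequence, and once by first splitting it into its two UTF-16 surrogates and encoding each as its own three-byte sequence. All function names are hypothetical.

```r
# Encode a code point above U+FFFF as one 4-byte UTF-8 sequence.
utf8_4byte <- function(cp) {
  c(0xF0 + cp %/% 2^18,
    0x80 + (cp %/% 2^12) %% 64,
    0x80 + (cp %/% 2^6) %% 64,
    0x80 + cp %% 64)
}

# Encode one 16-bit unit (here: a surrogate) as a 3-byte sequence.
utf8_3byte <- function(unit) {
  c(0xE0 + unit %/% 2^12,
    0x80 + (unit %/% 2^6) %% 64,
    0x80 + unit %% 64)
}

# Split a code point above U+FFFF into its high and low surrogates.
surrogates <- function(cp) {
  v <- cp - 0x10000
  c(0xD800 + v %/% 0x400, 0xDC00 + v %% 0x400)
}

# Format bytes in the <xx> notation that iconv(..., sub = "byte") prints.
as_tags <- function(bytes) {
  paste0("<", sprintf("%02x", as.integer(bytes)), ">", collapse = "")
}

cp <- 0x1F4AF
direct   <- as_tags(utf8_4byte(cp))
pairwise <- as_tags(unlist(lapply(surrogates(cp), utf8_3byte)))

direct    # "<f0><9f><92><af>"
pairwise  # "<ed><a0><bd><ed><b2><af>"
```

Both strings come from the same code point; the only difference is whether the surrogate split happens before the bytes are produced.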

So a slower (but more deliberate) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, then convert it back to UTF-8 pair by pair; you will end up with two <ed>... sequences. These two <ed>... sequences are known as the high-low surrogate pair representation of the Unicode code point U+xxxxx.

For example:

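unicode <- 0x1F4AF

# Multibyte version: the code point encoded as a single 4-byte sequence
intToUtf8(unicode)

# Byte-pair version: each surrogate encoded separately
# (unicode2hilo is defined in PS1 below; recent R versions treat
# surrogate code points as invalid here and may return NA instead)
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)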

This returns:

[1] "\xf0\x9f\x92\xaf"
[1] "\xed\xa0\xbd\xed\xb2\xaf"

Which, again using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:

[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"

PS1.: The function unicode2hilo is a simple linear transformation from a Unicode code point to its hi-lo surrogate pair; hilo2unicode is the inverse.

# Split a code point above U+FFFF into its UTF-16 high and low surrogates,
# returned as hex strings such as "0xd83d" "0xdcaf"
unicode2hilo <- function(unicode) {
  hi <- floor((unicode - 0x10000) / 0x400) + 0xD800
  lo <- (unicode - 0x10000) + 0xDC00 - (hi - 0xD800) * 0x400
  paste0('0x', as.hexmode(c(hi, lo)))
}

# Inverse: combine a high and a low surrogate back into the code point
hilo2unicode <- function(hi, lo) {
  unicode <- (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
  paste0('0x', as.hexmode(unicode))
}
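
Going in the other direction, the same hi-lo arithmetic can turn the <ed>... byte notation back into the code point from the unicode.org table. This is a minimal sketch with hypothetical helper names; it assumes the input is exactly one surrogate pair written as six <xx> tags and does no validation beyond that.

```r
# Parse "<ed><a0>..." into a numeric vector of byte values.
tags_to_bytes <- function(tags) {
  hex <- regmatches(tags, gregexpr("[0-9a-f]{2}", tags))[[1]]
  strtoi(hex, base = 16L)
}

# Decode one 3-byte sequence (1110xxxx 10xxxxxx 10xxxxxx) to a 16-bit unit.
decode_3byte <- function(b) {
  (b[1] - 0xE0) * 2^12 + (b[2] - 0x80) * 2^6 + (b[3] - 0x80)
}

# Assumes exactly two surrogates, i.e. six bytes.
tags_to_codepoint <- function(tags) {
  b <- tags_to_bytes(tags)
  stopifnot(length(b) == 6)
  hi <- decode_3byte(b[1:3])
  lo <- decode_3byte(b[4:6])
  (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
}

cp <- tags_to_codepoint("<ed><a0><bd><ed><b2><af>")
sprintf("U+%X", as.integer(cp))  # "U+1F4AF"
```

With this, tweets can be matched against a dictionary keyed on code points instead of rebuilding the dictionary in <ed>... form.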

PS2.: I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.

PS3.: To replace an emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS on a graph of emojis, because some emojis are the concatenation of other, simpler code points. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is nothing visible (a zero-width joiner), <e2><99><82> is a male sign, and <ef><b8><8f> is nothing visible (a variation selector). While man cartwheeling and person cartwheeling male sign are obviously semantically related, I prefer the more faithful translation.
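
As a simpler stand-in for the DFS suggested above (a different technique, named plainly: longest-key-first replacement), the sketch below tries the longest byte sequences first so that a composite emoji is translated before its components can match. The three-entry dictionary and the function name are hypothetical; a real dictionary would be scraped.

```r
# Hypothetical mini-dictionary keyed on the <f0>... byte notation.
emoji_dict <- c(
  "<f0><9f><a4><b8>" = "person cartwheeling",
  "<e2><99><82>" = "male sign",
  "<f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f>" = "man cartwheeling"
)

# Replace keys in order of decreasing length so composite sequences win.
replace_longest_first <- function(text, dict) {
  keys <- names(dict)[order(nchar(names(dict)), decreasing = TRUE)]
  for (key in keys) {
    text <- gsub(key, paste0("[", dict[[key]], "]"), text, fixed = TRUE)
  }
  text
}

tweet <- "look <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f>"
out <- replace_longest_first(tweet, emoji_dict)
out  # "look [man cartwheeling]"
```

This loop is quadratic in the worst case and only approximates the graph search; it is enough to show why matching order matters for composite emoji.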
