Twitter emoji encoding problems with twitteR and R


Problem description

I'm trying to build a way to find emojis in tweets and relate them to the Unicode table published at unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of this topic. In short, I built a "library" of emojis from the table at http://www.unicode.org/emoji/charts/full-emoji-list.html, which contains the title and the code point (code) of each emoji. I scraped this in R with the rvest package.

The problem comes when I grab the information from Twitter with the twitteR API in R, because the codes for the emojis do not look at all like the ones in this table.

Let's take the emoji of the red 100 (hundred points) icon as an example. It is number 1468 in the table linked above, and its code point is:

U+1F4AF

Now, when I grab it from Twitter, it is first shown like this in the status class that the API provides for working with tweets:

\xed��\xed��

Then I convert it to a data frame, again with a built-in function from the twitteR API. For example:

tweet$toDataFrame()

The emoji becomes this:

<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>

I tried to convert it with the function iconv in R, with the following code:

iconv(tweet$text, from="UTF-8", to="ASCII", "byte")

and I only manage to make it look like this:

<ed><a0><bd><ed><b2><af>

So, to wrap up, at the end of my tests I got the following results:

<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��

None of these looks like the code point specified by the table:

U+1F4AF

Is there any possibility to transform between the two strings? What am I missing? Why is twitter returning this information for emojis?

Solution

I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding for emoji works, but I stumbled upon the same problem and solved it.

You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way could be to scrape a dictionary online and use a key, such as Unicode, to replace it. In this case it would be U+1F4AF. The conversions you show are not different encodings but different notation for the same encoded emoji:

  1. as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
  2. iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
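Both notations above spell out the same six bytes. As a minimal sketch, assuming the bytes are typed in by hand rather than fetched from Twitter, the `<xx>` notation can be reproduced in base R without `iconv`. The helper name `bytes_to_tags` is hypothetical and not part of twitteR:

```r
# Hypothetical helper: render raw bytes in the <xx> notation shown above.
bytes_to_tags <- function(bytes) {
  paste0("<", as.character(bytes), ">", collapse = "")
}

# The six bytes reported for the "hundred points" emoji in the question
tweet_bytes <- as.raw(c(0xed, 0xa0, 0xbd, 0xed, 0xb2, 0xaf))
tags <- bytes_to_tags(tweet_bytes)
print(tags)
```

Working on raw bytes sidesteps any platform-specific behaviour of the string conversion itself.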

So using the Unicode code point directly isn't feasible. Another way could be to use a dictionary that already encodes emoji in the <ed>...<ed>... form, like the one here: emoji list. Voilà! Except that her list is incomplete, because it comes from a dictionary that contains fewer emoticons.

The fast solution is simply to scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text. I have already done that and posted it here.

What bugged me, though, was that nobody else had posted a list with this encoding. In fact, most dictionaries I found used a UTF-8 encoding with an <f0>... representation rather than <ed>...<ed>.... It turns out both byte sequences stand for the same code point U+1F4AF; only the bytes are read differently. Strictly speaking, only the <f0>... form is standard UTF-8; the <ed>...<ed>... form encodes each UTF-16 surrogate as its own three-byte sequence, a scheme known as CESU-8.

Long answer. The tweet is read as UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When the conversion works on pairs of bytes (one 16-bit unit at a time), the result is <ed>...<ed>...; when it works on chunks of four bytes (both surrogates together), the result is <f0>.... (Why is this? I don't fully understand, but I suspect it has something to do with the architecture of your processor.)

So a slower (but more deliberate) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 pair by pair; you'll end up with two <ed>... sequences. These two <ed>... sequences are the high-low surrogate pair representation of the code point U+xxxxx.
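The conversion just described can be sketched with plain arithmetic, assuming the dictionary supplies numeric code points above U+FFFF. The function names `to_surrogates`, `encode3` and `cesu_key` are hypothetical helpers written for this illustration, not functions from any package:

```r
# Split a code point above U+FFFF into its UTF-16 high and low surrogates.
to_surrogates <- function(cp) {
  offset <- cp - 0x10000
  c(hi = 0xD800 + offset %/% 0x400, lo = 0xDC00 + offset %% 0x400)
}

# Encode one 16-bit value as three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
encode3 <- function(u) {
  as.raw(c(0xE0 + u %/% 0x1000,
           0x80 + (u %/% 0x40) %% 0x40,
           0x80 + u %% 0x40))
}

# Build the "<ed>...<ed>..." dictionary key for a code point.
cesu_key <- function(cp) {
  s <- to_surrogates(cp)
  bytes <- c(encode3(s[["hi"]]), encode3(s[["lo"]]))
  paste0("<", as.character(bytes), ">", collapse = "")
}

key <- cesu_key(0x1F4AF)
print(key)
```

Doing the arithmetic by hand avoids relying on how a particular R version or platform treats surrogate values during string conversion.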

As an example:

unicode <- 0x1F4Af

# Multibyte Version
intToUtf8(unicode)

# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)

Returns:

[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"

Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:

[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
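The multibyte line can also be checked without `iconv`, using only base R functions (`intToUtf8`, `charToRaw`, `utf8ToInt`). This is a small sketch of that check. Note that the byte-pair call may behave differently in newer R versions, where, as far as I know, `intToUtf8` maps values in the surrogate range to `NA` by default.

```r
emoji <- intToUtf8(0x1F4AF)        # standard UTF-8 string for U+1F4AF
utf8_bytes <- charToRaw(emoji)     # its four bytes

# Same <xx> notation as in the outputs above
utf8_tags <- paste0("<", as.character(utf8_bytes), ">", collapse = "")

# Round trip: UTF-8 string back to its code point
code_point <- sprintf("U+%X", utf8ToInt(emoji))

print(utf8_tags)
print(code_point)
```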

PS1.: The function unicode2hilo is a simple linear transformation from a Unicode code point to its hi-lo surrogate pair; hilo2unicode is the inverse.

# Code point (above U+FFFF) -> high and low surrogates, as hex strings
unicode2hilo <- function(unicode){
   hi = floor((unicode - 0x10000)/0x400) + 0xd800          # high surrogate
   lo = (unicode - 0x10000) + 0xdc00 - (hi-0xd800)*0x400   # low surrogate
   hilo = paste('0x', as.hexmode(c(hi,lo)), sep = '')
   return(hilo)
}

# High and low surrogates (numeric) -> code point, as a hex string
hilo2unicode <- function(hi,lo){
   unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
   unicode = paste('0x', as.hexmode(unicode), sep = '')
   return(unicode)
}
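To go from the notation the question ends up with back to the code point in the unicode.org table, the same arithmetic can be applied to the bytes themselves. The sketch below assumes the input is exactly one emoji in the lower-case `<xx>` notation (six bytes, two surrogates); `tags_to_codepoint` is a hypothetical name used only here:

```r
# Hypothetical helper: turn "<ed><a0><bd><ed><b2><af>" into "U+1F4AF".
tags_to_codepoint <- function(tags) {
  # Extract the two-digit hex values that sit between "<" and ">"
  m <- gregexpr("(?<=<)[0-9a-fA-F]{2}(?=>)", tags, perl = TRUE)
  b <- strtoi(regmatches(tags, m)[[1]], base = 16L)
  stopifnot(length(b) == 6)

  # Undo the three-byte encoding of a single surrogate
  decode3 <- function(x) {
    (x[1] %% 0x10) * 0x1000 + (x[2] %% 0x40) * 0x40 + x[3] %% 0x40
  }
  hi <- decode3(b[1:3])
  lo <- decode3(b[4:6])

  # Same arithmetic as hilo2unicode above
  cp <- (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
  sprintf("U+%X", as.integer(cp))
}

print(tags_to_codepoint("<ed><a0><bd><ed><b2><af>"))
```

The resulting string has the same shape as the codes in the scraped table, so it could serve as a join key, under the assumption that the table's codes are normalised to the same `U+XXXXX` format.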

PS2.: I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.

PS3.: To replace the emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS on a graph of emojis, because some emojis are encoded as the concatenation of other, simpler ones. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is the invisible zero-width joiner (U+200D), <e2><99><82> is the male sign, and <ef><b8><8f> is the invisible variation selector-16 (U+FE0F). While "man cartwheeling" and "person cartwheeling male sign" are obviously semantically related, I prefer the more faithful translation.
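A simpler stand-in for the graph search is to replace the longest dictionary keys first, so that a composite emoji is matched as a whole before any of its parts. This is not the DFS approach itself, only a sketch of the same goal; the two-entry dictionary and the name `replace_emoji` are made up for illustration:

```r
# Hypothetical two-entry dictionary keyed by the <xx> byte notation
emoji_dict <- c(
  "<f0><9f><a4><b8>" = "person cartwheeling",
  "<f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f>" = "man cartwheeling"
)

# Replace longer keys first so a sequence is not broken into its parts.
replace_emoji <- function(text, dict) {
  keys <- names(dict)[order(nchar(names(dict)), decreasing = TRUE)]
  for (k in keys) {
    text <- gsub(k, dict[[k]], text, fixed = TRUE)
  }
  text
}

tweet <- "nice <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> !"
result <- replace_emoji(tweet, emoji_dict)
print(result)
```

With a full dictionary this loops over every key for every tweet, so it trades speed for simplicity.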
