R: Changing the encoding of a column in a data frame?


Question

I am trying to change the encoding of a column in a data frame.

library(stringi)  # provides stri_enc_mark()
stri_enc_mark(data_updated$text)
#   [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
#  [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
#  [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
#  [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

When I try to convert it, no error is thrown, but the conversion has no effect on the vector:

d <- enc2utf8(data_updated$text)
stri_enc_mark(d)
#   [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
#  [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
#  [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
#  [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

Any suggestions?

I am on Windows 7, 32-bit. Adding a data snippet.

> Encoding(data_updated$text[1:35])
 [1] "UTF-8"   "unknown" "unknown" "UTF-8"   "unknown" "unknown" "UTF-8"  
 [8] "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"  
[15] "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown"
[22] "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown"
[29] "unknown" "UTF-8"   "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"

The data looks like this:

> data_updated$text[1:35]
 [1] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
 [2] "Deal Talks for Here Mapping Service Expose Reliance on Location Data, via @nytimes #mapping #dilemma  http://t.co/wGdiS5OlRq"                      
 [3] "http://t.co/UZIyX1Rk7W The popping linksexploaded!! http://t.co/KpNntm1dH7 :) http://t.co/oku91uVxZ8"                                              
 [4] "RT @davidsunaria90: Wtch LIVE Mjlis Now\n http://t.co/GXNhe3eY7Y\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/YewOVcz8bb\n…" 
 [5] "Reliance Jio Infocomm: Indian carrier raises $750 million loan for 4G rollout  http://t.co/B2aWlkmwXz"                                             
 [6] "RT @SurjeetInsan: Majlis started in Sirsa Ashram.\nLive @ http://t.co/PR6W5tzZes\nIVR Airtel 55252\nReliance 56300403\n\n#MSGPlsSaveTheEarth"      
 [7] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Techno… http://t.co/kyxTYIxks5"      
 [8] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
 [9] "RT @jaameinsan: Watch LIVE Majlis Now\n http://t.co/nPQegnLXPa\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/txXMtw3zFP\n#M…" 
[10] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Technology"

These are tweets, and I think the "http://" links are dictating the encoding here, given that they contain strings like "wGdiS5OlRq". For analysis I had removed these tags using regular expressions, but to store the raw data in a database I need the full tweets. MongoDB has no problem with them, but an RDBMS throws errors.

Recommended answer

It appears that we can use the iconv() function to convert the encoding after converting the vector to a factor and then back to a character vector. It is a bit strange, to be honest.
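
A minimal sketch of that approach, under some assumptions: the vector `text` below is a hypothetical stand-in for `data_updated$text`, and the source encoding is assumed to be UTF-8. It also illustrates a documented property of R that may explain the output in the question: strings containing only ASCII characters are never marked with a declared encoding, so `enc2utf8()` leaves them as "unknown" even though they are already valid UTF-8.

```r
# Hypothetical stand-in for data_updated$text: one pure-ASCII tweet and one
# containing a non-ASCII character (an ellipsis, written as a Unicode escape).
text <- c("Reliance Jio Infocomm: http://t.co/B2aWlkmwXz",
          "RT @satpalpandey: Majlis started\u2026")

# Inspect the declared encodings before and after enc2utf8().
# Pure-ASCII elements carry no encoding mark either way.
before <- Encoding(text)
after  <- Encoding(enc2utf8(text))
print(before)
print(after)

# ASCII is a subset of UTF-8, so every element can still be checked
# for UTF-8 validity regardless of its mark (validUTF8() needs R >= 3.3.0).
print(validUTF8(text))

# The round trip described in the answer: character -> factor -> character,
# followed by iconv(). The "from" encoding is an assumption; adjust it to
# whatever the data actually uses.
converted <- iconv(as.character(as.factor(text)), from = "UTF-8", to = "UTF-8")
print(validUTF8(converted))
```

If `validUTF8()` returns TRUE for every element, the mixed "ASCII"/"UTF-8" marks are probably not the cause of the database errors, and the connection or column character set on the RDBMS side may be worth checking instead.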
