How to properly dput internationalized text?

Question

I have a bunch of author names from foreign countries in a CSV, which R reads in just fine. I'm trying to clean them for upload to Mechanical Turk (which really doesn't like even a single internationalized character). In doing so, I ran into a question (to be posted later), but I can't even dput them in a sensible way:

> dput(df[306,"primauthfirstname"])
"Gwena\xeblle M"
> test <- "Gwena\xeblle M"
<simpleError in nchar(val): invalid multibyte string 1>

In other words, dput works just fine, but pasting the result back in fails. Why doesn't dput output the information needed to allow copy/pasting back into R (presumably all it needs to do is add the encoding attributes to a structure statement)? How do I get it to do so?

Note that \xeb is a valid character as far as R is concerned:

> gsub("\xeb","", turk.df[306,"primauthfirstname"] )
[1] "Gwenalle M"

But you can't evaluate the characters individually; it's the hex code \x## or nothing:

> gsub("\\x","", turk.df[306,"primauthfirstname"] )
[1] "Gwena\xeblle M"

Recommended answer

dput()'s help page says: "Writes an ASCII text representation of an R object". So if your object contains non-ASCII characters, they cannot be represented as-is and have to be converted somehow.
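Before converting anything, it can help to see which entries actually contain non-ASCII bytes. The following is a minimal sketch, not part of the original answer: the `authors` vector is made-up sample data, and the input is assumed to be Latin-1. It relies on iconv() returning NA for strings it cannot convert when `sub` is left at its default.

```r
# Hypothetical sample data; the second entry holds the Latin-1 byte 0xEB.
authors <- c("John Smith", "Gwena\xeblle M", "Ana Lopez")

# With the default sub = NA, iconv() returns NA for any string that
# cannot be represented in the target encoding (here: ASCII).
needs_conversion <- is.na(iconv(authors, from = "latin1", to = "ASCII"))

# Indices of the entries that would need cleaning before dput()/upload.
which(needs_conversion)
```

Only the flagged entries need any of the conversions described here; pure-ASCII names pass through dput() unchanged.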

So I would suggest using iconv() to convert your vector before calling dput(). One approach is:

> test <- "Gwena\xeblle M"
> out <- iconv(test, from="latin1", to="ASCII", sub="byte")
> out
[1] "Gwena<eb>lle M"
> gsub('<eb>', 'ë', out)
[1] "Gwenaëlle M"

As you can see, this works both ways: you can later use gsub() to convert the bytes back into characters (if your encoding supports it, e.g. UTF-8).
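The single gsub() call above handles one specific byte. As a rough generalization, the sketch below restores every `<xx>` token produced by `sub="byte"`. The helper name `restore_bytes` is made up for illustration, and the approach assumes the original data was Latin-1, where each byte value equals the Unicode code point of the character. A string that legitimately contains text such as `<eb>` would be converted too, so treat this as a starting point only.

```r
# Hypothetical helper: turn "<xx>" tokens (as written by iconv(..., sub = "byte"))
# back into characters, assuming the bytes were Latin-1.
restore_bytes <- function(x) {
  m <- gregexpr("<[0-9a-f]{2}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Strip the angle brackets and read the two hex digits as an integer.
    codes <- strtoi(substr(tokens, 2, 3), base = 16L)
    # In Latin-1 the byte value is also the Unicode code point.
    intToUtf8(codes, multiple = TRUE)
  })
  x
}

original <- "Gwena\xeblle M"
ascii    <- iconv(original, from = "latin1", to = "ASCII", sub = "byte")
restored <- restore_bytes(ascii)
```

The intermediate `ascii` value is plain ASCII, so its dput() output can be pasted back into R, and `restore_bytes()` can be applied afterwards.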

The second approach is simpler (and, I guess, preferable for your needs), but it only works one way, and your libiconv may not support it:

> test <- "Gwena\xeblle M"
> iconv(test, from="latin1", to="ASCII//TRANSLIT")
[1] "Gwenaelle M"
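Since `//TRANSLIT` support depends on the platform's iconv implementation, a defensive wrapper may be useful. The sketch below is only an illustration: `to_ascii` is a made-up name, the input is assumed to be Latin-1, and the exact transliterated output can differ between systems. Where transliteration fails or is unsupported, it falls back to the `sub="byte"` approach.

```r
# Hypothetical wrapper: try transliteration first, fall back to <xx> byte escapes.
to_ascii <- function(x, from = "latin1") {
  # Some iconv implementations raise an error for unsupported conversions;
  # treat that the same as "could not convert".
  out <- tryCatch(
    iconv(x, from = from, to = "ASCII//TRANSLIT"),
    error = function(e) rep(NA_character_, length(x))
  )
  # Entries that were not NA on input but came back NA need the fallback.
  failed <- is.na(out) & !is.na(x)
  out[failed] <- iconv(x[failed], from = from, to = "ASCII", sub = "byte")
  out
}

names_raw   <- c("Gwena\xeblle M", "John Smith")
names_ascii <- to_ascii(names_raw)
```

Either way, the result contains only ASCII characters, which is what the upload step requires.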

Hope this helps!
