Why is stringr changing encoding when manipulating strings?


Problem description


stringr has a strange behavior that is really annoying me: without any warning, it changes the encoding of some strings that contain non-ASCII characters, in my case ø, å, æ, é and some others. If you str_trim a character vector, the elements containing such letters are marked with a new Encoding.

library(stringr)

letter1 <- readline('Gimme an ASCII character!')    # try q or a
letter2 <- readline('Gimme a non-ASCII character!') # try ø or é
Letters <- c(letter1, letter2)
Encoding(Letters)           # 'unknown'
Encoding(str_trim(Letters)) # mixed 'unknown' and 'UTF-8'

This is a problem because I use data.table for (fast) merges of big tables, and data.table does not support mixed encodings. I also could not find a way to get back to a uniform encoding.

Any work-around?

EDIT: I thought I could fall back on the base functions, but they don't protect the encoding either. paste preserves it, but sub, for instance, does not.

 Encoding(paste(' ', Letters))                 # 'unknown'
 Encoding(str_c(' ', Letters))                 # mixed
 Encoding(sub('^ +', '', paste(' ', Letters))) # mixed

Solution

R doesn’t always make it easy to convert between encodings (there’s the function iconv for that, but which encodings it accepts is platform dependent). However, at the very least you can always reset the encoding marking of a string to "unknown":

Letters = str_trim(Letters)
Encoding(Letters)
# [1] "unknown" "UTF-8"
Encoding(Letters) = ''
Encoding(Letters)
# [1] "unknown" "unknown"

However, note that this only changes the encoding mark of a string; it doesn’t actually re-encode the string. As a consequence, this can lead to garbled data. As mentioned in the comments, this is at best a hack, not an actual fix for the problem.
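
To make the difference between marking and re-encoding concrete, here is a minimal sketch using only base R. The variable names are illustrative, and it assumes that iconv on your platform can convert from UTF-8 to latin1, which is common but not guaranteed.

```r
# The two bytes that encode "é" in UTF-8. Building the string from raw bytes
# keeps the example independent of the encoding of the source file.
x <- rawToChar(as.raw(c(0xc3, 0xa9)))

# Changing the encoding mark leaves the underlying bytes untouched.
marked <- x
Encoding(marked) <- "UTF-8"
same_bytes <- identical(charToRaw(marked), charToRaw(x))

# iconv() actually re-encodes: the bytes themselves change.
# Assumption: "latin1" is among the encodings reported by iconvlist() here.
recoded <- iconv(marked, from = "UTF-8", to = "latin1")
recoded_bytes <- charToRaw(recoded)
```

Here same_bytes is TRUE, whereas recoded_bytes holds the single byte e9 when the conversion succeeds. Resetting the mark on a string whose bytes are not valid in the native encoding is what produces the garbled output mentioned above.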

Encoding exemplifies R’s trouble working properly with encodings. The documentation says:

ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings.

… which is obviously not helpful at all (and also more than a bit misleading: a UTF-8 string consisting only of code points < 128 may look indistinguishable from an ASCII string, but operating on it should yield different results depending on the encoding, which is why it should effectively be marked).

Interestingly, neither enc2native nor enc2utf8 will do the desired thing here: both will yield different encoding marks for the two strings in Letters, a direct consequence of the Encoding problem cited above.
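
A small base R sketch of this effect follows. The vector is built with intToUtf8 rather than readline so that the example does not depend on the console or file encoding; that substitution is an assumption of this illustration, not part of the original question. The result of enc2native depends on the locale, so only enc2utf8 is shown.

```r
# One ASCII element and one non-ASCII element ("é", code point 233).
Letters <- c("q", intToUtf8(233))

# enc2utf8() converts to UTF-8, but the pure-ASCII element stays unmarked.
utf8_marks <- Encoding(enc2utf8(Letters))
utf8_marks
# [1] "unknown" "UTF-8"

# Even an explicit attempt to mark an ASCII string is silently ignored.
ascii <- "q"
Encoding(ascii) <- "UTF-8"
Encoding(ascii)
# [1] "unknown"
```

So a vector that mixes ASCII and non-ASCII strings will keep reporting mixed marks after conversion, even though every element is then valid UTF-8.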
