Unescape unicode in character string

Problem description

There is a long-standing bug in RJSONIO when parsing json strings containing unicode escape sequences. It seems the bug needs to be fixed in libjson, which might not happen any time soon, so I am looking to create a workaround in R that unescapes \uxxxx sequences before feeding them to the json parser.

Some context: json data is always unicode, using utf-8 by default, so there is generally no need for escaping. But for historical reasons, json does support escaped unicode. Hence the json data

{"x" : "Zürich"}

and

{"x" : "Z\u00FCrich"}

are equivalent and should result in exactly the same output when parsed. But for whatever reason, the latter doesn't work in RJSONIO. Additional confusion is caused by the fact that R itself supports escaped unicode as well. So when we type "Z\u00FCrich" in an R console, R's own parser converts it to "Zürich" before any json code sees it. To get the actual json string, we need to escape the backslash itself, which is the first character of the unicode escape sequence in json:

test <- '{"x" : "Z\\u00FCrich"}'
cat(test)
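
To make the difference between the two levels of escaping concrete, here is a small sketch in base R. The variable names are only illustrative; the character counts follow from how R parses string literals.

```r
# "\u00FC" is resolved by R's parser, so the string holds the character ü.
parsed_by_r <- "Z\u00FCrich"

# "\\u00FC" is a literal backslash followed by u00FC, which is what a json
# document containing an escaped ü actually looks like.
json_level <- "Z\\u00FCrich"

nchar(parsed_by_r)  # 6
nchar(json_level)   # 11
cat(json_level)     # prints Z\u00FCrich
```

Only the second form still contains an escape sequence for an unescape function to process.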

So my question is: given a large json string in R, how can I unescape all escaped unicode sequences? I.e. how do I replace all occurrences of \uxxxx by the corresponding unicode character? Again, the \uxxxx here represents an actual string of 6 characters, starting with a backslash. So an unescape function should satisfy:

#Escaped string
escaped <- "Z\\u00FCrich"

#Unescape unicode
unescape(escaped) == "Zürich"

#This is the same thing
unescape(escaped) == "Z\u00FCrich"

One thing that might complicate things is that if the backslash itself is escaped in json with another backslash, it is not part of the unicode escape sequence. E.g. unescape should also satisfy:

#Watch out for escaped backslashes
unescape("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape("Z\\\\\\u00FCrich") == "Z\\\\ürich"

Recommended answer

After playing with this some more, I think the best I can do is to search for \uxxxx patterns using a regular expression and then parse the matches using the R parser:

unescape_unicode <- function(x){
  #single string only
  stopifnot(is.character(x) && length(x) == 1)

  #find matches: a run of backslashes, then 'u' and four hex digits
  m <- gregexpr("(\\\\)+u[0-9a-f]{4}", x, ignore.case = TRUE)

  if(m[[1]][1] > -1){
    #parse matches
    p <- vapply(regmatches(x, m)[[1]], function(txt){
      gsub("\\", "\\\\", parse(text=paste0('"', txt, '"'))[[1]], fixed = TRUE, useBytes = TRUE)
    }, character(1), USE.NAMES = FALSE)

    #substitute parsed into original
    regmatches(x, m) <- list(p)
  }

  x
}

This seems to work for all cases, and I haven't found any odd side effects yet.
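
For comparison, here is a sketch of a different technique that does not go through R's parser: it counts the backslashes in front of each match and decodes the hex digits by hand with `strtoi()` and `intToUtf8()`. The name `unescape_unicode_alt` is hypothetical and this is not the answer's method; it also assumes each \uxxxx stands alone, so json surrogate pairs for characters outside the Basic Multilingual Plane are not handled.

```r
# Hypothetical alternative: decode \uXXXX manually instead of calling parse().
unescape_unicode_alt <- function(x) {
  stopifnot(is.character(x), length(x) == 1)

  # A run of backslashes followed by 'u' and four hex digits.
  m <- gregexpr("\\\\+u[0-9a-fA-F]{4}", x)
  hits <- regmatches(x, m)[[1]]
  if (length(hits) == 0) return(x)

  decoded <- vapply(hits, function(txt) {
    n_slash <- nchar(txt) - 5L            # backslashes before 'uXXXX'
    if (n_slash %% 2L == 0L) return(txt)  # even run: the backslash is escaped
    hex <- substr(txt, n_slash + 2L, n_slash + 5L)
    # Keep the preceding escaped backslashes, replace the final \uXXXX.
    paste0(strrep("\\", n_slash - 1L), intToUtf8(strtoi(hex, 16L)))
  }, character(1), USE.NAMES = FALSE)

  regmatches(x, m) <- list(decoded)
  x
}

# The three requirements stated in the question.
unescape_unicode_alt("Z\\u00FCrich") == "Z\u00FCrich"
unescape_unicode_alt("Z\\\\u00FCrich") == "Z\\\\u00FCrich"
unescape_unicode_alt("Z\\\\\\u00FCrich") == "Z\\\\\u00FCrich"
```

The parity check is what distinguishes a real escape sequence from an escaped backslash that merely happens to be followed by `u` and four hex digits.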
