Swift 4 base64 String to Data not working due to String containing "incomplete" emoji


Question

I am coming from this post Swift 4 JSON String with unknown UTF8 "�" character is not convertible to Data/ Dictionary but meanwhile I was able to isolate the issue to a 10-character-string.

Short intro: one user's app did not show any content. Looking at his 6 KB of data in plain text with TextWrangler, I found two red question marks.

I tried to cut some chunks of the base64-encoded data around the question marks and convert them to Data which didn't work. As soon as I removed the bits from the red question mark from the chunks it seemed to work again. Please take a look at my following Playground example:

//those do NOT work
let toEndBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9AF0A" // *USA* ' <"}]//
let toMidBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9"     // *USA* ' <"}//
let toCarrot =     "ACAAKgBVAFMAQQAqACAnlgAg2DwA"         // *USA* ' <//
let toSpace =      "ACAAKgBVAFMAQQAqACAnlgAg"             // *USA* ' //

//but this one WORKS
let toApostrophe = "ACAAKgBVAFMAQQAqACAn"                 // *USA* '//
//(basically the last one is without the space before the carrot, I've added the slashes after it to emphasize that)
//clear strings taken from https://www.base64decode.org/ using the UTF-8 setting WITHOUT "Live mode".

if let textData = Data(base64Encoded: toApostrophe) {
    print("Data created")   //works for all of them
    print(textData)
    if let decodedString = String(data: textData, encoding: .utf8) {
        print("WORKED!!!")  //only happens for the toApostrophe
        print(decodedString)
    } else {
        print("DID NOT WORK")
    }
}

So it basically fails as soon as it contains lgAg. Replacing this with something like U29t does make the small strings work again, but I can't do this in production code as I am sure my examples aren't the only occurrences of this issue. I don't care what happens to the original characters/symbols/emojis that are causing this; if there was a way to just "ignore" them, that would be more than helpful already!

Here is another example of where this occurs:

//OTHER SYMBOL WITH SAME BEHAVIOR
//not working
let secondFromSpace =  "ACDYPAAiACwA"       // <",//

//WORKING
let secondFromCarrot = "PAAiACwA"           //<",//

Here is the original text in its habitat, a messenger message saying "USA" with an emoji hence the "USA" in my examples texts and my suspicion it's the emojis that make it break:

I'd be grateful if someone can tell me how I can "clean up" the base64 string so it's convertible to data again. It might also be due to some weird encoding with some of the emojis but for the very most cases, the app receives and displays content with emojis just fine.

I have finally figured out why this is happening. It's not a swift-side solution to my problem but now it makes at least some sense. For previews of new content I cut off strings to match the viewport of the browser. This particular unlucky user has had the USA flag emoji on the edge of the display bezel. Never would I have thought of emojis consisting of multiple letters and JavaScript's substring() decapitating them. Take a look at the picture, this explains where the character comes from etc.
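To make the truncation concrete, here is a minimal Swift sketch of what such a cut does. It is an illustration only, not the original JavaScript: it assumes the cut landed right after the first UTF-16 code unit of the flag, which is the unit substring() counts in. The variable names are made up for this example.

```swift
// The USA flag is one visible character, built from two regional-indicator
// scalars (U+1F1FA, U+1F1F8), each of which needs a UTF-16 surrogate pair.
let flag = "\u{1F1FA}\u{1F1F8}"
let units = Array(flag.utf16)

print(flag.count, flag.unicodeScalars.count, units.count)   // 1 2 4
print(units.map { String($0, radix: 16) })                  // ["d83c", "ddfa", "d83c", "ddf8"]

// Keep only the first code unit, as a cut by UTF-16 index would.
let truncated = Array(units.prefix(1))

// A lone high surrogate is ill-formed; the standard library repairs it to U+FFFD.
let repaired = String(decoding: truncated, as: UTF16.self)
print(repaired.unicodeScalars.map { String($0.value, radix: 16) })   // ["fffd"]
```

The leftover 0xD83C is the same d8 3c byte pair that appears in the failing base64 chunks.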

I would still appreciate an answer as to how to avoid/ignore/catch that in Swift but to every poor soul running into this issue I hope you will stumble across this thread.

Recommended answer

(Some of this is out of comments, but trying to bring it together and describe solutions.)

First, your strings are not UTF-8. They're UTF-16 or malformed UTF-16. Sometimes UTF-16 happens to be interpretable as UTF-8, but when it is, there will be NULL characters scattered through the string. In your "working" example, it's not really working.

let toApostrophe = "ACAAKgBVAFMAQQAqACAn"                 // *USA* '//
if let textData = Data(base64Encoded: toApostrophe) {
    if let decodedString = String(data: textData, encoding: .utf8) {
        print(decodedString)
        print(decodedString.count)
        print(decodedString.map { $0.unicodeScalars.map { $0.value } } )
    } else {
        print("DID NOT DECODE UTF8")
    }
} else {
    print("DID NOT DECODE BASE64")
}

Prints:

 *USA* '
15
[[0], [32], [0], [42], [0], [85], [0], [83], [0], [65], [0], [42], [0], [32], [39]]

Note that the length of string is 15 characters, not 8 like you were probably expecting. That's because it includes an extra invisible NULL (0) between most characters.

toEndBracket doesn't happen to be legal UTF-8, however. Here are its bytes:

["00", "20", "00", "2a", "00", "55", "00", "53", "00", "41", "00", "2a", "00", "20", "27", "96", "00", "20", "d8", "3c", "00", "22", "00", "7d", "00", "5d", "00"]

This is OK until it gets to 0x96. That byte starts with the bits 10, which marks it as a continuation byte, but the byte before it (0x27) is plain ASCII, so there is no multi-byte sequence for it to continue. Later, 0xd8 starts with the bits 110, which indicates the start of a two-byte sequence, but the next byte is 0x3c, which is not a valid second byte of a multi-byte sequence (it should start with 10, but it starts with 00). So we can't decode this as UTF-8. Even using decodeCString(_:as:repairingInvalidCodeUnits:) can't decode this string, because it's filled with embedded NULLs and a C string ends at the first NULL. You've got to decode it using at least the right encoding.
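As a rough illustration of the bit patterns involved, the following sketch classifies a few of the bytes listed above by their leading bits. The helper name utf8Role is made up for this example, and it only inspects the bit pattern; it is not a full UTF-8 validator.

```swift
// Hypothetical helper: describes the role a byte's leading bits give it in UTF-8.
func utf8Role(_ byte: UInt8) -> String {
    switch byte {
    case 0x00...0x7F: return "ASCII"                    // 0xxxxxxx
    case 0x80...0xBF: return "continuation"             // 10xxxxxx
    case 0xC0...0xDF: return "lead of 2-byte sequence"  // 110xxxxx
    case 0xE0...0xEF: return "lead of 3-byte sequence"  // 1110xxxx
    case 0xF0...0xF7: return "lead of 4-byte sequence"  // 11110xxx
    default:          return "never valid"              // 11111xxx
    }
}

// Bytes taken from the dump of toEndBracket.
let suspects: [UInt8] = [0x27, 0x96, 0xd8, 0x3c]
for byte in suspects {
    print(String(byte, radix: 16), utf8Role(byte))
}
```

A continuation byte is only legal directly after a lead byte (or another continuation byte of the same sequence), and a lead byte must be followed by the right number of continuation bytes.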

But let's do that. Decode as UTF-16. At least that's close, even though it's slightly invalid UTF-16.

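// Decode the base64 first; force-unwrapping is only acceptable in a Playground
let toEndBracketData = Data(base64Encoded: toEndBracket)!
let toEndBracket16 = String(data: toEndBracketData, encoding: .utf16)!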
// " *USA* ➖ �"}]"

Now we can at least work with this. It's invalid JSON, though. So we can strip that by filtering it:

let legalJSON = String(toEndBracket16.filter { $0 != "\u{FFFD}" })
// " *USA* ➖ "}]"

I don't really recommend this approach. It's incredibly fragile and based on broken input. Fix the input. But in a world where you're trying to parse broken input, these are the tools.
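One possible refinement, offered as a sketch only: filtering by Character can miss a replacement character that has merged with a following combining mark into a single grapheme, so a variant that filters Unicode scalars may be a little more robust. The function name strippingReplacementCharacters is made up for this example, and the same caveat about broken input applies.

```swift
/// Hypothetical helper: removes every U+FFFD scalar from `text`.
func strippingReplacementCharacters(_ text: String) -> String {
    var scalars = String.UnicodeScalarView()
    // Compare scalars rather than Characters so grapheme clustering cannot hide a U+FFFD.
    scalars.append(contentsOf: text.unicodeScalars.filter { $0 != "\u{FFFD}" })
    return String(scalars)
}

let cleaned = strippingReplacementCharacters(" *USA* \u{2796} \u{FFFD}\"}]")
print(cleaned)
```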
