What is a safe way to turn streamed (utf8) data into a string?

Question

Suppose I'm a server written in objc/swift. The client is sending me a large amount of data, which is really one large UTF-8 encoded string. As the server, I have my NSInputStream firing events to say it has data to read. I grab the data and build up a string with it.

However, what if the next chunk of data I get falls on an unfortunate position in the UTF-8 data, such as in the middle of a composed character? It seems like appending a chunk of non-compliant UTF-8 would mess the string up.

What is a suitable way to deal with this? I was thinking I could just keep the data as an NSData, but then I don't have any way to know when the data has finished being received (think HTTP, where the length of the data is in the header).

Thanks for any ideas.

Recommended answer

The tool you probably want to use here is the Swift standard library's UTF8 codec type. It will handle all the state issues for you. See How to cast decrypted UInt8 to String? for a simple example that you can likely adapt.

The major concern in building up a string from UTF-8 data isn't composed characters, but rather multi-byte characters. "LATIN SMALL LETTER A" + "COMBINING GRAVE ACCENT" works fine even if you decode each of those characters separately. What doesn't work is gathering the first byte of 你 (which takes three bytes in UTF-8), decoding it, and then appending the decoded second byte. The UTF8 type will handle this for you, though. All you need to do is bridge your NSInputStream to a GeneratorType.
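To make the distinction concrete, here is a minimal sketch. It assumes current Swift (5.x) rather than the Swift 2 used in the rest of this answer, and the variable names are only illustrative. It decodes the bytes of 你 in two separate pieces, then in one pass, and finally decodes a letter and a combining accent separately.

```swift
// Assumption: current Swift (5.x); the answer's own code targets Swift 2.
let bytes = Array("你".utf8)  // one character, three UTF-8 bytes

// Decoding the two chunks separately: neither piece is valid UTF-8 on its
// own, so each one is repaired with U+FFFD replacement characters.
let first = String(decoding: bytes[..<1], as: UTF8.self)
let rest = String(decoding: bytes[1...], as: UTF8.self)
let broken = first + rest

// Decoding the same bytes in a single pass gives the intended character.
let whole = String(decoding: bytes, as: UTF8.self)

// A base letter and a combining mark are each complete scalars, so decoding
// them separately and concatenating afterwards is harmless.
let letter = String(decoding: Array("a".utf8), as: UTF8.self)
let accent = String(decoding: Array("\u{0300}".utf8), as: UTF8.self)
let composed = letter + accent

print(broken == whole)  // prints "false"
print(composed.count)   // prints "1"
```

The split that matters is therefore a split between bytes of one scalar, not a split between scalars of one visible character.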

Here's a basic (not fully production-ready) example of what I'm talking about. First, we need a way to convert an NSInputStream into a generator. That's probably the hardest part:

final class StreamGenerator {
    static let bufferSize = 1024
    let stream: NSInputStream
    var buffer = [UInt8](count: StreamGenerator.bufferSize, repeatedValue: 0)
    var buffGen = IndexingGenerator<ArraySlice<UInt8>>([])

    init(stream: NSInputStream) {
        self.stream = stream
        stream.open()
    }
}

extension StreamGenerator: GeneratorType {
    func next() -> UInt8? {
        // First see if we can feed from our buffer. Doing this before the
        // status check ensures bytes that were already read are not dropped
        // if the stream has since reached its end.
        if let result = buffGen.next() {
            return result
        }

        // Check the stream status
        switch stream.streamStatus {
        case .NotOpen:
            assertionFailure("Cannot read unopened stream")
            return nil
        case .Writing:
            preconditionFailure("Impossible status")
        case .AtEnd, .Closed, .Error:
            return nil // FIXME: May want a closure to post errors
        case .Opening, .Open, .Reading:
            break
        }

        // Our buffer is empty. Block until there is at least one byte available.
        // Use count, not capacity: capacity may exceed the initialized storage.
        let count = stream.read(&buffer, maxLength: buffer.count)

        if count <= 0 { // FIXME: Probably want a closure or something to handle error cases
            stream.close()
            return nil
        }

        buffGen = buffer.prefix(count).generate()
        return buffGen.next()
    }
}

Calls to next() can block here, so it should not be called on the main queue, but other than that, it's a standard Generator that spits out bytes. (This is also the piece that probably has lots of little corner cases that I'm not handling, so you want to think this through pretty carefully. Still, it's not that complicated.)
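For readers on current Swift, the same idea can be sketched with IteratorProtocol and InputStream, the present-day names for GeneratorType and NSInputStream. This is an assumption-laden sketch rather than a drop-in replacement: StreamByteIterator and its buffer size are hypothetical, and error reporting is still left out.

```swift
import Foundation

/// Hypothetical current-Swift counterpart of StreamGenerator.
final class StreamByteIterator: IteratorProtocol {
    private let stream: InputStream
    private var buffer = [UInt8](repeating: 0, count: 1024)
    private var filled = 0      // number of valid bytes currently in buffer
    private var position = 0    // index of the next byte to hand out
    private var finished = false

    init(stream: InputStream) {
        self.stream = stream
        if stream.streamStatus == .notOpen { stream.open() }
    }

    func next() -> UInt8? {
        if finished { return nil }
        if position == filled {
            // Buffer exhausted: this read blocks until at least one byte is
            // available, then returns however many bytes it could get.
            let capacity = buffer.count
            let count = stream.read(&buffer, maxLength: capacity)
            guard count > 0 else {  // 0 means end of stream, negative means error
                finished = true
                stream.close()
                return nil
            }
            filled = count
            position = 0
        }
        defer { position += 1 }
        return buffer[position]
    }
}

// Usage with an in-memory stream standing in for a network stream.
let payload = Array("héllo".utf8)
let byteIterator = StreamByteIterator(stream: InputStream(data: Data(payload)))
var collected: [UInt8] = []
while let byte = byteIterator.next() { collected.append(byte) }
print(collected == payload)
```

As with the original, next() can block, so it belongs on a background queue.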

With that, creating a UTF-8 decoding generator is almost trivial:

final class UnicodeScalarGenerator<ByteGenerator: GeneratorType where ByteGenerator.Element == UInt8> {
    var byteGenerator: ByteGenerator
    var utf8 = UTF8()
    init(byteGenerator: ByteGenerator) {
        self.byteGenerator = byteGenerator
    }
}

extension UnicodeScalarGenerator: GeneratorType {
    func next() -> UnicodeScalar? {
        switch utf8.decode(&byteGenerator) {
        case .Result(let scalar): return scalar
        case .EmptyInput: return nil
        case .Error: return nil // FIXME: Probably want a closure or something to handle error cases
        }
    }
}

You could of course trivially turn this into a CharacterGenerator instead (using Character(_:UnicodeScalar)).
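The reason this handles chunk boundaries is that the decoder pulls bytes from one continuous iterator, so it never sees where one chunk ended and the next began. The sketch below illustrates that in current Swift (5.x, an assumption, as the answer itself is Swift 2). ChunkedBytes and decodeScalars are hypothetical helpers, with ChunkedBytes standing in for a stream that delivers data in arbitrary pieces.

```swift
/// Hypothetical helper: walks the bytes of several chunks as one sequence.
struct ChunkedBytes: IteratorProtocol {
    private var chunks: IndexingIterator<[[UInt8]]>
    private var current: IndexingIterator<[UInt8]>

    init(_ chunks: [[UInt8]]) {
        self.chunks = chunks.makeIterator()
        self.current = [UInt8]().makeIterator()
    }

    mutating func next() -> UInt8? {
        while true {
            if let byte = current.next() { return byte }
            guard let chunk = chunks.next() else { return nil }
            current = chunk.makeIterator()
        }
    }
}

/// Decodes scalars until the input is empty or invalid, mirroring the
/// answer's behavior of stopping on an error.
func decodeScalars<I: IteratorProtocol>(_ input: I) -> [Unicode.Scalar]
    where I.Element == UInt8 {
    var input = input
    var decoder = UTF8()
    var scalars: [Unicode.Scalar] = []
    while case .scalarValue(let scalar) = decoder.decode(&input) {
        scalars.append(scalar)
    }
    return scalars
}

// Split the six bytes of "你好" so that both characters straddle a boundary.
let all = Array("你好".utf8)
let chunks = [Array(all[0..<1]), Array(all[1..<4]), Array(all[4...])]

var text = ""
text.unicodeScalars.append(contentsOf: decodeScalars(ChunkedBytes(chunks)))
print(text)  // prints "你好"
```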

The last problem is if you want to combine all combining marks, such that "LATIN SMALL LETTER A" followed by "COMBINING GRAVE ACCENT" would always be returned together (rather than as the two characters they are). That's actually a bit trickier than it sounds. First, you'd need to generate Strings, not Characters. And then you'd need a good way to know what all the combining characters are. That's certainly knowable, but I'm having a little trouble deriving a simple algorithm. There's no "combiningMarkCharacterSet" in Cocoa. I'm still thinking about it. Getting something that "mostly works" is easy, but I'm not sure yet how to build it so that it's correct for all of Unicode.
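One possible direction, offered as a sketch and not as a verified solution: in current Swift (5.x, which postdates the code above), String segments its contents into extended grapheme clusters on its own, so the list of combining characters does not have to be maintained by hand. The hypothetical CharacterAssembler below buffers scalars in a String and releases a Character only once a following scalar has started a new cluster. It assumes that a cluster boundary, once it appears, is not removed by scalars that arrive later.

```swift
/// Hypothetical helper: groups incoming scalars into whole Characters.
struct CharacterAssembler {
    private var pending = ""

    /// Adds one scalar; returns a Character once it is known to be complete.
    mutating func append(_ scalar: Unicode.Scalar) -> Character? {
        pending.unicodeScalars.append(scalar)
        // While the new scalar still combines with what came before, the
        // count stays at 1 and nothing is released yet.
        return pending.count > 1 ? pending.removeFirst() : nil
    }

    /// Call at end of input to release the last buffered Character, if any.
    mutating func finish() -> Character? {
        return pending.isEmpty ? nil : pending.removeFirst()
    }
}

var assembler = CharacterAssembler()
var characters: [Character] = []
for scalar in "a\u{0300}b".unicodeScalars {
    if let character = assembler.append(scalar) { characters.append(character) }
}
if let last = assembler.finish() { characters.append(last) }

print(characters.count)  // prints "2"
```

The first Character released here carries both the letter and its accent, which is the "always returned together" behavior described above.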

Here's a little sample program to try it out:

    let textPath = NSBundle.mainBundle().pathForResource("text.txt", ofType: nil)!
    let inputStream = NSInputStream(fileAtPath: textPath)!
    inputStream.open()

    dispatch_async(dispatch_get_global_queue(0, 0)) {
        let streamGen = StreamGenerator(stream: inputStream)
        let unicodeGen = UnicodeScalarGenerator(byteGenerator: streamGen)
        var string = ""
        for c in GeneratorSequence(unicodeGen) {
            print(c)
            string += String(c)
        }
        print(string)
    }

And some text to read:


Here is some normalish álfa你好 text
And some Zalgo i̝̲̲̗̹̼n͕͓̘v͇̠͈͕̻̹̫͡o̷͚͍̙͖ke̛̘̜̘͓̖̱̬ composed stuff
And one more line with no newline

(That second line is some Zalgo encoded text, which is nice for testing.)

I haven't done any testing with this in a real blocking situation, like reading from the network, but it should work based on how NSInputStream works (i.e. it should block until there's at least one byte to read, but then should just fill the buffer with whatever's available).

I've made all of this conform to GeneratorType so that it plugs into other things easily, but error handling might work better if you created your own protocol with next() throws -> Self.Element instead of using GeneratorType. Throwing would make it easier to propagate errors up the stack, but would make it harder to plug into for...in loops.
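A rough sketch of what such a throwing protocol might look like, written for current Swift (5.x) as an assumption. ThrowingIterator, UTF8StreamError, ThrowingScalarIterator and collectString are all hypothetical names, and next() returns an optional here so that end of input can still be signalled with nil.

```swift
/// Hypothetical protocol: like IteratorProtocol, but next() may throw.
protocol ThrowingIterator {
    associatedtype Element
    mutating func next() throws -> Element?
}

enum UTF8StreamError: Error {
    case invalidByteSequence
}

/// Decodes scalars from a byte iterator and throws on malformed UTF-8
/// instead of silently ending the sequence.
struct ThrowingScalarIterator<Bytes: IteratorProtocol>: ThrowingIterator
    where Bytes.Element == UInt8 {
    private var bytes: Bytes
    private var decoder = UTF8()

    init(bytes: Bytes) { self.bytes = bytes }

    mutating func next() throws -> Unicode.Scalar? {
        switch decoder.decode(&bytes) {
        case .scalarValue(let scalar): return scalar
        case .emptyInput: return nil
        case .error: throw UTF8StreamError.invalidByteSequence
        }
    }
}

/// Without for...in support, callers drive the iterator with a while loop.
func collectString<I: ThrowingIterator>(_ iterator: I) throws -> String
    where I.Element == Unicode.Scalar {
    var iterator = iterator
    var result = ""
    while let scalar = try iterator.next() {
        result.unicodeScalars.append(scalar)
    }
    return result
}

do {
    let invalid: [UInt8] = [0x61, 0xFF]  // "a" followed by a byte that is never valid UTF-8
    _ = try collectString(ThrowingScalarIterator(bytes: invalid.makeIterator()))
} catch {
    print("decoding failed: \(error)")
}
```

The while loop in collectString is the cost mentioned above: the error surfaces cleanly, but the convenience of for...in is lost.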
