Parsing concatenated, non-delimited XML messages from a TCP stream using C#

Question

I am trying to parse XML messages that are sent to my C# application over TCP. Unfortunately, the protocol cannot be changed, the XML messages are not delimited, and no length prefix is used. Moreover, the character encoding is not fixed, but each message starts with an XML declaration (<?xml ... ?>). The question is: how can I read one XML message at a time using C#?

So far, I have tried reading the data from the TCP stream into a byte array and consuming it through a MemoryStream. The problem is that the buffer might contain more than one XML message, or the first message may be incomplete. In these cases I get an exception when trying to parse it with XmlReader.Read or XmlDocument.Load, but unfortunately the XmlException does not really let me distinguish the two problems (short of parsing the localized error string).
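To make the ambiguity concrete, here is a minimal sketch (the class and method names are made up for illustration). A truncated message and two concatenated messages both surface as the same exception type, so the type alone does not say whether to wait for more bytes.

```csharp
using System;
using System.Xml;

static class AmbiguitySketch {
    // Returns true if loading the text as a single XML document
    // throws an XmlException, false if it loads cleanly.
    public static bool FailsToLoad(string xml) {
        try {
            new XmlDocument().LoadXml(xml);
            return false;
        } catch (XmlException) {
            return true;
        }
    }
}
```

Calling FailsToLoad("<a><b>text") (incomplete) and FailsToLoad("<a/><b/>") (two messages in one buffer) both report a failure, while FailsToLoad("<a/>") does not.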

I tried using XmlReader.Read and counting the Element and EndElement nodes. That way I know when I have finished reading the first complete XML message.
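The counting idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the input is already a decoded string rather than raw bytes, and the names DepthCountSketch and CountElementsOfFirstMessage are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

static class DepthCountSketch {
    // Reads nodes until the root element is closed and returns the number
    // of elements seen in that first message. Reading stops at the root's
    // end, so anything that follows it is not parsed.
    public static int CountElementsOfFirstMessage(string xml) {
        int depth = 0;     // number of currently open elements
        int elements = 0;  // elements seen so far
        using (XmlReader reader = XmlReader.Create(new StringReader(xml))) {
            while (reader.Read()) {
                if (reader.NodeType == XmlNodeType.Element) {
                    elements++;
                    if (reader.IsEmptyElement) {
                        // An empty root such as <a/> has no EndElement node.
                        if (depth == 0) return elements;
                    } else {
                        depth++;
                    }
                } else if (reader.NodeType == XmlNodeType.EndElement) {
                    if (--depth == 0) return elements;
                }
            }
        }
        // Normally not reached: Read() throws an XmlException on truncated input.
        return -1;
    }
}
```

Because the loop returns as soon as the root closes, the reader never moves on to a second message in the same buffer. The open problem is what to do when Read() throws before that point.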

However, there are several problems. If the buffer does not yet contain the entire message, how can I distinguish the resulting XmlException from one caused by an actually invalid, non-well-formed message? In other words, if an exception is thrown before the root EndElement has been read, how can I decide whether to abort the connection with an error or to collect more bytes from the TCP stream?

If no exception occurs, the XmlReader is positioned at the start of the root EndElement. Casting the XmlReader to IXmlLineInfo gives me the current LineNumber and LinePosition, but it is not straightforward to get the byte position where the EndElement really ends. To do that, I would have to convert the byte array into a string (using the encoding specified in the XML declaration), seek to LineNumber, LinePosition, and convert that back to a byte offset. I tried to do that with StreamReader.ReadLine, but the stream reader gives no public access to the current byte position.

All this seems very inelegant and fragile. I wonder if you have ideas for a better solution. Thank you.

Recommended answer

After looking around for some time, I think I can answer my own question as follows (I might be wrong; corrections are welcome):

  • I found no way to make the XmlReader continue parsing a second XML message (at least not if the second message has an XmlDeclaration). XmlTextReader.ResetState could do something similar, but for that I would have to assume the same encoding for all messages. Therefore I could not connect the XmlReader directly to the TCP stream.

  • After closing the XmlReader, the underlying stream is not positioned at the reader's last position. So it is not possible to close the reader and use a new one to continue with the next message. I guess the reason is that the reader cannot seek on every possible input stream.

  • When the XmlReader throws an exception, it cannot be determined whether it happened because of a premature EOF or because of non-well-formed XML. XmlReader.EOF is not set in case of an exception. As a workaround I derived my own stream class (EofMemoryStream in the code below), which hands out the very last byte of the buffer in a separate read. This way I know that the XmlReader was really interested in the last byte, and that the exception that follows is likely due to a truncated message. This is kind of sloppy, in that it might not detect every non-well-formed message right away. However, after more bytes are appended to the buffer, the error will be detected sooner or later.

  • I can cast my XmlReader to the IXmlLineInfo interface, which gives access to the LineNumber and LinePosition of the current node. So after reading the first message I remember these positions and use them to truncate the buffer. Here comes the really sloppy part, because I have to use the character encoding to get the byte position. I am sure you could find test cases where the code below breaks (e.g. internal elements with mixed encoding), but so far it has worked in all my tests.
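The position arithmetic from the last point can be isolated into two small helpers. This is a hedged sketch rather than the code used in the class: the names are hypothetical, line and column are 1-based as reported by IXmlLineInfo, and the text is assumed to have been decoded from a buffer without a byte-order mark.

```csharp
using System;
using System.Text;

static class OffsetSketch {
    // Converts a 1-based (line, column) pair into a 0-based character index.
    // Treats \r, \n and \r\n each as a single line break.
    public static int LineColToCharIndex(string text, int line, int column) {
        int index = 0;
        for (int l = 1; l < line; l++) {
            int nl = text.IndexOfAny(new[] { '\r', '\n' }, index);
            if (nl < 0) throw new ArgumentOutOfRangeException(nameof(line));
            index = nl + 1;
            // Swallow the \n of a \r\n pair.
            if (text[nl] == '\r' && index < text.Length && text[index] == '\n') index++;
        }
        return index + column - 1;
    }

    // Number of bytes the first charIndex characters occupy in the given encoding.
    public static int CharIndexToByteOffset(string text, int charIndex, Encoding encoding) {
        return encoding.GetByteCount(text.ToCharArray(), 0, charIndex);
    }
}
```

For example, the first eight characters of "<a>é</a><b/>" occupy 9 bytes in UTF-8 but 16 bytes in UTF-16, which is why the encoding from the XML declaration has to be known before the buffer can be cut.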

Here is the parser class I came up with. It may be useful (I know it is very far from perfect...).

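using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

// Note: Message is the application's own message type (not shown here).
// It must provide a static default_encoding and a doc field.
class XmlParser {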

    private byte[] buffer = new byte[0];

    public int Length { 
        get {
            return buffer.Length;
        }
    }

    // Append new binary data to the internal data buffer...
    public XmlParser Append(byte[] buffer2) {
        if (buffer2 != null && buffer2.Length > 0) {
            // I know, it's not an efficient way to do this.
            // The EofMemoryStream should handle a List<byte[]> ...
            byte[] new_buffer = new byte[buffer.Length + buffer2.Length];
            buffer.CopyTo(new_buffer, 0);
            buffer2.CopyTo(new_buffer, buffer.Length);
            buffer = new_buffer;
        }
        return this;
    }

    // MemoryStream which returns the last byte of the buffer individually,
    // so that we know that the buffering XmlReader really looked at the last
    // byte of the stream.
    // Moreover there is an EOF marker, set once a read returns no more data.
    private class EofMemoryStream: Stream {
        public bool EOF { get; private set; }
        private MemoryStream mem_;

        public override bool CanSeek {
            get {
                return false;
            }
        }
        public override bool CanWrite {
            get {
                return false;
            }
        }
        public override bool CanRead {
            get {
                return true;
            }
        }
        public override long Length {
            get { 
                return mem_.Length; 
            }
        }
        public override long Position {
            get {
                return mem_.Position;
            }
            set {
                throw new NotSupportedException();
            }
        }
        public override void Flush() {
            mem_.Flush();
        }
        public override long Seek(long offset, SeekOrigin origin) {
            throw new NotSupportedException();
        }
        public override void SetLength(long value) {
            throw new NotSupportedException();
        }
        public override void Write(byte[] buffer, int offset, int count) {
            throw new NotSupportedException();
        }
        public override int Read(byte[] buffer, int offset, int count) {
            count = Math.Min(count, Math.Max(1, (int)(Length - Position - 1)));
            int nread = mem_.Read(buffer, offset, count);
            if (nread == 0) {
                EOF = true;
            }
            return nread;
        }
        public EofMemoryStream(byte[] buffer) {
            mem_ = new MemoryStream(buffer, false);
            EOF = false;
        }
        protected override void Dispose(bool disposing) {
            if (disposing) {
                mem_.Dispose();
            }
            base.Dispose(disposing);
        }

    }

    // Parses the first xml message from the stream.
    // If the first message is not yet complete, it returns null.
    // If the buffer contains non-wellformed xml, it ~should~ throw an exception.
    // After reading an xml message, it pops the data from the byte array.
    public Message deserialize() {
        if (buffer.Length == 0) {
            return null;
        }
        Message message = null;

        Encoding encoding = Message.default_encoding;
        //string xml = encoding.GetString(buffer);

        using (EofMemoryStream sbuffer = new EofMemoryStream (buffer)) {

            XmlDocument xmlDocument = null;
            XmlReaderSettings settings = new XmlReaderSettings();

            int LineNumber = -1;
            int LinePosition = -1;
            bool truncate_buffer = false;

            using (XmlReader xmlReader = XmlReader.Create(sbuffer, settings)) {
                try {
                    // Read to the first node (skipping over some node types).
                    // Don't use MoveToContent here, because it would skip the
                    // XmlDeclaration too...
                    while (xmlReader.Read() &&
                           (xmlReader.NodeType==XmlNodeType.Whitespace || 
                            xmlReader.NodeType==XmlNodeType.Comment)) {
                    };

                    // Check for XML declaration.
                    // If the message has an XmlDeclaration, extract the encoding.
                    switch (xmlReader.NodeType) {
                        case XmlNodeType.XmlDeclaration: 
                            while (xmlReader.MoveToNextAttribute()) {
                                if (xmlReader.Name == "encoding") {
                                    encoding = Encoding.GetEncoding(xmlReader.Value);
                                }
                            }
                            xmlReader.MoveToContent();
                            xmlReader.Read();
                            break;
                    }

                    // Move to the first element.
                    xmlReader.MoveToContent();

                    if (xmlReader.EOF) {
                        return null;
                    }

                    // Read the entire document.
                    xmlDocument = new XmlDocument();
                    xmlDocument.Load(xmlReader.ReadSubtree());
                } catch (XmlException) {
                    // The parsing of the xml failed. If the XmlReader has already
                    // read past the last byte, the message is assumed to be
                    // incomplete and null is returned. Otherwise the XML is
                    // assumed to be invalid and the exception is re-thrown.
                    if (sbuffer.EOF) {
                        return null;
                    }
                    throw; // rethrow without resetting the stack trace
                }

                {
                    // Try to deserialize into an internal data structure using XmlSerializer.
                    Type type = null;
                    try {
                        type = Type.GetType("my.namespace." + xmlDocument.DocumentElement.Name);
                    } catch (Exception e) {
                        // No specialized data container for this class found...
                    }
                    if (type == null) {
                        message = new Message();
                    } else {
                        // TODO: reuse the serializer...
                        System.Xml.Serialization.XmlSerializer ser = new System.Xml.Serialization.XmlSerializer(type);
                        message = (Message)ser.Deserialize(new XmlNodeReader(xmlDocument));
                    }
                    message.doc = xmlDocument;
                }

                // At this point, the first XML message was successfully parsed.

                // Remember the lineposition of the current end element.
                IXmlLineInfo xmlLineInfo = xmlReader as IXmlLineInfo;
                if (xmlLineInfo != null && xmlLineInfo.HasLineInfo()) {
                    LineNumber = xmlLineInfo.LineNumber;
                    LinePosition = xmlLineInfo.LinePosition;
                }


                // Try to read the rest of the buffer.
                // If an exception is thrown, another xml message follows.
                // This is how the xml parser tells us that the first message ends here.
                // A cleaner signal would be preferable, because truncating the buffer
                // using the line info is sloppy.
                try {
                    while (xmlReader.Read()) {
                    }
                } catch {
                    // A second message follows. Needs the workaround for truncating.
                    truncate_buffer = true;
                }
            }
            if (truncate_buffer) {
                if (LineNumber < 0) {
                    throw new Exception("LineNumber not given. Cannot truncate xml buffer");
                }
                // Convert the buffer to a string using the encoding found before 
                // (or the default encoding).
                string s = encoding.GetString(buffer);

                // Seek to the line.
                int char_index = 0;
                while (--LineNumber > 0) {
                    // Recognize \r , \n , \r\n as newlines...
                    char_index = s.IndexOfAny(new char[] {'\r', '\n'}, char_index);
                    // char_index should not be -1 because LineNumber>0; otherwise an
                    // IndexOutOfRangeException is thrown below, which is appropriate.
                    char_index++;
                    if (s[char_index-1]=='\r' && s.Length>char_index && s[char_index]=='\n') {
                        char_index++;
                    }
                }
                char_index += LinePosition - 1;

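                // Escape the element name, since characters such as '.' are regex metacharacters.
                var rgx = new System.Text.RegularExpressions.Regex(System.Text.RegularExpressions.Regex.Escape(xmlDocument.DocumentElement.Name) + "[ \r\n\t]*\\>");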
                System.Text.RegularExpressions.Match match = rgx.Match(s, char_index);
                if (!match.Success || match.Index != char_index) {
                    throw new Exception("could not find EndElement to truncate the xml buffer.");
                }
                char_index += match.Value.Length;

                // Convert the character offset back to the byte offset (for the given encoding).
                int line1_boffset = encoding.GetByteCount(s.Substring(0, char_index));

                // remove the bytes from the buffer.
                buffer = buffer.Skip(line1_boffset).ToArray();
            } else {
                buffer = new byte[0];
            }
        }
        return message;
    }
}
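
A receive loop around such a parser might look roughly like the sketch below. It is deliberately decoupled from XmlParser and Message so that it stands alone: Pump and its delegate parameters are hypothetical names, and the chunk size of 4096 bytes is an arbitrary assumption.

```csharp
using System;
using System.IO;

static class ReceiveLoopSketch {
    // Reads chunks from the stream until it ends. Each chunk is handed to
    // append, then tryTakeMessage is called repeatedly until it reports
    // that no further complete message is available.
    public static void Pump(Stream stream, Action<byte[]> append,
                            Func<bool> tryTakeMessage, int chunkSize = 4096) {
        byte[] chunk = new byte[chunkSize];
        int n;
        while ((n = stream.Read(chunk, 0, chunk.Length)) > 0) {
            // Copy only the bytes actually received in this read.
            byte[] received = new byte[n];
            Array.Copy(chunk, received, n);
            append(received);
            // One chunk may complete several messages, so drain them all.
            while (tryTakeMessage()) { }
        }
    }
}
```

With the class above, append could be b => parser.Append(b), and tryTakeMessage could call parser.deserialize(), process the result, and return whether it was non-null. An XmlException escaping from deserialize() would then be the signal to abort the connection.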
