How do I serialize a large graph of .NET objects into a SQL Server BLOB without creating a large buffer?


Problem description

We have code like this:

ms = New IO.MemoryStream
bin = New System.Runtime.Serialization.Formatters.Binary.BinaryFormatter
bin.Serialize(ms, largeGraphOfObjects)
dataToSaveToDatabase = ms.ToArray()
' put dataToSaveToDatabase in a SQL Server BLOB

But the memory stream allocates a large buffer from the large object heap, and that is giving us problems. So how can we stream the data without needing enough free memory to hold the serialized objects?

I am looking for a way to get a Stream from SQL Server that can then be passed to bin.Serialize(), so avoiding keeping all the data in my process's memory.

The same applies to reading the data back...

More background information:

This is part of a complex numerical processing system that processes data in near real time, looking for equipment problems and the like. The serialization is done to allow a restart when there is a problem with data quality from a data feed, etc. (We store the data feeds and can rerun them after the operator has edited out bad values.)

Therefore we serialize the objects a lot more often than we deserialize them.

The objects we are serializing include very large arrays, mostly of doubles, as well as a lot of small, "more normal" objects. We are pushing the memory limit on 32-bit systems and making the garbage collector work very hard. (Efforts are being made elsewhere in the system to improve this, e.g. reusing large arrays rather than creating new ones.)

Often the serialization of the state is the last straw that causes an out-of-memory exception; the peak of our memory usage is always during this serialization step.

I think we get large object heap fragmentation when we deserialize the object, and I expect there are also other problems with large object heap fragmentation given the size of the arrays. (This has not yet been investigated, as the person who first looked at this is a numerical-processing expert, not a memory-management expert.)

Our customers use a mix of SQL Server 2000, 2005 and 2008, and we would rather not have different code paths for each version of SQL Server if possible.

We can have many active models at a time (in different processes, across many machines), and each model can have many saved states. Hence the saved state is stored in a database BLOB rather than in a file.

As the speed of saving the state is important, I would rather not serialize the object to a file and then put the file into the BLOB one block at a time.

Other related questions I have asked

Recommended answer

There is no built-in ADO.NET functionality to handle this really gracefully for large data. The problem is twofold:

  • There is no API to 'write' into a SQL command or parameter as into a stream. The parameter types that accept a stream (like FileStream) accept the stream in order to READ from it, which does not agree with the serialization semantics of writing into a stream. No matter which way you turn this, you end up with an in-memory copy of the entire serialized object, which is bad.
  • Even if the point above could be solved (and it cannot be), the TDS protocol and the way SQL Server accepts parameters do not work well with large parameters, as the entire request has to be received before it is launched into execution, and this would create additional copies of the object inside SQL Server.

So you really have to approach this from a different angle. Fortunately, there is a fairly easy solution. The trick is to use the highly efficient UPDATE ... .WRITE syntax and pass in the chunks of data one by one, in a series of T-SQL statements. This is the MSDN-recommended way, see Modifying Large-Value (max) Data in ADO.NET. This looks complicated, but is actually trivial to do and to plug into a Stream class.
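
In essence, every chunk after the first is appended with one small UPDATE. A minimal sketch of a single append is shown below (this is not from the original answer; it assumes the usual System.Data and System.Data.SqlClient usings, an open connection and transaction, and the dbo.Uploads table defined further down; as the answer notes below, the very first chunk needs a plain SET instead, because .WRITE fails while the column is still NULL):

// Sketch only: appends one chunk to an existing varbinary(max) value.
// Passing NULL as the .WRITE offset appends at the end of the current value.
static void AppendChunk(SqlConnection connection, SqlTransaction transaction,
                        int id, byte[] chunk)
{
    using (SqlCommand cmd = new SqlCommand(
        "UPDATE dbo.Uploads SET FileData.WRITE(@chunk, NULL, NULL) WHERE Id = @id",
        connection, transaction))
    {
        cmd.Parameters.Add("@chunk", SqlDbType.VarBinary, -1).Value = chunk;
        cmd.Parameters.AddWithValue("@id", id);
        cmd.ExecuteNonQuery();
    }
}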

The BlobStream class

This is the bread and butter of the solution: a Stream-derived class that implements the Write method as a call to the T-SQL BLOB WRITE syntax. It is straightforward; the only interesting thing about it is that it has to keep track of the first update, because the UPDATE ... SET blob.WRITE(...) syntax would fail on a NULL field:

class BlobStream : Stream
{
    private SqlCommand cmdAppendChunk;
    private SqlCommand cmdFirstChunk;
    private SqlConnection connection;
    private SqlTransaction transaction;

    private SqlParameter paramChunk;
    private SqlParameter paramLength;

    private long offset;

    public BlobStream(
        SqlConnection connection,
        SqlTransaction transaction,
        string schemaName,
        string tableName,
        string blobColumn,
        string keyColumn,
        object keyValue)
    {
        this.transaction = transaction;
        this.connection = connection;
        cmdFirstChunk = new SqlCommand(String.Format(@"
UPDATE [{0}].[{1}]
    SET [{2}] = @firstChunk
    WHERE [{3}] = @key"
            ,schemaName, tableName, blobColumn, keyColumn)
            , connection, transaction);
        cmdFirstChunk.Parameters.AddWithValue("@key", keyValue);
        cmdAppendChunk = new SqlCommand(String.Format(@"
UPDATE [{0}].[{1}]
    SET [{2}].WRITE(@chunk, NULL, NULL)
    WHERE [{3}] = @key"
            , schemaName, tableName, blobColumn, keyColumn)
            , connection, transaction);
        cmdAppendChunk.Parameters.AddWithValue("@key", keyValue);
        paramChunk = new SqlParameter("@chunk", SqlDbType.VarBinary, -1);
        cmdAppendChunk.Parameters.Add(paramChunk);
    }

    public override void Write(byte[] buffer, int index, int count)
    {
        byte[] bytesToWrite = buffer;
        if (index != 0 || count != buffer.Length)
        {
            bytesToWrite = new MemoryStream(buffer, index, count).ToArray();
        }
        if (offset == 0)
        {
            cmdFirstChunk.Parameters.AddWithValue("@firstChunk", bytesToWrite);
            cmdFirstChunk.ExecuteNonQuery();
            offset = count;
        }
        else
        {
            paramChunk.Value = bytesToWrite;
            cmdAppendChunk.ExecuteNonQuery();
            offset += count;
        }
    }

    // Rest of the abstract Stream implementation
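    // (Sketch, not part of the original answer: one possible write-only
    // implementation of the remaining abstract members, assuming the stream
    // is non-seekable and is used only for writing.)
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return offset; } }
    public override long Position
    {
        get { return offset; }
        set { throw new NotSupportedException(); }
    }
    public override void Flush()
    {
        // Nothing to flush locally: every Write issues its own UPDATE.
    }
    public override int Read(byte[] buffer, int index, int count)
    {
        throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }
    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }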
 }


Using the BlobStream

To use this newly created BLOB stream class you plug it into a BufferedStream. The class has a trivial design that handles only writing the stream into a column of a table. I'll reuse a table from another example:

CREATE TABLE [dbo].[Uploads](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [FileName] [varchar](256) NULL,
    [ContentType] [varchar](256) NULL,
    [FileData] [varbinary](max) NULL)

I'll add a dummy object to be serialized:

[Serializable]
class HugeSerialized
{
    public byte[] theBigArray { get; set; }
}

Finally, the actual serialization. We'll first insert a new record into the Uploads table, then create a BlobStream on the newly inserted Id and call the serialization straight into this stream:

using (SqlConnection conn = new SqlConnection(Settings.Default.connString))
{
    conn.Open();
    using (SqlTransaction trn = conn.BeginTransaction())
    {
        SqlCommand cmdInsert = new SqlCommand(
@"INSERT INTO dbo.Uploads (FileName, ContentType)
VALUES (@fileName, @contentType);
SET @id = SCOPE_IDENTITY();", conn, trn);
        cmdInsert.Parameters.AddWithValue("@fileName", "Demo");
        cmdInsert.Parameters.AddWithValue("@contentType", "application/octet-stream");
        SqlParameter paramId = new SqlParameter("@id", SqlDbType.Int);
        paramId.Direction = ParameterDirection.Output;
        cmdInsert.Parameters.Add(paramId);
        cmdInsert.ExecuteNonQuery();

        BlobStream blob = new BlobStream(
            conn, trn, "dbo", "Uploads", "FileData", "Id", paramId.Value);
        BufferedStream bufferedBlob = new BufferedStream(blob, 8040);

        HugeSerialized big = new HugeSerialized { theBigArray = new byte[1024 * 1024] };
        BinaryFormatter bf = new BinaryFormatter();
        bf.Serialize(bufferedBlob, big);
        // Flush the BufferedStream so the last partial chunk reaches the BlobStream before committing
        bufferedBlob.Flush();

        trn.Commit();
    }
}


If you monitor the execution of this simple sample, you'll see that nowhere is a large serialization buffer created. The sample allocates a [1024*1024] array, but that is only for demo purposes, to have something to serialize. The code serializes in a buffered manner, chunk by chunk, using the SQL Server recommended BLOB update size of 8040 bytes at a time.
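
The question also asked about reading the data back. The answer above covers only the write path, but the same chunked idea works in reverse: read the column with CommandBehavior.SequentialAccess and pull it out in pieces with SqlDataReader.GetBytes, feeding a read-side Stream wrapper (the mirror image of BlobStream) to BinaryFormatter.Deserialize. A rough sketch of the chunked read loop, reusing the table and connection string from the example above (not part of the original answer):

// Sketch only: reads the FileData column back in 8040-byte chunks without
// ever materializing the whole BLOB in memory.
static void ReadBlobBack(int id)
{
    using (SqlConnection conn = new SqlConnection(Settings.Default.connString))
    {
        conn.Open();
        SqlCommand cmd = new SqlCommand(
            "SELECT FileData FROM dbo.Uploads WHERE Id = @id", conn);
        cmd.Parameters.AddWithValue("@id", id);

        // SequentialAccess streams the column instead of buffering the whole row
        using (SqlDataReader reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
        {
            if (reader.Read())
            {
                byte[] chunk = new byte[8040];
                long fieldOffset = 0;
                long bytesRead;
                while ((bytesRead = reader.GetBytes(0, fieldOffset, chunk, 0, chunk.Length)) > 0)
                {
                    // Hand chunk[0..bytesRead) to the consumer, e.g. a read-side
                    // Stream wrapper that BinaryFormatter.Deserialize pulls from.
                    fieldOffset += bytesRead;
                }
            }
        }
    }
}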
