How do I serialize a large graph of .NET objects into a SQL Server BLOB without creating a large buffer?
There is no built-in ADO.NET functionality to handle this gracefully for large data. The problem is twofold:
- There is no API to 'write' into a SQL command or parameter as into a stream. The parameter types that accept a stream (like FileStream) accept the stream in order to READ from it, which does not agree with the serialization semantics of writing into a stream. No matter which way you turn this, you end up with an in-memory copy of the entire serialized object, which is bad.
- Even if the point above were solved (and it cannot be), the TDS protocol and the way SQL Server accepts parameters do not work well with large parameters, as the entire request has to be received before it is launched into execution, and this would create additional copies of the object inside SQL Server.
So you really have to approach this from a different angle. Fortunately, there is a fairly easy solution. The trick is to use the highly efficient UPDATE ... .WRITE syntax and pass in the chunks of data one by one, in a series of T-SQL statements. This is the MSDN-recommended way; see Modifying Large-Value (max) Data in ADO.NET. It looks complicated, but is actually trivial to do and to plug into a Stream class.
The BlobStream class
This is the bread and butter of the solution: a Stream-derived class that implements the Write method as a call to the T-SQL BLOB WRITE syntax. Straightforward; the only interesting thing about it is that it has to keep track of the first update, because the UPDATE ... SET blob.WRITE(...) syntax would fail on a NULL field:
class BlobStream : Stream
{
    private SqlCommand cmdAppendChunk;
    private SqlCommand cmdFirstChunk;
    private SqlConnection connection;
    private SqlTransaction transaction;
    private SqlParameter paramChunk;
    private SqlParameter paramLength;
    private long offset;

    public BlobStream(
        SqlConnection connection,
        SqlTransaction transaction,
        string schemaName,
        string tableName,
        string blobColumn,
        string keyColumn,
        object keyValue)
    {
        this.transaction = transaction;
        this.connection = connection;

        // First chunk: plain assignment, because .WRITE would fail on a NULL column
        cmdFirstChunk = new SqlCommand(String.Format(@"
UPDATE [{0}].[{1}]
    SET [{2}] = @firstChunk
    WHERE [{3}] = @key"
            , schemaName, tableName, blobColumn, keyColumn)
            , connection, transaction);
        cmdFirstChunk.Parameters.AddWithValue("@key", keyValue);

        // Subsequent chunks: append with the efficient .WRITE syntax
        cmdAppendChunk = new SqlCommand(String.Format(@"
UPDATE [{0}].[{1}]
    SET [{2}].WRITE(@chunk, NULL, NULL)
    WHERE [{3}] = @key"
            , schemaName, tableName, blobColumn, keyColumn)
            , connection, transaction);
        cmdAppendChunk.Parameters.AddWithValue("@key", keyValue);
        paramChunk = new SqlParameter("@chunk", SqlDbType.VarBinary, -1);
        cmdAppendChunk.Parameters.Add(paramChunk);
    }

    public override void Write(byte[] buffer, int index, int count)
    {
        // Copy only if the caller passed a slice of a larger buffer
        byte[] bytesToWrite = buffer;
        if (index != 0 || count != buffer.Length)
        {
            bytesToWrite = new MemoryStream(buffer, index, count).ToArray();
        }

        if (offset == 0)
        {
            cmdFirstChunk.Parameters.AddWithValue("@firstChunk", bytesToWrite);
            cmdFirstChunk.ExecuteNonQuery();
            offset = count;
        }
        else
        {
            paramChunk.Value = bytesToWrite;
            cmdAppendChunk.ExecuteNonQuery();
            offset += count;
        }
    }

    // Rest of the abstract Stream implementation
}
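For completeness, here is a minimal sketch of those remaining members, assuming a forward-only, write-only stream that sends each chunk immediately (these go inside the class body above):

public override bool CanRead { get { return false; } }
public override bool CanSeek { get { return false; } }
public override bool CanWrite { get { return true; } }
public override long Length { get { return offset; } }
public override long Position
{
    get { return offset; }
    set { throw new NotSupportedException(); }
}
// Nothing to flush: every Write already executed a T-SQL statement
public override void Flush() { }
public override int Read(byte[] buffer, int index, int count)
{
    throw new NotSupportedException();
}
public override long Seek(long offset, SeekOrigin origin)
{
    throw new NotSupportedException();
}
public override void SetLength(long value)
{
    throw new NotSupportedException();
}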
Using the BlobStream
To use this newly created blob stream class, you plug it into a BufferedStream. The class has a trivial design that handles only writing the stream into a column of a table. I'll reuse a table from another example:
CREATE TABLE [dbo].[Uploads](
[Id] [int] IDENTITY(1,1) NOT NULL,
[FileName] [varchar](256) NULL,
[ContentType] [varchar](256) NULL,
[FileData] [varbinary](max) NULL)
I'll add a dummy object to be serialized:
[Serializable]
class HugeSerialized
{
public byte[] theBigArray { get; set; }
}
Finally, the actual serialization. We'll first insert a new record into the Uploads table, then create a BlobStream on the newly inserted Id and call the serialization straight into this stream:
using (SqlConnection conn = new SqlConnection(Settings.Default.connString))
{
    conn.Open();
    using (SqlTransaction trn = conn.BeginTransaction())
    {
        // Insert the row first so there is a key to write the BLOB against
        SqlCommand cmdInsert = new SqlCommand(
            @"INSERT INTO dbo.Uploads (FileName, ContentType)
            VALUES (@fileName, @contentType);
            SET @id = SCOPE_IDENTITY();", conn, trn);
        cmdInsert.Parameters.AddWithValue("@fileName", "Demo");
        cmdInsert.Parameters.AddWithValue("@contentType", "application/octet-stream");
        SqlParameter paramId = new SqlParameter("@id", SqlDbType.Int);
        paramId.Direction = ParameterDirection.Output;
        cmdInsert.Parameters.Add(paramId);
        cmdInsert.ExecuteNonQuery();

        BlobStream blob = new BlobStream(
            conn, trn, "dbo", "Uploads", "FileData", "Id", paramId.Value);
        BufferedStream bufferedBlob = new BufferedStream(blob, 8040);

        HugeSerialized big = new HugeSerialized { theBigArray = new byte[1024 * 1024] };
        BinaryFormatter bf = new BinaryFormatter();
        bf.Serialize(bufferedBlob, big);

        // Flush any remaining buffered bytes before committing
        bufferedBlob.Flush();

        trn.Commit();
    }
}
If you monitor the execution of this simple sample, you'll see that nowhere is a large serialization stream created. The sample allocates the 1024*1024 array, but that is only for demo purposes, to have something to serialize. The code serializes in a buffered manner, chunk by chunk, using the SQL Server recommended BLOB update size of 8040 bytes at a time.
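Reading the object back works the same way in reverse. Here is a hedged sketch, assuming id holds the row's Id and you are on .NET 4.5, where SqlDataReader.GetStream combined with CommandBehavior.SequentialAccess streams the column instead of buffering it:

// Hedged sketch: stream the column back out and deserialize without a large buffer.
// Assumes 'conn' is open and 'id' identifies the row written above.
using (SqlCommand cmdRead = new SqlCommand(
    "SELECT FileData FROM dbo.Uploads WHERE Id = @id", conn))
{
    cmdRead.Parameters.AddWithValue("@id", id);
    using (SqlDataReader reader = cmdRead.ExecuteReader(CommandBehavior.SequentialAccess))
    {
        if (reader.Read() && !reader.IsDBNull(0))
        {
            using (Stream blobStream = reader.GetStream(0))
            {
                BinaryFormatter bf = new BinaryFormatter();
                HugeSerialized big = (HugeSerialized)bf.Deserialize(blobStream);
            }
        }
    }
}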
All you need is .NET Framework 4.5 and streaming. Let's assume we have a big file on disk that we want to upload.
SQL code:
CREATE TABLE BigFiles
(
[BigDataID] [int] IDENTITY(1,1) NOT NULL,
[Data] VARBINARY(MAX) NULL
)
C# code:
// .NET 4.5 SqlClient can stream the parameter value straight from the FileStream,
// so the file is never materialized in memory.
using (FileStream sourceStream = new FileStream(filePath, FileMode.Open))
{
    using (SqlCommand cmd = new SqlCommand(
        "UPDATE BigFiles SET Data = @Data WHERE BigDataID = @BigDataID", _sqlConn))
    {
        cmd.Parameters.AddWithValue("@Data", sourceStream);
        cmd.Parameters.AddWithValue("@BigDataID", entryId);
        cmd.ExecuteNonQuery();
    }
}
Works well for me. I have successfully uploaded a 400 MB file, while MemoryStream threw an exception when I tried to load the same file into memory.
UPD: This code works on Windows 7, but it failed on Windows XP and Server 2003.
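If AddWithValue does not infer the right type for the stream parameter, a variant that spells the type out explicitly should behave the same (VarBinary with Size = -1 maps to varbinary(max) and still streams from the FileStream); the table and variable names from above are assumed:

using (FileStream sourceStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
using (SqlCommand cmd = new SqlCommand(
    "UPDATE BigFiles SET Data = @Data WHERE BigDataID = @BigDataID", _sqlConn))
{
    // Explicit varbinary(max) parameter; the value is a Stream, so SqlClient
    // reads it in chunks rather than loading the whole file into memory.
    cmd.Parameters.Add("@Data", SqlDbType.VarBinary, -1).Value = sourceStream;
    cmd.Parameters.Add("@BigDataID", SqlDbType.Int).Value = entryId;
    cmd.ExecuteNonQuery();
}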
You can always write to SQL Server at a lower level using the over-the-wire protocol TDS (Tabular Data Stream) that Microsoft has used since day one. They are unlikely to change it any time soon, as even SQL Azure uses it!
You can see source code for how this works in the Mono project and in the FreeTDS project. Check out the tds_blob:
http://www.mono-project.com/TDS_Generic
http://www.mono-project.com/SQLClient
http://www.freetds.org/
What does the graph look like?
One problem here is the stream; the SQL 2005 requirement is a pain, as otherwise you could write directly to SqlFileStream. However, I don't think it would be too hard to write your own Stream implementation that buffers 8040 (or some multiple) bytes and writes them incrementally. That said, I'm not sure it is worth the extra complexity - I would be hugely tempted to just use a file as the scratch buffer and then (once serialized) loop over the file, inserting/appending chunks, as in the sketch below. I don't think the file system is going to hurt your overall performance here, and it saves you starting to write doomed data - i.e. you don't talk to the database until you already know what data you want to write. It also helps you minimise the time the connection is open.
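As a hedged sketch of that scratch-file route (BinaryFormatter and the dbo.Uploads table from earlier are used purely as placeholders; conn, id, and graph are assumed to exist):

// Hedged sketch: serialize to a temp file, then push it to the
// varbinary(max) column in 8040-byte chunks with UPDATE ... .WRITE.
string tempPath = Path.GetTempFileName();
using (FileStream file = new FileStream(tempPath, FileMode.Create))
{
    new BinaryFormatter().Serialize(file, graph); // or the serializer of your choice
}

// .WRITE fails on NULL, so initialize the column to an empty BLOB first
using (SqlCommand init = new SqlCommand(
    "UPDATE dbo.Uploads SET FileData = 0x WHERE Id = @id", conn))
{
    init.Parameters.AddWithValue("@id", id);
    init.ExecuteNonQuery();
}

using (FileStream file = new FileStream(tempPath, FileMode.Open, FileAccess.Read))
using (SqlCommand append = new SqlCommand(
    "UPDATE dbo.Uploads SET FileData.WRITE(@chunk, NULL, NULL) WHERE Id = @id", conn))
{
    SqlParameter chunkParam = append.Parameters.Add("@chunk", SqlDbType.VarBinary, -1);
    append.Parameters.AddWithValue("@id", id);
    byte[] buffer = new byte[8040];
    int read;
    while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
    {
        byte[] chunk = buffer;
        if (read != buffer.Length)
        {
            // Last partial chunk: trim to the bytes actually read
            chunk = new byte[read];
            Array.Copy(buffer, chunk, read);
        }
        chunkParam.Value = chunk;   // appends this chunk to the end of the BLOB
        append.ExecuteNonQuery();
    }
}
File.Delete(tempPath);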
The next problem is the serialization itself. Personally I don't recommend using BinaryFormatter to write to persistent stores (only for transport), since it is implementation-specific both in the encoder itself and in your types (i.e. it is brittle if you make innocent-looking changes to your data types).
If your data can be represented sufficiently as a tree (rather than a full graph), I would be very tempted to try protocol buffers / protobuf-net. This encoding (devised by Google) is smaller than the BinaryFormatter output, faster for both read and write, and is contract-based rather than field-based, so you can reliably rehydrate it later (even if you switch platform entirely).
The default options mean that it has to write the object length before each object (which might be expensive in your case), but if you have nested lists of large (deep) objects you can use grouped encoding to avoid this need - allowing it to write the stream in a forwards-only, single-pass way; here's a brief example using grouped encoding, but if you want to throw a more complex scenario at me, just let me know...
using System;
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class Foo {
    private readonly List<Bar> bars = new List<Bar>();
    // DataFormat.Group = grouped encoding: no up-front length prefix needed
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<Bar> Bars { get { return bars; } }
}
[ProtoContract]
public class Bar {
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public string Name { get; set; }
}
static class Program {
    static void Main() {
        var obj = new Foo {
            Bars = {
                new Bar { Id = 123, Name = "abc" },
                new Bar { Id = 456, Name = "def" },
            }
        };
        // write it and show it
        using (MemoryStream ms = new MemoryStream()) {
            Serializer.Serialize(ms, obj);
            Console.WriteLine(BitConverter.ToString(ms.ToArray()));
        }
    }
}
Note: I do have some theories on how to hack Google's wire format to support full graphs, but it is going to need some time to try. Oh, re the "very large arrays" - for primitive types (not objects) you can use "packed" encoding for this; [ProtoMember(..., Options = MemberSerializationOptions.Packed)] might be useful, but it's hard to say without visibility of your model.
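For illustration, a hypothetical packed member might look like this (the class is made up; the option comes from protobuf-net's ProtoMember attribute):

[ProtoContract]
public class Readings
{
    // Packed encoding: the repeated primitive values share one length-prefixed block
    [ProtoMember(1, Options = MemberSerializationOptions.Packed)]
    public int[] Values { get; set; }
}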