How to analyse contents of binary serialization stream?
Because it is maybe of interest for someone I decided to do this post about What does the binary format of serialized .NET objects look like and how can we interpret it correctly?
I have based all my research on the .NET Remoting: Binary Format Data Structure specification.
Example class:
To have a working example, I have created a simple class called A
which contains 2 properties, one string and one integer value, they are called SomeString
and SomeValue
.
Class A
looks like this:
[Serializable()]
public class A
{
public string SomeString
{
get;
set;
}
public int SomeValue
{
get;
set;
}
}
For the serialization I used the BinaryFormatter
of course:
BinaryFormatter bf = new BinaryFormatter();
StreamWriter sw = new StreamWriter("test.txt");
bf.Serialize(sw.BaseStream, new A() { SomeString = "abc", SomeValue = 123 });
sw.Close();
As can be seen, I passed a new instance of class A
containing abc
and 123
as values.
Example result data:
If we look at the serialized result in an hex editor, we get something like this:
Let us interpret the example result data:
According to the above mentioned specification (here is the direct link to the PDF: [MS-NRBF].pdf) every record within the stream is identified by the RecordTypeEnumeration
. Section 2.1.2.1 RecordTypeNumeration
states:
This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE.
SerializationHeaderRecord:
So if we look back at the data we got, we can start interpreting the first byte:
As stated in 2.1.2.1 RecordTypeEnumeration
a value of 0
identifies the SerializationHeaderRecord
which is specified in 2.6.1 SerializationHeaderRecord
:
The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers.
It consists of:
- RecordTypeEnum (1 byte)
- RootId (4 bytes)
- HeaderId (4 bytes)
- MajorVersion (4 bytes)
- MinorVersion (4 bytes)
With that knowledge we can interpret the record containing 17 bytes:
00
represents the RecordTypeEnumeration
which is SerializationHeaderRecord
in our case.
01 00 00 00
represents the RootId
If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream.
So in our case this should be the ObjectId
with the value 1
(because the data is serialized using little-endian) which we will hopefully see again ;-)
FF FF FF FF
represents the HeaderId
01 00 00 00
represents the MajorVersion
00 00 00 00
represents the MinorVersion
BinaryLibrary:
As specified, each record must begin with the RecordTypeEnumeration
. As the last record is complete, we must assume that a new one begins.
Let us interpret the next byte:
As we can see, in our example the SerializationHeaderRecord
it is followed by the BinaryLibrary
record:
The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name.
It consists of:
- RecordTypeEnum (1 byte)
- LibraryId (4 bytes)
- LibraryName (variable number of bytes (which is a
LengthPrefixedString
))
As stated in 2.1.1.6 LengthPrefixedString
...
The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field.
In our simple example the length is always encoded using 1 byte
. With that knowledge we can continue the interpretation of the bytes in the stream:
0C
represents the RecordTypeEnumeration
which identifies the BinaryLibrary
record.
02 00 00 00
represents the LibraryId
which is 2
in our case.
Now the LengthPrefixedString
follows:
42
represents the length information of the LengthPrefixedString
which contains the LibraryName
.
In our case the length information of 42
(decimal 66) tell's us, that we need to read the next 66 bytes and interpret them as the LibraryName
.
As already stated, the string is UTF-8
encoded, so the result of the bytes above would be something like: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
ClassWithMembersAndTypes:
Again, the record is complete so we interpret the RecordTypeEnumeration
of the next one:
05
identifies a ClassWithMembersAndTypes
record. Section 2.3.2.1 ClassWithMembersAndTypes
states:
The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class.
It consists of:
- RecordTypeEnum (1 byte)
- ClassInfo (variable number of bytes)
- MemberTypeInfo (variable number of bytes)
- LibraryId (4 bytes)
ClassInfo:
As stated in 2.3.1.1 ClassInfo
the record consists of:
- ObjectId (4 bytes)
- Name (variable number of bytes (which is again a
LengthPrefixedString
)) - MemberCount(4 bytes)
- MemberNames (which is a sequence of
LengthPrefixedString
's where the number of items MUST be equal to the value specified in theMemberCount
field.)
Back to the raw data, step by step:
01 00 00 00
represents the ObjectId
. We've already seen this one, it was specified as the RootId
in the SerializationHeaderRecord
.
0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41
represents the Name
of the class which is represented by using a LengthPrefixedString
. As mentioned, in our example the length of the string is defined with 1 byte so the first byte 0F
specifies that 15 bytes must be read and decoded using UTF-8. The result looks something like this: StackOverFlow.A
- so obviously I used StackOverFlow
as name of the namespace.
02 00 00 00
represents the MemberCount
, it tell's us that 2 members, both represented with LengthPrefixedString
's will follow.
Name of the first member:
1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64
represents the first MemberName
, 1B
is again the length of the string which is 27 bytes in length an results in something like this: <SomeString>k__BackingField
.
Name of the second member:
1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64
represents the second MemberName
, 1A
specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField
.
MemberTypeInfo:
After the ClassInfo
the MemberTypeInfo
follows.
Section 2.3.1.2 - MemberTypeInfo
states, that the structure contains:
- BinaryTypeEnums (variable in length)
A sequence of BinaryTypeEnumeration values that represents the Member Types that are being transferred. The Array MUST:
Have the same number of items as the MemberNames field of the ClassInfo structure.
Be ordered such that the BinaryTypeEnumeration corresponds to the Member name in the MemberNames field of the ClassInfo structure.
- AdditionalInfos (variable in length), depending on the
BinaryTpeEnum
additional info may or may not be present.
| BinaryTypeEnum | AdditionalInfos |
|----------------+--------------------------|
| Primitive | PrimitiveTypeEnumeration |
| String | None |
So taking that into consideration we are almost there...
We expect 2 BinaryTypeEnumeration
values (because we had 2 members in the MemberNames
).
Again, back to the raw data of the complete MemberTypeInfo
record:
01
represents the BinaryTypeEnumeration
of the first member, according to 2.1.2.2 BinaryTypeEnumeration
we can expect a String
and it is represented using a LengthPrefixedString
.
00
represents the BinaryTypeEnumeration
of the second member, and again, according to the specification, it is a Primitive
. As stated above, Primitive
's are followed by additional information, in this case a PrimitiveTypeEnumeration
. That's why we need to read the next byte, which is 08
, match it with the table stated in 2.1.2.3 PrimitiveTypeEnumeration
and be surprised to notice that we can expect an Int32
which is represented by 4 bytes, as stated in some other document about basic datatypes.
LibraryId:
After the MemerTypeInfo
the LibraryId
follows, it is represented by 4 bytes:
02 00 00 00
represents the LibraryId
which is 2.
The values:
As specified in 2.3 Class Records
:
The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure.
That's why we can now expect the values of the members.
Let us look at the last few bytes:
06
identifies an BinaryObjectString
. It represents the value of our SomeString
property (the <SomeString>k__BackingField
to be exact).
According to 2.5.7 BinaryObjectString
it contains:
- RecordTypeEnum (1 byte)
- ObjectId (4 bytes)
- Value (variable length, represented as a
LengthPrefixedString
)
So knowing that, we can clearly identify that
03 00 00 00
represents the ObjectId
.
03 61 62 63
represents the Value
where 03
is the length of the string itself and 61 62 63
are the content bytes that translate to abc
.
Hopefully you can remember that there was a second member, an Int32
. Knowing that the Int32
is represented by using 4 bytes, we can conclude, that
must be the Value
of our second member. 7B
hexadecimal equals 123
decimal which seems to fit our example code.
So here is the complete ClassWithMembersAndTypes
record:
MessageEnd:
Finally the last byte 0B
represents the MessageEnd
record.
Vasiliy is right in that I will ultimately need to implement my own formatter/serialization process to better handle versioning and to output a much more compact stream (before compression).
I did want to understand what was happening in the stream, however, so I wrote up a (relatively) quick class that does what I wanted:
- parses its way through the stream, building a collections of object names, counts and sizes
- once done, outputs a quick summary of what it found - classes, counts and total sizes in the stream
It's not useful enough for me to put it somewhere visible like codeproject, so I just dumped the project in a zip file on my website: http://www.architectshack.com/BinarySerializationAnalysis.ashx
In my specific case it turns out that the problem was twofold:
- The BinaryFormatter is VERY verbose (this is known, I just didn't realize the extent)
- I did have issues in my class, it turned out I was storing objects that I didn't want
Hope this helps someone at some point!
Update: Ian Wright contacted me with a problem with the original code, where it crashed when the source object(s) contained "decimal" values. This is now corrected, and I've used the occasion to move the code to GitHub and give it a (permissive, BSD) license.
Our application operates massive data. It can take up to 1-2 GB of RAM, like your game. We met same "storing multiple copies of the same objects" problem. Also binary serialization stores too much meta data. When it was first implemented the serialized file took about 1-2 GB. Nowadays I managed to decrease the value - 50-100 MB. What did we do.
The short answer - do not use the .Net binary serialization, create your own binary serialization mechanism. We have own BinaryFormatter class, and ISerializable interface (with two methods Serialize, Deserialize).
Same object should not be serialized more than once. We save it's unique ID and restore the object from cache.
I can share some code if you ask.
EDIT: It seems you are correct. See the following code - it proves I was wrong.
[Serializable]
public class Item
{
public string Data { get; set; }
}
[Serializable]
public class ItemHolder
{
public Item Item1 { get; set; }
public Item Item2 { get; set; }
}
public class Program
{
public static void Main(params string[] args)
{
{
Item item0 = new Item() { Data = "0000000000" };
ItemHolder holderOneInstance = new ItemHolder() { Item1 = item0, Item2 = item0 };
var fs0 = File.Create("temp-file0.txt");
var formatter0 = new BinaryFormatter();
formatter0.Serialize(fs0, holderOneInstance);
fs0.Close();
Console.WriteLine("One instance: " + new FileInfo(fs0.Name).Length); // 335
//File.Delete(fs0.Name);
}
{
Item item1 = new Item() { Data = "1111111111" };
Item item2 = new Item() { Data = "2222222222" };
ItemHolder holderTwoInstances = new ItemHolder() { Item1 = item1, Item2 = item2 };
var fs1 = File.Create("temp-file1.txt");
var formatter1 = new BinaryFormatter();
formatter1.Serialize(fs1, holderTwoInstances);
fs1.Close();
Console.WriteLine("Two instances: " + new FileInfo(fs1.Name).Length); // 360
//File.Delete(fs1.Name);
}
}
}
Looks like BinaryFormatter
uses object.Equals to find same objects.
Have you ever looked inside the generated files? If you open "temp-file0.txt" and "temp-file1.txt" from the code example you'll see it has lots of meta data. That's why I recommended you to create your own serialization mechanism.
Sorry for being cofusing.