How to implement serialization in C++

Whenever I find myself needing to serialize objects in a C++ program, I fall back to this kind of pattern:

class Serializable {
  public:
    virtual ~Serializable() {}

    static Serializable *deserialize(istream &is) {
        int id;
        is >> id;
        switch(id) {
          case EXAMPLE_ID:
            return new ExampleClass(is);
          //...
          default:
            return NULL; // unknown class ID
        }
    }

    void serialize(ostream &os) {
        os << getClassID();
        serializeMe(os);
    }

  protected:
    virtual int getClassID() = 0;
    virtual void serializeMe(ostream &os) = 0;
};

The above works pretty well in practice. However, I've heard that this kind of switching over class IDs is evil and an antipattern; what's the standard, OO-way of handling serialization in C++?


Boost.Serialization, while by no means a standard, is a (for the most part) very well written library that does the grunt work for you.

The last time I had to manually parse a predefined record structure with a clear inheritance tree, I ended up using the factory pattern with registerable classes (i.e. a map from key to a (template) creator function, rather than a big switch statement) to avoid the issue you're describing.

EDIT
A basic C++ implementation of the object factory mentioned in the paragraph above.

#include <map>

/**
* A class for creating objects, with the type of object created based on a key
* 
* @param K the key
* @param T the super class that all created classes derive from
*/
template<typename K, typename T>
class Factory { 
private: 
    typedef T *(*CreateObjectFunc)();

    /**
    * A map of keys (K) to functions (CreateObjectFunc).
    * When creating a new object, we look up the function registered
    * under the required key and call it.
    */
    std::map<K, CreateObjectFunc> mObjectCreator;

    /**
    * Pointers to this function are inserted into the map and called when creating objects
    *
    * @param S the type of class to create
    * @return an object of type S
    */
    template<typename S> 
    static T* createObject(){ 
        return new S(); 
    }
public:

    /**
    * Registers a class so that it can be created via createObject()
    *
    * @param S the class to register; this must be a subclass of T
    * @param id the id to associate with the class. This ID must be unique
    */ 
    template<typename S> 
    void registerClass(K id){ 
        if (mObjectCreator.find(id) != mObjectCreator.end()){ 
            //your error handling here
        }
        mObjectCreator.insert( std::make_pair(id, &createObject<S>) ); 
    }

    /**
    * Returns true if a given key exists
    *
    * @param id the id to check exists
    * @return true if the id exists
    */
    bool hasClass(K id){
        return mObjectCreator.find(id) != mObjectCreator.end();
    } 

    /**
    * Creates an object based on an id. It will return null if the key doesn't exist
    *
    * @param id the id of the object to create
    * @return the new object or null if the object id doesn't exist
    */
    T* createObject(K id){
        //Don't use hasClass here as doing so would involve two lookups
        typename std::map<K, CreateObjectFunc>::iterator iter = mObjectCreator.find(id); 
        if (iter == mObjectCreator.end()){ 
            return NULL;
        }
        //calls the required createObject() function
        return ((*iter).second)();
    }
};
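
For illustration, here is how the factory might be used with a small hypothetical Shape hierarchy (all the class names below are my own, not part of the factory):

#include <iostream>

// A toy hierarchy to register with the factory
class Shape {
public:
    virtual ~Shape() {}
    virtual const char* name() const = 0;
};

class Circle: public Shape {
public:
    virtual const char* name() const { return "Circle"; }
};

class Square: public Shape {
public:
    virtual const char* name() const { return "Square"; }
};

int main() {
    Factory<int, Shape> factory;
    factory.registerClass<Circle>(1);
    factory.registerClass<Square>(2);

    // In a deserializer, the id would be read from the stream
    Shape* s = factory.createObject(2);
    if (s) {
        std::cout << s->name() << std::endl; // prints "Square"
        delete s;
    }
    return 0;
}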

Serialization is a touchy topic in C++...

A quick distinction first:

  • Serialization: short-lived structure, one encoder/decoder
  • Messaging: longer life, encoders/decoders in multiple languages

Both are useful, each in its own context.

Boost.Serialization is usually the most recommended library for serialization, though the odd choice of operator&, which serializes or deserializes depending on the const-ness, strikes me as an abuse of operator overloading.
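
To make that concrete, here is a minimal Boost.Serialization sketch (the Record class and file name are my own): the same serialize() member handles both directions, with operator& acting as << on an output archive and >> on an input archive.

#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/access.hpp>
#include <boost/serialization/string.hpp>
#include <fstream>
#include <string>

class Record {
    friend class boost::serialization::access;

    // Called for both saving and loading; operator& dispatches
    // on the archive type.
    template<class Archive>
    void serialize(Archive& ar, const unsigned int version) {
        ar & mId;
        ar & mName;
    }

    int mId;
    std::string mName;
public:
    Record(): mId(0) {}
    Record(int id, const std::string& name): mId(id), mName(name) {}
};

int main() {
    {
        std::ofstream ofs("record.txt");
        boost::archive::text_oarchive oa(ofs);
        Record r(42, "example");
        oa << r;  // serialize
    }
    std::ifstream ifs("record.txt");
    boost::archive::text_iarchive ia(ifs);
    Record r;
    ia >> r;      // deserialize via the same serialize() member
    return 0;
}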

For messaging, I would rather suggest Google Protocol Buffers. It offers a clean syntax for describing messages and generates encoders and decoders for a huge variety of languages. There is also one other advantage when performance matters: it allows lazy deserialization (i.e. decoding only part of the blob at a time) by design.

Moving on

Now, as for the details of implementation, it really depends on what you wish.

  • You need versioning; even for regular serialization, you'll probably need backward compatibility with the previous version anyway.
  • You may, or may not, need a system of tag + factory. It's only necessary for polymorphic classes. And you will need one factory per inheritance tree (kind) then... the code can be templatized, of course!
  • Pointers / references are going to bite you in the ass... they reference a position in memory that changes after deserialization. I usually choose a tangent approach: each object of each kind is given an id, unique for its kind, and I serialize the id rather than a pointer (see the sketch after this list). Some frameworks handle this, as long as you don't have circular dependencies and you serialize the objects pointed to / referenced first.
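
Here is a minimal sketch of that id-instead-of-pointer approach (all the names are my own, and error handling is omitted):

#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <utility>
#include <vector>

// Each Node is given an id unique among Nodes; we serialize the id of
// the target instead of the pointer itself.
struct Node {
    unsigned int id;
    Node* next;
};

void saveNode(std::ostream& os, const Node& n) {
    unsigned int nextId = n.next ? n.next->id : 0; // 0 means "null"
    os.write(reinterpret_cast<const char*>(&n.id), sizeof(n.id));
    os.write(reinterpret_cast<const char*>(&nextId), sizeof(nextId));
}

// First pass: load every Node, remembering which id its pointer should
// eventually refer to.
Node* loadNode(std::istream& is, unsigned int& pendingNextId) {
    Node* n = new Node();
    is.read(reinterpret_cast<char*>(&n->id), sizeof(n->id));
    is.read(reinterpret_cast<char*>(&pendingNextId), sizeof(pendingNextId));
    n->next = NULL; // resolved in the second pass
    return n;
}

// Second pass: once every Node exists, resolve ids back to addresses.
void resolvePointers(std::vector< std::pair<Node*, unsigned int> >& pending,
                     std::map<unsigned int, Node*>& byId) {
    for (std::size_t i = 0; i != pending.size(); ++i)
        pending[i].first->next = byId[pending[i].second]; // id 0 maps to NULL
}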

Personally, I try as much as I can to separate the serialization / deserialization code from the actual code that runs the class. In particular, I try to isolate it in the source files, so that changes to this part of the code do not break binary compatibility.

On versioning

I usually try to keep serialization and deserialization of one version close together; it's easier to check that they are truly symmetric. I also try to abstract version handling directly into my serialization framework, plus a few other things, because DRY should be adhered to :)
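
A minimal sketch of what I mean, with assumed names: the save side writes the current version first, and the load side dispatches on it, so older streams remain readable.

#include <istream>
#include <stdexcept>

struct Foo {
    int x;
    int y; // added in version 2
};

void loadFoo(std::istream& is, Foo& f) {
    unsigned int version = 0;
    is.read(reinterpret_cast<char*>(&version), sizeof(version));
    switch (version) {
    case 1:
        is.read(reinterpret_cast<char*>(&f.x), sizeof(f.x));
        f.y = 0; // sensible default for a field that did not exist yet
        break;
    case 2:
        is.read(reinterpret_cast<char*>(&f.x), sizeof(f.x));
        is.read(reinterpret_cast<char*>(&f.y), sizeof(f.y));
        break;
    default:
        throw std::runtime_error("No such version for Foo");
    }
}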

On error-handling

To ease error detection, I usually use a pair of 'markers' (special bytes) to separate one object from the next. It allows me to throw immediately during deserialization, because I can detect a desynchronization of the stream (i.e., something ate too many bytes, or not enough).

If you want permissive deserialization, i.e. deserializing the rest of the stream even if something failed earlier, you'll have to move toward byte counts: each object is preceded by its byte count, and can only eat that many bytes (and is expected to eat them all). This approach is nice because it allows for partial deserialization: you can save the part of the stream required for an object and only deserialize it if necessary.
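
A minimal sketch of the byte-count approach (names assumed): each record is prefixed by its size, so a reader can skip a record that fails to parse, or defer deserializing one until it is actually needed.

#include <istream>
#include <string>

// Reads one length-prefixed record; the payload can be deserialized later
// (partial deserialization) or skipped entirely if it turns out corrupt.
std::string readRecord(std::istream& is) {
    unsigned int size = 0;
    is.read(reinterpret_cast<char*>(&size), sizeof(size));
    std::string payload(size, '\0');
    if (size > 0)
        is.read(&payload[0], size); // the object must eat exactly these bytes
    return payload;
}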

Tagging (your class IDs) is useful here, not (only) for dispatching, but simply to check that you are actually deserializing the right type of object. It also allows for pretty error messages.

Here are some error messages / exceptions you may wish:

  • No version X for object TYPE: only Y and Z
  • Stream is corrupted: here are the next few bytes BBBBBBBBBBBBBBBBBBB
  • TYPE (version X) was not completely deserialized
  • Trying to deserialize a TYPE1 in TYPE2

Note that, as far as I remember, both Boost.Serialization and protobuf really help with error/version handling.

protobuf has some perks too, because of its capacity for nesting messages:

  • the byte-count is naturally supported, as well as the versioning
  • you can do lazy deserialization (ie, store the message and only deserialize if someone asks for it)

The counterpart is that it's harder to handle polymorphism because of the fixed format of the messages; you have to design them carefully for that.


Serialization is unfortunately never going to be completely painless in C++, at least not for the foreseeable future, simply because C++ lacks the critical language feature that makes easy serialization possible in other languages: reflection. That is, if you create a class Foo, C++ has no mechanism to inspect the class programmatically at runtime to determine what member variables it contains.

Therefore, there is no way to create generalized serialization functions; one way or another, you have to implement a special serialization function for each class. Boost.Serialization is no different: it simply provides you with a convenient framework and a nice set of tools that help you do this.


The answer by Yacoby can be extended further.

I believe serialization can be implemented in a way similar to managed languages, if one actually implements a reflection system.

For years we've been using the automated approach.

I was one of the implementors of a working C++ postprocessor and reflection library: the LSDC tool and the Linderdaum Engine Core (iObject + RTTI + Linker/Loader). See the source at http://www.linderdaum.com

The class factory abstracts the process of class instantiation.

To initialize specific members, you might add some intrusive RTTI and autogenerate the load/save procedures for them.

Suppose you have the iObject class at the top of your hierarchy.

// Base class with intrusive RTTI
class iObject
{
public:
    virtual ~iObject() {}   // must be polymorphic for dynamic_cast to work
    iMetaClass* FMetaClass;
};

///The iMetaClass stores the list of properties and provides the Construct() method:

// List of properties
class iMetaClass: public iObject
{
public:
    virtual iObject* Construct() const = 0;
    /// List of all the properties (excluding the ones from base class)
    vector<iProperty*> FProperties;
    /// Support the hierarchy
    iMetaClass* FSuperClass;
    /// Name of the class
    string FName;
};

// The NativeMetaClass<T> template implements the Construct() method.
template <class T> class NativeMetaClass: public iMetaClass
{
public:
    virtual iObject* Construct() const
    {
        iObject* Res = new T();
        Res->FMetaClass = this;
        return Res;
    }
};

// mlNode is the representation of the markup language: xml, json or whatever else.
// The hierarchy might have come from the XML file or JSON or some custom script
class mlNode {
public:
    string FName;
    string FValue;
    vector<mlNode*> FChildren;
};

class iProperty: public iObject {
public:
    /// Name of the property (assigned during registration)
    string FName;
    /// Load the property from internal tree representation
    virtual void Load( iObject* TheObject, mlNode* Node ) const = 0;
    /// Serialize the property to some internal representation
    virtual mlNode* Save( iObject* TheObject ) const = 0;
};

/// function to save a single field
typedef mlNode* ( *SaveFunction_t )( iObject* Obj );

/// function to load a single field from mlNode
typedef void ( *LoadFunction_t )( mlNode* Node, iObject* Obj );

// The implementation for a scalar/iObject field
// The array-based property requires somewhat different implementation
// Load/Save functions are autogenerated by some tool.
class clFieldProperty : public iProperty {
public:
    clFieldProperty() {}
    virtual ~clFieldProperty() {}

    /// Load single field of an object
    virtual void Load( iObject* TheObject, mlNode* Node ) const {
        FLoadFunction(Node, TheObject);
    }
    /// Save single field of an object
    virtual mlNode* Save( iObject* TheObject ) const {
        return FSaveFunction(TheObject);
    }
public:
    // these pointers are set in property registration code
    LoadFunction_t FLoadFunction;
    SaveFunction_t FSaveFunction;
};

// The Loader class stores the list of metaclasses
class Loader: public iObject {
public:
    void RegisterMetaclass(iMetaClass* C) { FClasses[C->FName] = C; }
    iObject* CreateByName(const string& ClassName) { return FClasses[ClassName]->Construct(); }

    /// The implementation is an almost trivial iteration of all the properties
    /// in the metaclass and calling the iProperty's Load/Save methods for each field
    void LoadFromNode(mlNode* Source, iObject** Result);

    /// Create the tree-based representation of the object
    mlNode* Save(iObject* Source);

    map<string, iMetaClass*> FClasses;
};
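
A possible implementation of LoadFromNode, following the comment above; it assumes the node's FValue holds the class name (as in the script further down) and that each iProperty carries the FName assigned during registration:

void Loader::LoadFromNode(mlNode* Source, iObject** Result) {
    *Result = CreateByName(Source->FValue);
    // Walk the class and all of its ancestors
    for (iMetaClass* C = (*Result)->FMetaClass; C; C = C->FSuperClass) {
        for (size_t i = 0; i != C->FProperties.size(); ++i) {
            iProperty* P = C->FProperties[i];
            // Find the child node matching the property name and load it
            for (size_t j = 0; j != Source->FChildren.size(); ++j) {
                if (Source->FChildren[j]->FName == P->FName) {
                    P->Load(*Result, Source->FChildren[j]);
                }
            }
        }
    }
}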

When you define a ConcreteClass derived from iObject, you use some extension and the code generator tool to produce the list of save/load procedures and the registration code for it.

Let us see the code for this sample.

Somewhere in the framework we have an empty formal define:

#define PROPERTY(...)

/// vec3 is a custom type with implementation omitted for brevity
/// ConcreteClass2 is also omitted
class ConcreteClass: public iObject {
public:
    ConcreteClass(): FInt(10), FString("Default") {}

    /// Inform the tool about our properties
    PROPERTY(Name=Int, Type=int,  FieldName=FInt)
    /// We can also provide get/set accessors
    PROPERTY(Name=Pos, Type=vec3, Getter=GetPos, Setter=SetPos)
    /// And the other field
    PROPERTY(Name=Str, Type=string, FieldName=FString)
    /// And the embedded object
    PROPERTY(Name=Embedded, Type=ConcreteClass2, FieldName=FEmbedded)

    /// public field
    int FInt;
    /// public field
    string FString;
    /// public embedded object
    ConcreteClass2* FEmbedded;

    /// Getter
    vec3 GetPos() const { return FPos; }
    /// Setter
    void SetPos(const vec3& Pos) { FPos = Pos; }
private:
    vec3 FPos;
};

The autogenerated registration code would be:

/// Call this to add everything to the loader
void Register_ConcreteClass(Loader* L) {
    iMetaClass* C = new NativeMetaClass<ConcreteClass>();
    C->FName = "ConcreteClass";

    clFieldProperty* P;
    P = new clFieldProperty();
    P->FName = "Int";
    P->FLoadFunction = &Load_ConcreteClass_FInt_Field;
    P->FSaveFunction = &Save_ConcreteClass_FInt_Field;
    C->FProperties.push_back(P);
    ... same for FString and GetPos/SetPos

    C->FSuperClass = L->FClasses["iObject"];
    L->RegisterMetaclass(C);
}

// The autogenerated loaders (no error checking for brevity).
// Note the signatures match LoadFunction_t and SaveFunction_t above.
void Load_ConcreteClass_FInt_Field(mlNode* Val, iObject* Dest) {
    dynamic_cast<ConcreteClass*>(Dest)->FInt = Str2Int(Val->FValue);
}

mlNode* Save_ConcreteClass_FInt_Field(iObject* Dest) {
    mlNode* Res = new mlNode();
    Res->FValue = Int2Str( dynamic_cast<ConcreteClass*>(Dest)->FInt );
    return Res;
}
/// similar code for FString and the GetPos/SetPos pair, with obvious changes

Now, if you have the JSON-like hierarchical script

Object("ConcreteClass") {
    Int 50
    Str 10
    Pos 1.5 2.2 3.3
    Embedded("ConcreteClass2") {
        SomeProp Value
    }
}

The Loader object would resolve all the classes and properties in its Save/Load methods.

Sorry for the long post, the implementation grows even larger when all the error handling comes in.


Perhaps I am not clever, but I think that ultimately the same kind of code that you have written gets written anyway, simply because C++ doesn't have the runtime mechanisms to do anything different. The question is whether it will be written bespoke by a developer, generated via template metaprogramming (which is what I suspect Boost.Serialization does), or generated via some external tool like an IDL compiler / code generator.

The question of which of those three mechanisms to use (and maybe there are other possibilities, too) is something that should be evaluated on a per-project basis.