EntityBase2.cs (.Net 1.1) serialization

simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 24-Jul-2006 08:34:19   

This class serializes _savedFields but, during deserialization, sets it to null after deserializing it:


    _savedFields = (Hashtable)info.GetValue("_savedFields", typeof(Hashtable));

   ....

    _savedFields = null;

The .NET 2.0 version doesn't serialize (or deserialize!) it; it just sets it to null (which I think you mentioned once was the required behaviour).

Cheers Simon

Otis
LLBLGen Pro Team
Posts: 39588
Joined: 17-Aug-2003
# Posted on: 24-Jul-2006 09:14:28   

Hmm.

I think both are flawed. At first it was designed as: no savedFields should be transferred across the wire, as it's context-related data. However, in webforms, where you store entities in the viewstate for example, you could need savedFields.

Good catch. I'll file it as a bug (so the fields should be serialized).

Frans Bouma | Lead developer LLBLGen Pro
simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 24-Jul-2006 11:00:59   

Otis wrote:

Hmm.

I think both are flawed. At first it was designed as: no savedFields should be transferred across the wire, as it's context-related data. However, in webforms, where you store entities in the viewstate for example, you could need savedFields.

Good catch. I'll file it as a bug (so the fields should be serialized).

Weeeell. While you're looking at serialization smile ........

I've been looking into optimizing serialization for large-data cases:

Scenario: retrieve a collection containing 34,423 rows from a reference table with 15 columns (.NET 1.1). Retrieval speed is fantastic - around 2 seconds. Serialization (using a BinaryFormatter to a MemoryStream) is poor - around 90 seconds, due to a 13,608,188-byte lump of data. (Not the fault of LLBLGen, of course.)
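For reference, this kind of measurement can be sketched as follows (illustrative only; "collection" stands for the fetched entity collection, and DateTime is used because .NET 1.1 has no Stopwatch class):

    // Requires System, System.IO and System.Runtime.Serialization.Formatters.Binary
    BinaryFormatter formatter = new BinaryFormatter();
    MemoryStream stream = new MemoryStream();
    DateTime start = DateTime.Now;                      // no Stopwatch in .NET 1.1
    formatter.Serialize(stream, collection);
    TimeSpan elapsed = DateTime.Now - start;
    Console.WriteLine("{0:N0} bytes in {1:F2} seconds", stream.Length, elapsed.TotalSeconds);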

I have a simple check in GetObjectData to see whether this is a simple case or not. If not, do the default thing; otherwise we can reduce the overall size somewhere.

First optimization: the EntityFieldsDataArray. If these entities are freshly read from the database and have not been modified, then we are storing the same data twice, so we can serialize just half of the data.

Second optimization: don't store the booleans (which will always be false, according to the simple test) or most of the objects (which will be null, again due to the simple test).

Using these two optimizations saved nearly 25% in size (10,279,958 bytes) and nearly 28% in serialization time.

Further optimizations: I have a boolean which stores whether the serialized data is 'simple' or not. This could be changed to store bit flags instead, which could then be used to optimize the non-simple serialization by only storing non-default objects.

Storing the Name seems superfluous - I can't imagine this ever changing. It should be possible to change the Entity template to set _name in the deserialization constructor and add a _defaultName member (which wouldn't take up any more space, since the string is interned anyway?) to see whether the name has changed - if it has, then save it and set the relevant bit flag. This gets us down to 10,107,823 bytes.

ObjectID looks intriguing - in this scenario there are no related entities, so the ObjectID GUID could be left unserialized and a new one generated in the constructor. I'm not sure how safe this assumption is (Frans?), and I think that creating GUIDs is relatively expensive time-wise. However, it would get us down to 9,247,158 bytes (a 32.05% saving!).

There might be some optimization possible on the FieldFlags, which are stored in a BitArray. This sounds fairly optimal: using 30 bits (2 bits for each column - one for IsNull, the other for IsChanged) instead of 30 booleans (each 4 bytes?). But we already know that IsChanged will be false, so perhaps we could halve this. Just for fun, I tried not storing this object at all and saved 1,411,445 bytes! This means that each BitArray is taking 41 bytes rather than 30 bits.

More work to be done here, perhaps. Is there a safe way of regenerating the IsNull information from the data itself and the schema? It might be as simple as comparing the data to null and setting the bit, but I would be grateful, Frans, if you could confirm that.

Here is the serialization code I have so far:


            EntityFields2 fields = (EntityFields2) _fields;
            if (fields.IsDirty || _fields.State != EntityState.Fetched || _isNew || _isDeleted || 
                _validator != null || _relatedEntitySyncInfos != null || _field2RelatedEntity != null || 
                _concurrencyPredicateFactoryToUse != null || _savedFields != null || 
                _dataErrorInfoError != null || _dataErrorInfoErrorsPerField != null)
            {
                info.AddValue("_simple", false);
                // This is the standard serialization
                info.AddValue("_fieldsData", ((EntityFields2)_fields).GetFieldsDataArray());
                info.AddValue("_fieldsFlags", ((EntityFields2)_fields).GetFieldsTrackingFlagsArray());
                info.AddValue("_fieldsState", _fields.State);
                info.AddValue("_fieldsIsDirty", _fields.IsDirty);
                info.AddValue("_name", _name);
                info.AddValue("_isNew", _isNew);
                info.AddValue("_isDeleted", _isDeleted);
                info.AddValue("_validator", _validator);
                info.AddValue("_objectID", _objectID);
                info.AddValue("_relatedEntitySyncInfos", _relatedEntitySyncInfos);
                info.AddValue("_field2RelatedEntity", _field2RelatedEntity);
                info.AddValue("_concurrencyPredicateFactoryToUse", _concurrencyPredicateFactoryToUse);
                info.AddValue("_savedFields", _savedFields);
                info.AddValue( "_dataErrorInfoError", _dataErrorInfoError );
                info.AddValue( "_dataErrorInfoErrorsPerField", _dataErrorInfoErrorsPerField );
            } else
            {
                object[,] data = ((EntityFields2) _fields).GetFieldsDataArray();
                int l = data.Length / 2;
                object[] halfData = new object[l];
                for(int i = 0; i < l; i++)
                {
                    halfData[i] = data[0, i];
                }

                info.AddValue("_simple", true);
                info.AddValue("_fieldsData", halfData);
                info.AddValue("_fieldsFlags", ((EntityFields2)_fields).GetFieldsTrackingFlagsArray());
                //info.AddValue("_name", _name);
                //info.AddValue("_objectID", _objectID);
            }

Cheers Simon

simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 24-Jul-2006 11:47:14   

Another possible optimization: If I serialize an array of entities rather than the collection itself, I save over 2MB and a lot of time (especially on the Deserialize)!

Serialize Collection: 71.33 seconds
Total size: 9,247,158 bytes
Deserialize Collection: 58.34 seconds

Serialize Entity Array: 51.84 seconds
Total size: 7,123,699 bytes
Deserialize Entity Array: 6.56 seconds

Cheers Simon

mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 24-Jul-2006 19:21:24   

How do you serialize an array rather than an entity collection? Do you bring over the private members of the collection, like _entityFactoryToUse and _containingEntity?

mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 24-Jul-2006 20:24:16   

Our application routinely has to serialize > 4000 entities, so these performance issues are probably my #1 concern right now. I'm looking at using the serialization framework described at http://www.codeproject.com/csharp/CompactSerialization.asp

Does anyone have any experience with this, or know whether it is likely to help? From what I understand, the idea is to avoid reflection calls and to reduce the data size by not sending type information over the wire when it is unnecessary.

simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 24-Jul-2006 20:38:18   

mikeg22 wrote:

How do you serialize an array rather than an entity collection? Do you bring over the private members of the collection, like _entityFactoryToUse and _containingEntity?

Nothing special: I just copied the contents of the collection into an array and then tried to serialize that instead - something like...

    myEntity[] entityArray = new myEntity[myEntityCollection.Count];
    for(int i = 0; i < myEntityCollection.Count; i++)
    {
        entityArray[i] = myEntityCollection[i];
    }

... and then serialize the array (a collection would need to be created on the other end to put these into).

Cheers Simon

simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 24-Jul-2006 23:34:30   

mikeg22 wrote:

Our application routinely has to serialize > 4000 entities, so these performance issues are probably my #1 concern right now. I'm looking at using the serialization framework described at http://www.codeproject.com/csharp/CompactSerialization.asp

Does anyone have any experience with this, or know whether it is likely to help? From what I understand, the idea is to avoid reflection calls and to reduce the data size by not sending type information over the wire when it is unnecessary.

I was looking at this also but haven't done anything with it as yet. Let me know if you get anywhere with it. Every tweak helps!

The other thing I was considering investigating is a bit more radical. (This is based on the simple retrieval of a large collection of entities - prefetch paths may be possible as well, but this is a start.) When the DataAccessAdapter is located on a remote machine, it seems wasteful to create the entities there and then have to serialize them so they can be reconstituted on the client machine. We already know this is unusably slow for large collections because of the serialization involved. Since DataAccessAdapterBase is already on the client machine, it should be possible to derive a client-side DataAccessAdapter class which is capable of getting the raw data from the server side and generating the entities locally, where they are actually required - overriding the ExecuteMultiRowRetrievalQuery() method and modifying it so that it can obtain an alternative IDataReader is where I am looking.

This client-side DataAccessAdapter can talk to the remote DataAccessAdapter (the one with the real database connections) and get back a remote reference to a 'ResultSet'-type object. The server side returns the reference immediately and can asynchronously queue up all of the retrieved raw data for the query into this object. Meanwhile, the client side will (and I've not worked out exactly how just yet) be able to put a wrapper around this returned object which implements the IDataReader required. It can then merrily generate the entities from this queued raw data. The wrapper could retrieve blocks of data rows, which is more palatable to remoting (say 500 or 1000 at a time), either on request or possibly using a background thread.

Benefits are:
- Entities are only created on the client side.
- Remoting can cope with the smaller amounts of data.
- Data retrieval and entity creation can occur simultaneously.
- Server memory use is much lower, since only raw data is stored, and even then only for the time it takes the client to pull it.
- The client can put the data into an existing collection if required, which is normally more problematic with remoting since the pass-by-value mechanism creates a new collection.
- Since we are just using an alternate IDataReader, hopefully all of the inheritance hierarchy code will be none the wiser and will just work without any changes.
- It's transparent to the client code - if having a DataAccessAdapter directly on the client is or becomes possible (it isn't at my current client's, which is why we have to remote to an HTTP server), then this step can be bypassed, possibly by a configuration change, and the entities generated in-process, which is blazingly fast.

Since 2 seconds is all it takes for LLBLGen to produce the query, send it across the network, have it parsed by SQL Server, retrieve the data from disk, send it back across the network and create the entities, I am hopeful that another 2 seconds for just one more network hop might be feasible.

Cheers Simon

mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 25-Jul-2006 00:23:24   

From performance profiling, it doesn't seem like the big performance hits come from the creation of the entity graph in the DAA, the GetObjectData code, or the serialization constructors. Strangely, it seems that most of the processing is going on somewhere in the remoting sinks.

In an earlier thread, I hypothesized (guessed with almost no evidence simple_smile ) that the remoting sinks are spending a huge amount of time resolving "multiple instances of the same object" situations, which are all over the place in an LLBLGen entity graph. This is just a guess, based on the fact that serialization time seems to rise at an O(N^2) rate... that is, 5,000 entities take 2 seconds, 10,000 entities take 8 seconds, 20,000 entities take a minute, etc.

If this truly is the problem with serializing large graphs, it could be avoided by adding each instance of an entity in the graph to the "info" SerializationInfo object only once, instead of every time it is referenced in the object graph. So, instead of adding the actual entity to the info object, a unique object identifier would be added, to be resolved when the graph is being deserialized.

Anyway, this is something that would require a bit of effort to do, but I just wanted to give you my best thinking on this (really bad) performance problem... cry

simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 25-Jul-2006 07:52:54   

mikeg22 wrote:

From performance profiling, it doesn't seem like the big performance hits come from the creation of the entity graph in the DAA, the GetObjectData code, or the serialization constructors. Strangely, it seems that most of the processing is going on somewhere in the remoting sinks.

In an earlier thread, I hypothesized (guessed with almost no evidence simple_smile ) that the remoting sinks are spending a huge amount of time resolving "multiple instances of the same object" situations, which are all over the place in an LLBLGen entity graph. This is just a guess, based on the fact that serialization time seems to rise at an O(N^2) rate... that is, 5,000 entities take 2 seconds, 10,000 entities take 8 seconds, 20,000 entities take a minute, etc.

If this truly is the problem with serializing large graphs, it could be avoided by adding each instance of an entity in the graph to the "info" SerializationInfo object only once, instead of every time it is referenced in the object graph. So, instead of adding the actual entity to the info object, a unique object identifier would be added, to be resolved when the graph is being deserialized.

Anyway, this is something that would require a bit of effort to do, but I just wanted to give you my best thinking on this (really bad) performance problem... cry

That's interesting. How is the time split between serializing the entities and processing them through the sinks?

I have a copy of Ingo Rammer's "Advanced .NET Remoting" in front of me - I wonder if I can hack one of the samples to get a better idea of how it works (and possibly how to give Entity collections 'special' VIP treatment)

Cheers Simon

Otis
LLBLGen Pro Team
Posts: 39588
Joined: 17-Aug-2003
# Posted on: 25-Jul-2006 09:41:37   

simmotech wrote:

mikeg22 wrote:

How do you serialize an array rather than an entity collection? Do you bring over the private members of the collection, like _entityFactoryToUse and _containingEntity?

Nothing special: I just copied the contents of the collection into an array and then tried to serialize that instead - something like...

    myEntity[] entityArray = new myEntity[myEntityCollection.Count];
    for(int i = 0; i < myEntityCollection.Count; i++)
    {
        entityArray[i] = myEntityCollection[i];
    }

... and then serialize the array (a collection would need to be created on the other end to put these into)

Cheers Simon

psst ((ICollection)myEntityCollection).CopyTo(...) wink
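Spelled out, using the names from Simon's snippet, that suggestion would look something like:

    // ICollection.CopyTo fills an existing array, starting at the given index
    myEntity[] entityArray = new myEntity[myEntityCollection.Count];
    ((ICollection)myEntityCollection).CopyTo(entityArray, 0);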

Great suggestions so far simple_smile

One thing to realize is that the SOAP formatter already uses reference markers pointing to a single object definition, so I suspect the binary formatter does the same: one object definition, and every reference to that same object simply gets a marker / pointer ID instead of a copy of the object.

What I've found is that it's way slower to serialize/deserialize a lot of small objects than a couple of big objects, even if it's the same amount of data. The sad thing is: unless you create your own protocol and your own formatter object (you could do that; it's perhaps more efficient), you have to work with objects and thus pay the penalty when deserializing them.

The 'build a resultset on the server, pass it to the client and instantiate the entities there' idea is nice, though you still have to marshal the resultset, so unless you create your own formatter as well, it's not that helpful, I'm afraid.

The idea of how entities are serialized now is to bring down the number of objects, so instead of entity field objects with objects inside them, simply the values and the original values are stored, as well as some flags in a BitArray. I'm surprised the BitArray takes that amount of space though!

There are several commercial remoting frameworks on the market which effectively offer you a different formatter. What I didn't want is to tie the code to a special formatter, as that would rule out the use of other formatters. However, if data is sent which is known on the client anyway, it's of course not necessary to embed that data. Looking at the set of data currently packed into an adapter entity when it's sent over the wire, I don't see how I can cut out data without adding some flags: you can only remove voids in the data arrays if you also have an index array which indicates which field index is represented by the value at position n in the data array.

You also can't cut out data which seems static without adding code elsewhere to set it at deserialization time. That might sound simple, but after release, as we are now, if you have to change both templates AND runtimes to add a feature/fix, it will break a lot of stuff, because people upgrade either the templates and not the runtimes, or the runtimes and not the templates, etc.

ObjectID is required, as otherwise you'll get a new one on the server and again when the entity is sent back. Then you can't use a context to find the original instance. One possible optimization is to store the bytes of the GUID instead of the serialization output of the GUID, but I'm not sure that would make a huge impact.

You could indeed re-create the IsNull flags from the data! simple_smile Good idea: IsNull is false when IsNew is true; IsNull is true when the DbValue of a field is null and IsNew is false.
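Applied in the deserialization path, that rule might look roughly like this (a sketch only; it assumes the IsNull flag can be written directly, whereas the real field object may need an internal setter for this):

    // Rebuild IsNull from the data itself, per Frans' rule:
    // IsNull is false when IsNew is true; otherwise it is true
    // exactly when the field's DbValue is null.
    for (int i = 0; i < _fields.Count; i++)
    {
        _fields[i].IsNull = !_isNew && (_fields[i].DbValue == null);
    }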

The bit flags for fields could be done with a set of longs and bit voodoo. I thought .NET would do that for me in a BitArray - what's the purpose of a BitArray if it uses a byte per bit anyway simple_smile . The DataTable binary serialization uses a BitArray as well, btw; I 'learned' the idea from that code wink

Frans Bouma | Lead developer LLBLGen Pro
simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 25-Jul-2006 13:29:55   

Otis wrote:

The bit flags for fields could be done with a set of longs and bit voodoo. I thought .NET would do that for me in a BitArray - what's the purpose of a BitArray if it uses a byte per bit anyway simple_smile . The DataTable binary serialization uses a BitArray as well, btw; I 'learned' the idea from that code wink

Just looked into this a bit more: in memory everything is indeed stored compactly. However, BitArray doesn't implement ISerializable, so all its private members get stored, including _version, which is only used for enumeration anyway - that's a waste of 4 bytes per instance!

Also, since it is an object rather than a value, the serializer seems to store a 16-byte identifier in the stream!!!

I just proved this by using BitArray.CopyTo to populate a byte[] and serializing that instead (BitArray supports this, along with bool[] and int[]). BinaryFormatter seems to consider this a value type and stores it 'inline' with the entity data, saving ~30 bytes per entity!!

It seems that values are stored more compactly than objects, unless the object is shared.

Next I will try the 'create your own byte[] using a MemoryStream' approach, store that, and see if it is any better.

Cheers Simon

simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 26-Jul-2006 13:01:32   

OK, I've gone a bit further on this:

At the bottom of this message is a sample of my test GetObjectData() method thus far.

As it stands I can transparently (by changing only the serialization/deserialization code itself) get a 34.07% reduction in size. Serializing as an array of entities rather than a collection gives me a 49.67% saving.

This works by using a BitVector32 to store flags saying what information is and isn't included in the serialization stream. Only 12 of the possible 32 flags are used, so there is room for expansion. Only items which need to be stored actually are stored.

I have two further optimizations (you can see them commented out in the code) which can give a 41.66% reduction (collection as-is) and a 57.26% reduction (entity[]).

The first optimization is not serializing _name. I had already mentioned this, and you had reservations, as it also requires a change to the templates, and there is a chance of a version mismatch if the developer doesn't update the templates at the same time as the runtime. Personally, I think it is worth looking at, since the worst that could happen is a compile-time error as a warning to get the latest templates/runtimes.

However, here is the plan: in EntityBase2, a new string field "_defaultName" is added. This is populated in InitClass at the same time as "_name" and with the same value. The only difference is that _name is read/write and _defaultName is read-only. The two can then be compared at serialization time, and _name serialized only if they differ. In the entity class's deserialization constructor, the entity name is already known to the template, so _defaultName is set there. In addition, if _name is null (i.e. wasn't set during deserialization), it is also set to the same value.
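The constructor side of that plan might be sketched like this (hypothetical template output; "CustomerEntity" stands in for whatever entity the template generates):

    // In the generated entity's deserialization constructor:
    protected CustomerEntity(SerializationInfo info, StreamingContext context)
        : base(info, context)
    {
        _defaultName = "CustomerEntity";    // known to the template at generation time
        if (_name == null)                  // _name wasn't in the stream: it was the default
        {
            _name = _defaultName;
        }
    }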

The second optimization concerns the _objectID GUID. A quick look with ReSharper shows that _objectID is only accessed internally within EntityBase2. (The ObjectID property is used externally, but that shouldn't be a problem...) Currently, a new entity always gets a new Guid in InitClass(). Instead of doing this, set it to Guid.Empty and change the ObjectID property getter to create a new Guid if it is currently Empty. Thus an ObjectID is only generated when one is actually needed. (I suppose this could be considered an optimization in itself, since getting a new Guid is expensive relative to setting a known value.) In addition, there are 4 or 5 places within EntityBase2 where _objectID is read; changing these to ObjectID will make the above change transparent. Now we only need to serialize the Guid if it is not Empty! (Incidentally, I tried your suggestion of storing a Guid as a byte[], but this is actually larger - obvious in hindsight when you consider that an int is required to store the array size, though I found an increase of 6 bytes per entity instead of 4. Go figure.)
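A sketch of the lazy getter (assuming InitClass() now sets _objectID to Guid.Empty):

    public virtual Guid ObjectID
    {
        get
        {
            if (_objectID == Guid.Empty)
            {
                _objectID = Guid.NewGuid();   // generated only when actually needed
            }
            return _objectID;
        }
        set { _objectID = value; }
    }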

I have some other things to try:

1) Tweak EntityCollectionBase2 along the same lines as above, including serializing its entities as an array. This should work well for collections that are self-contained, but might need an alternative method if the graph is not just the collection and its entities alone.
2) Use the FastSerializer mentioned above, which makes everything a self-contained byte[], all within the GetObjectData() method, so it doesn't affect remoting.
3) Look into how fast it would be to have a specific method in a derivative of DataAccessAdapter that passes self-contained collections only via a byte[]. The serializing is then done by this method, and Remoting would only see a byte[] and nothing else, so there should be a speed improvement, if not necessarily a size improvement, since there is only one object.
4) Develop a BitVector16 class. This should save two bytes per entity. Possible downsides: only 4 bits left over, and it might be slower than setting bits on an int.
5) Compression - can be done either in remoting or, if (3) proves successful, directly at serialization time.
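Point 3 might be sketched as follows (hypothetical method name and signature; the idea is just that remoting then marshals one opaque byte[] rather than a large object graph):

    // Hypothetical helper on a DataAccessAdapter derivative:
    public byte[] SerializeCollectionToBytes(IEntityCollection2 collection)
    {
        BinaryFormatter formatter = new BinaryFormatter();
        MemoryStream stream = new MemoryStream();
        formatter.Serialize(stream, collection);
        return stream.ToArray();
    }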



        private static readonly int AreFieldsDirtyMask = BitVector32.CreateMask();
        private static readonly int IsFieldStatusNotFetchedMask = BitVector32.CreateMask(AreFieldsDirtyMask);
        private static readonly int isNameNotDefaultMask = BitVector32.CreateMask(IsFieldStatusNotFetchedMask);
        private static readonly int IsNewMask = BitVector32.CreateMask(isNameNotDefaultMask);
        private static readonly int IsDeletedMask = BitVector32.CreateMask(IsNewMask);
        private static readonly int HasValidatorMask = BitVector32.CreateMask(IsDeletedMask);
        private static readonly int HasObjectIDMask = BitVector32.CreateMask(HasValidatorMask);
        private static readonly int HasRelatedEntitySyncInfosMask = BitVector32.CreateMask(HasObjectIDMask);
        private static readonly int HasField2RelatedEntityMask = BitVector32.CreateMask(HasRelatedEntitySyncInfosMask);
        private static readonly int HasConcurrencyPredicateFactoryToUseMask = BitVector32.CreateMask(HasField2RelatedEntityMask);
        private static readonly int HasSavedFieldsMask = BitVector32.CreateMask(HasConcurrencyPredicateFactoryToUseMask);
        private static readonly int HasDataErrorInfoErrorMask = BitVector32.CreateMask(HasSavedFieldsMask);
        private static readonly int HasDataErrorInfoErrorsPerFieldMask = BitVector32.CreateMask(HasDataErrorInfoErrorMask);

        [EditorBrowsable(EditorBrowsableState.Never)]
        public virtual void GetObjectData(SerializationInfo info, StreamingContext context)
        {
            BitVector32 serializationFlags = new BitVector32();
            
            if (_fields.IsDirty)
            {

                // Set the flag so the deserialization code knows what to expect
                serializationFlags[AreFieldsDirtyMask] = true;

                // Save both current and database data
                info.AddValue("_fieldsData", ((EntityFields2) _fields).GetFieldsDataArray());

                // Save the modified flags for the fields
                BitArray fieldsFlagBits = new BitArray(_fields.Count);
                for (int i = 0; i < _fields.Count; i++) fieldsFlagBits[i] = _fields[i].IsChanged;
                byte[] fieldFlags = new byte[(fieldsFlagBits.Length + 7) / 8];
                fieldsFlagBits.CopyTo(fieldFlags, 0);
                info.AddValue("_fieldsFlags", fieldFlags);
                
            } else
            {
                
                // Just save one of the sets of values
                object[] dbData = new object[_fields.Count];
                for(int i = 0; i < _fields.Count; i++) dbData[i] = _fields[i].DbValue;
                info.AddValue("_fieldsData", dbData);
            }
            
            // Might be some further optimization here since there are, currently, just
            // four states: these could be stored in 2 (or 3 for future functionality) flag bits
            // alternatively one or more of these, such as New may not be required if the entity
            // IsNew is set - Check with Frans on this
            if (_fields.State != EntityState.Fetched)
            {
                serializationFlags[IsFieldStatusNotFetchedMask] = true;
                info.AddValue("_fieldsState", _fields.State);
            }
    
            // This optimization requires the readonly _defaultName to be set in the template
            // Frans points out that there might be a mismatch between templates and runtime
            // library versions but I don't think it will compile so would be a reasonable
            // pointer to get the latest common versions
            // Since I have not changed the template, will always serialize for now
//          if (_name != _defaultName)
//          {
                serializationFlags[isNameNotDefaultMask] = true;
                info.AddValue("_name", _name);
//          }
            
            serializationFlags[IsNewMask] = _isNew;
            serializationFlags[IsDeletedMask] = _isDeleted;
            
            if (_validator != null)
            {
                serializationFlags[HasValidatorMask] = true;
                info.AddValue("_validator", _validator);
            }

            // Possible future optimization:
            // Entity has Empty Guid and Getter on ObjectID will lazy-generate a new one
            // This means for objects created remotely and then immediately serialized back
            // to the client, we don't need a Guid at the point of creation but if one already
            // exists, it will be preserved
//          if (_objectID != Guid.Empty)
//          {
                serializationFlags[HasObjectIDMask] = true;
                info.AddValue("_objectID", _objectID);
//          }
            
            if (_relatedEntitySyncInfos != null)
            {
                serializationFlags[HasRelatedEntitySyncInfosMask] = true;
                info.AddValue("_relatedEntitySyncInfos", _relatedEntitySyncInfos);
            }

            if (_field2RelatedEntity != null)
            {
                serializationFlags[HasField2RelatedEntityMask] = true;
                info.AddValue("_field2RelatedEntity", _field2RelatedEntity);
            }
            
            if (_concurrencyPredicateFactoryToUse != null)
            {
                serializationFlags[HasConcurrencyPredicateFactoryToUseMask] = true;
                info.AddValue("_concurrencyPredicateFactoryToUse", _concurrencyPredicateFactoryToUse);
            }
            
            if (_savedFields != null)
            {
                serializationFlags[HasSavedFieldsMask] = true;
                info.AddValue("_savedFields", _savedFields);
            }
            
            if (_dataErrorInfoError != null)
            {
                serializationFlags[HasDataErrorInfoErrorMask] = true;
                info.AddValue( "_dataErrorInfoError", _dataErrorInfoError );
            }
            
            if (_dataErrorInfoErrorsPerField != null)
            {
                serializationFlags[HasDataErrorInfoErrorsPerFieldMask] = true;
                info.AddValue( "_dataErrorInfoErrorsPerField", _dataErrorInfoErrorsPerField );
            }
            
            info.AddValue("_flags", serializationFlags.Data);

            OnGetObjectData(info, context);
        }

Otis
LLBLGen Pro Team
Posts: 39588
Joined: 17-Aug-2003
# Posted on: 26-Jul-2006 13:44:57   

Looks great! simple_smile I'll try to run some tests this coming weekend simple_smile

I have to add: the combined template + runtime change combo is out of the question for now. I've done that in the past and it wasn't pleasant: people expect a clean build; if it breaks, they'll ask us what's wrong. That creates support hell, which I won't risk over this.

The GUID optimization is a good one, didn't think of that! simple_smile .

The array approach could be done from the template though. The generated code's GetObjectData() simply adds the collections to the info object, but it could of course call a utility method which creates an array of entities from it. The only drawback is that an entity collection by itself can have state, and you'll miss that by using an array, so an array is likely not an option in many scenarios.

The BitVector32 usage is elegant, I like it simple_smile It could also be used to store the IsDirty flags, with one BitVector32 structure per 32 fields, but perhaps that's overkill, I don't know.

Frans Bouma | Lead developer LLBLGen Pro
simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 26-Jul-2006 14:09:07   

Otis wrote:

Looks great! simple_smile I'll try to run some tests this coming weekend simple_smile I've to add: the combined template + runtime change combi is out of the question for now. I've done that in the past and it wasn't pleasant: people expect a clean build, if it breaks they'll ask us what's wrong. This thus creates support hell, which I won't risk over this.

OK, maybe a future release.

!!Actually, I've just noticed that the factory for the entity has a ForEntityName property which happens to be exactly what we need for the comparison - maybe we can use this string to compare against _name? !!

Otis wrote:

The array approach could be done from the template though. The generated code's GetObjectData() simply adds the collections to the info object, but it could of course call a utility method which creates an array of entities from it. The only drawback is that an entity collection by itself can have state and you'll miss that by using an array, so an array is likely not an option in many scenario's.

Not sure what you mean here: the collection's entities can be stored as objects as they are now or, under certain safe circumstances such as when the entities don't reference anything else and the collection is not owned by an entity, the entities can be stored directly as a byte[] (ie value type) generated via a BinaryFormatter and this fact noted in a BitVector32 flag. The rest of the collection state is still stored as now but using BitVector32 flags where possible.

Otis wrote:

The BitVector32 usage is elegant, I like it simple_smile It can also be used to store the IsDirty flags, just use per 32 fields a bitvector32 structure. but perhaps overkill, don't know.

I like BitVector32 also - it has another use in that you can specify regions in the same manner (but not at the same time as the masks), so you can store variable-length items.

Using multiple BitVector32s for the fields' flags won't be any more effective, because the code already converts the bits to the minimum number of bytes and stores those.
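For reference, a minimal sketch of the two BitVector32 modes discussed here (masks for boolean flags, sections for small packed values); the flag names are purely illustrative, and note that masks and sections can't be mixed in the same vector:

```csharp
using System;
using System.Collections.Specialized;

class BitVectorDemo
{
    static void Main()
    {
        // Mask mode: independent boolean flags, one bit each.
        int hasSavedFields = BitVector32.CreateMask();               // bit 0
        int hasErrorInfo   = BitVector32.CreateMask(hasSavedFields); // bit 1

        BitVector32 flags = new BitVector32(0);
        flags[hasSavedFields] = true;
        Console.WriteLine(flags[hasSavedFields]); // True
        Console.WriteLine(flags[hasErrorInfo]);   // False

        // Section mode: small packed integers (use a separate vector).
        BitVector32.Section state = BitVector32.CreateSection(7);        // 3 bits
        BitVector32.Section isNew = BitVector32.CreateSection(1, state); // 1 bit

        BitVector32 packed = new BitVector32(0);
        packed[state] = 3; // e.g. an entity state value
        packed[isNew] = 1;
        Console.WriteLine(packed[state]); // 3
    }
}
```

Both vectors serialize as a single Int32 via their Data property, which is what makes them attractive here.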

Cheers Simon

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39588
Joined: 17-Aug-2003
# Posted on: 26-Jul-2006 15:08:44   

simmotech wrote:

Otis wrote:

Looks great! simple_smile I'll try to run some tests this coming weekend simple_smile I've to add: the combined template + runtime change combi is out of the question for now. I've done that in the past and it wasn't pleasant: people expect a clean build, if it breaks they'll ask us what's wrong. This thus creates support hell, which I won't risk over this.

OK, maybe a future release.

!!Actually, I've just noticed that the factory for the entity has a ForEntityName property which happens to be exactly what we need for the comparison - maybe we can use this string to compare against _name? !!

Indeed! simple_smile Completely forgot about that property. _name is only set in the constructor, so a subtype can set the LLBLGenProEntityName property's value. During deserialization it can thus always be set via the factory, which is obtained by calling CreateEntityFactory().

Otis wrote:

The array approach could be done from the template though. The generated code's GetObjectData() simply adds the collections to the info object, but it could of course call a utility method which creates an array of entities from it. The only drawback is that an entity collection by itself can have state and you'll miss that by using an array, so an array is likely not an option in many scenario's.

Not sure what you mean here: the collection's entities can be stored as objects as they are now or, under certain safe circumstances such as when the entities don't reference anything else and the collection is not owned by an entity, the entities can be stored directly as a byte[] (ie value type) generated via a BinaryFormatter and this fact noted in a BitVector32 flag. The rest of the collection state is still stored as now but using BitVector32 flags where possible.

Ok, I think we're talking about 2 different things simple_smile I was under the impression you wanted to convert EntityCollection<T> instances to T[] arrays and serialize those simple_smile .

Otis wrote:

The BitVector32 usage is elegant, I like it simple_smile It can also be used to store the IsDirty flags, just use per 32 fields a bitvector32 structure. but perhaps overkill, don't know.

I like BitVector32 also - it has another use in that you can specify regions in the same manner (but not at the same time as the masks), so you can store variable-length items.

Using multiple BitVector32s for the fields' flags won't be any more effective, because the code already converts the bits to the minimum number of bytes and stores those.

Ok, then that's not that useful indeed.

Frans Bouma | Lead developer LLBLGen Pro
simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 28-Jul-2006 11:20:04   

Hi Frans

Here are my latest figures:

Out of Box: 13,608,188 bytes; serialize in 92.81 seconds; deserialize in 58.00 seconds

Best non-factory method Yet: 6,814,154 bytes; serialize in 2.61 seconds; deserialize in 3.84 seconds (50% reduction in size - can get 57% but much slower) (95% reduction in time)

Best Yet (assuming we can somehow identify the type using existing LLBLGen meta-data) 5,953,338 bytes; serialize in 1.56 seconds; deserialize in 2.64 seconds (56.25% reduction in size) (97.22% reduction in time)

I have a couple of questions for you: EntityBase2 has _IsNew and _IsDeleted properties; its contained EntityFields2 has an EntityState which can be New, Fetched, OutOfSync or Deleted. What is the relationship between the New and Deleted states of EntityFields2 and those of EntityBase2? Are any more states likely to be added to EntityFields2?

The "Best Yet" figure assumes that we can identify an entity type within a single int32 or less (not a problem with my test rig - I am only using one type currently and I stick in a dummy Int32 value for now). I was thinking about trying to make use of the generated EntityType enum but since that is not in the support classes I wondered if there was a way to get to it from a given Entity instance. I suppose reflection could be used to find the assembly containing the entity type and then look for an EntityType enum. I don't know how common this might be but I believe it is possible that an application might use multiple LLBL-generated projects and therefore have multiple EntityType enums.

I also have in mind that if a collection has _entityFactoryToUse set and all the entities contained within the collection are of the type returned by that entity factory then I could set a bitflag and there would be no need to store the entity type at all.

Cheers Simon

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39588
Joined: 17-Aug-2003
# Posted on: 28-Jul-2006 11:40:14   

simmotech wrote:

Hi Frans

Here are my latest figures:

Out of Box: 13,608,188 bytes; serialize in 92.81 seconds; deserialize in 58.00 seconds

Best non-factory method Yet: 6,814,154 bytes; serialize in 2.61 seconds; deserialize in 3.84 seconds (50% reduction in size - can get 57% but much slower) (95% reduction in time)

Best Yet (assuming we can somehow identify the type using existing LLBLGen meta-data) 5,953,338 bytes; serialize in 1.56 seconds; deserialize in 2.64 seconds (56.25% reduction in size) (97.22% reduction in time)

I'm floored by these numbers, excellent excellent work! smile . These optimizations have big potential. Let me first address your questions, then I can give you mine wink

I have a couple of questions for you:- EntityBase2 has _IsNew and _IsDeleted properties; Its contained EntityFields2 has EntityState which can be New, Fetched, OutOfSync, Deleted. What is the relationship between the New and Delete of EntityFields2 compared to EntityBase2? Are there likely to be any more states added to EntityFields2?

These are more or less leftovers from the very first design: the idea is that the EntityFields(2) objects are DTOs with a state of their own. The boolean flags are used for logic that doesn't test the state: 'IsNew' can be true while the fields' state is OutOfSync, hence the need for the booleans. The state is checked in other situations, e.g. to determine whether the entity is out of sync or not. I don't foresee any more states being added, as there aren't any more states thinkable. I've seen some O/R mappers which have more states per entity, but these are rather far-fetched and IMHO don't serve any meaning; for example, I don't believe in a 'fetching' state, as there's no meaning to that for the reading code.

So if you're conservative and reserve 3 bits for the state and 1 for IsNew, you're fine. IsDeleted can be determined from the state.

The "Best Yet" figure assumes that we can identify an entity type within a single int32 or less (not a problem with my test rig - I am only using one type currently and I stick in a dummy Int32 value for now). I was thinking about trying to make use of the generated EntityType enum but since that is not in the support classes I wondered if there was a way to get to it from a given Entity instance. I suppose reflection could be used to find the assembly containing the entity type and then look for an EntityType enum. I don't know how common this might be but I believe it is possible that an application might use multiple LLBL-generated projects and therefore have multiple EntityType enums.

All entity types are definable within an Int32; otherwise there would have to be over 2^31 entity types, which isn't that logical wink . You can ask an entity for its EntityType value by reading the LLBLGenProEntityTypeValue property. LLBLGenProEntityName returns _name btw, so use the factory.

I also have in mind that if a collection has _entityFactoryToUse set and all the entities contained within the collection are of the type returned by that entity factory then I could set a bitflag and there would be no need to store the entity type at all.

That's not possible, because you can have supertypes and subtypes in a collection of the supertype simple_smile .

My question to you: what are the specifics for the 2 setups you listed the numbers for? I.e.: what do I need to change to get these numbers with the out-of-the-box code simple_smile (you can keep it brief, just describe what's to be done)

Frans Bouma | Lead developer LLBLGen Pro
simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 28-Jul-2006 13:08:12   

Otis wrote:

My question to you: - what are the specifics for the 2 setups you listed the numbers for? I.e.: what do I need to change to get these numbers with the out-of-the-box code simple_smile (you can keep it briefly, just describe what's to be done)

Well I've got three types of serialization at the moment.

1) Safe (not mentioned above) This uses the BitVector32 flags (which I've now separated into another method because all three methods use them) and saves only the information which isn't considered 'default', i.e. not null or State != Fetched (which I assume would be the most common case in serialization, but I will look at using flags as per your previous message) etc. This actually gives the best compression (57.26%) but is still slow to serialize (s=64.99s, d=3.69s)

2) FastSerializer (Best non-Factory method) This is a slightly optimized version of the code from http://www.codeproject.com/csharp/FastSerialization.asp Basically, instead of storing the various items each under their own name, all of the values are stored in a single byte[]. Anything that is not a value type is stored by FastSerializer using a BinaryFormatter as normal. We get all the benefits of custom serialization with no new interfaces required.

3) FastSerializer (Factory method) The only issue with 2) is that type information for non-value types is still stored by the BinaryFormatter. This works fine, but serialization is still slow. Since LLBLGen already has a lot of type information, if we can work out a way of identifying a type and embedding it in a single Int32 (or less!), we have all the info needed for a collection to save its entities without using a BinaryFormatter. Still no new interfaces required, but tied to the LLBLGen Pro infrastructure (not a problem really, since this is all internal to GetObjectData and the EntityBase2 deserialization constructor and won't be seen externally). By splitting the functionality into separate methods, we can have one version that returns a byte[] of the whole collection. I have a feeling this will go through the remoting infrastructure much faster than a method returning an EntityCollection, yet it contains everything required to recreate the EntityCollection on the other end.

I think overall what would be best is a combination of methods 2 and 3 to allow the minimum necessary information to be serialized but there may be a number of different optimized methods depending on the state of the item to be stored (some may be faster but only work with certain combinations of state).
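The 'flags first, then only non-default members' pattern from method 1 can be sketched like this (the mask name, members and helper methods are illustrative, not LLBLGen's actual code):

```csharp
using System;
using System.Collections;
using System.Collections.Specialized;
using System.Runtime.Serialization;

class FlagDemo
{
    // Illustrative mask; real code defines one per optional member.
    static readonly int HasSavedFieldsMask = BitVector32.CreateMask();

    // Write: set a flag per optional member; add the member only if present.
    static void Write(SerializationInfo info, Hashtable savedFields)
    {
        BitVector32 flags = new BitVector32(0);
        if (savedFields != null)
        {
            flags[HasSavedFieldsMask] = true;
            info.AddValue("_savedFields", savedFields);
        }
        info.AddValue("_flags", flags.Data); // one Int32 covers 32 members
    }

    // Read: consult the flags to know which members were actually stored.
    static Hashtable Read(SerializationInfo info)
    {
        BitVector32 flags = new BitVector32(info.GetInt32("_flags"));
        if (flags[HasSavedFieldsMask])
            return (Hashtable)info.GetValue("_savedFields", typeof(Hashtable));
        return null;
    }

    static void Main()
    {
        StreamingContext ctx = new StreamingContext();

        SerializationInfo info =
            new SerializationInfo(typeof(FlagDemo), new FormatterConverter());
        Write(info, new Hashtable());
        Console.WriteLine(Read(info) != null); // True

        SerializationInfo empty =
            new SerializationInfo(typeof(FlagDemo), new FormatterConverter());
        Write(empty, null);
        Console.WriteLine(empty.MemberCount);   // 1: only "_flags" was stored
        Console.WriteLine(Read(empty) == null); // True
    }
}
```

The saving comes from null members costing a single bit instead of a named entry in the SerializationInfo.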

I've got to go now but I'll send you sample code as soon as I have tidied it up.

Cheers Simon

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39588
Joined: 17-Aug-2003
# Posted on: 28-Jul-2006 13:16:29   

That would be awesome, thanks simple_smile

Frans Bouma | Lead developer LLBLGen Pro
simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 08-Aug-2006 15:14:03   

Hi Frans - I'm still going on this.... simple_smile

Here are my latest figures: Out of Box: 13,608,188 bytes; serialize in 92.81 seconds; deserialize in 58.00 seconds

Previous Best Yet (assumed I could identify the type using existing LLBLGen meta-data): 5,953,338 bytes; serialize in 1.56 seconds; deserialize in 2.64 seconds (56.25% reduction in size) (97.22% reduction in time)

Current Best: 4,002,491 bytes; serialize in 0.92 seconds; deserialize in 1.30 seconds (70.59% reduction in size) (98.53% reduction in time)

New features:
- No need to reference any EntityType enum (so there is no problem if there is more than one LLBLGen project in the solution).
- Came up with a way of caching the entity factories used during serialization and then storing them in the minimum necessary space, so if there are fewer than 255 entity types in a collection then only a single byte is stored.

Further optimizations to try:
- Optimize for 'common' values such as zero, one, MinValue and MaxValue, which should then be storable in one byte rather than 2 to 4.
- Look for sequences of nulls within an object[]. Each null takes only 1 byte anyway, but runs of nulls (up to 255) can be stored in just 2 bytes.
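Both of these ideas (common-value tokens and null-run compression) can be sketched roughly like this; the token byte values are invented purely for illustration:

```csharp
using System;
using System.IO;

class CompactWriter
{
    // Illustrative token bytes; a real scheme would reserve a full token table.
    const byte TokenZero = 0, TokenOne = 1, TokenInt32 = 2, TokenNullRun = 3;

    static void WriteInt32(BinaryWriter w, int value)
    {
        if (value == 0)      w.Write(TokenZero);       // 1 byte instead of 5
        else if (value == 1) w.Write(TokenOne);        // 1 byte instead of 5
        else { w.Write(TokenInt32); w.Write(value); }  // token + 4 bytes
    }

    static void WriteNullRun(BinaryWriter w, int count)
    {
        // Up to 255 consecutive nulls collapse into 2 bytes.
        w.Write(TokenNullRun);
        w.Write((byte)count);
    }

    static void Main()
    {
        MemoryStream ms = new MemoryStream();
        BinaryWriter w = new BinaryWriter(ms);
        WriteInt32(w, 0);       // 1 byte
        WriteInt32(w, 1);       // 1 byte
        WriteInt32(w, 123456);  // 5 bytes
        WriteNullRun(w, 200);   // 2 bytes for 200 nulls
        Console.WriteLine(ms.Length); // 9
    }
}
```

As Frans notes later in the thread, pushed far enough this approach starts to resemble a general-purpose compressor.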

I have done minimal testing - just enough to prove the concept. I have two sets of data to serialize: one is a simple collection of the same entity (as per the figures above), and one is a subset (2000 entities) with each top-level entity having a 1->m relationship and that collection containing a single entry (to test for aggregates). I am using XmlSerializer to write the before- and after-serialization states to strings and comparing them for equality.

Questions: 1) Do you have any kind of test harness code and/or sample data available for comparing entities/collections? It would be useful if you do, so that I can be sure that things like inherited classes and other good stuff that I don't have knowledge about are not compromised. (In any event, I still have a 'safe' serializer that uses the flags to reduce data size but lets the BinaryFormatter take care of the object graph)

2) I currently have most of the code in nested classes (in EntityBase2 and EntityCollectionBase2). The reason being that access is required to some private fields. If this is something that you would consider putting into a future version would you use nested private classes or make the private fields internal and move the code to an external class?

3) One of the assumptions the fast serializer makes is that entities and collections by and large do not have circular references. ie A Collection 'owns' all of its entities and they in turn 'own' any collections/entities stored in their properties. Do you see any problems with this?

4) All of the above leaves EntityBase2 and EntityCollectionBase2 pretty much unmodified. In fact, the only non-static addition I think I have made is an object reference so that the safe serializer can do an IDeserializationCallback (and even that could probably be removed and the same work done via an event). Would you be willing to add serialization-specific methods (nothing that would affect any existing code, of course!) for further optimization? I don't have anything specific in mind at the moment, but I ran a profiler and there seem to be a lot of method calls made when recreating an entity collection, for example. If it were possible to add a private or internal method that does exactly the same from a known initial state, without checking events/flags/relations etc., I think this would help.

Cheers Simon

mikeg22
User
Posts: 411
Joined: 30-Jun-2005
# Posted on: 08-Aug-2006 21:05:50   

I think entity graphs almost always have circular references, i.e. with EntityA:EntityB (1:n), EntityA instances will have a collection of EntityB instances, and EntityB instances will have an EntityA member instance pointing back. This may not have been what you were talking about though simple_smile

Also, EntityA:EntityB(1:N), EntityC:EntityB(1:N). It is possible to have EntityA:EntityC(1:N) in this situation using EntityA.EntityCviaEntityB and also EntityC.EntityAviaEntityB. This is a type of circular reference.

simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 09-Aug-2006 05:16:30   

mikeg22 wrote:

I think entity graphs almost always have circular references. IE EntityA:EntityB (1:N). EntityA instances will have a collection of EntityB instances, and EntityB instances will have an EntityA member instance pointing back. This may not have been what you were talking about though simple_smile

Also, EntityA:EntityB(1:N), EntityC:EntityB(1:N). It is possible to have EntityA:EntityC(1:N) in this situation using EntityA.EntityCviaEntityB and also EntityC.EntityAviaEntityB. This is a type of circular reference.

I don't think the relationships examples you give should be a problem.

The way my serialization works is that the single object passed (collection or entity currently, but maybe UnitOfWork would be a good candidate) is considered the 'root' object and 'owns' child entities, which in turn 'own' any collections/entities held internally. Each is responsible for serializing itself and its 'owned' items. Parent back-references are not a problem (they are ignored simple_smile ) and are recreated automatically during deserialization. Thus it is more of a tree with a single collection or entity at the root.

The standard .NET serializer doesn't work quite like this. You can give it any object in the tree and it works both up and down (and sideways!) to recreate every object associated directly or indirectly with the initial object. It will also ensure that each object in the tree is only saved once no matter how many other objects reference it - that's why it is so slow for large numbers of objects.

Should the passed object not be a root object, my serializer could throw an exception, traverse up the tree and serialize from the root, just serialize downwards, or simply let the normal serializer do its graph thing.

I think the circular reference question was more about situations where a member of the tree references another part of the tree where it is not part of a parent/child situation.

Cheers Simon

simmotech
User
Posts: 1024
Joined: 01-Feb-2006
# Posted on: 10-Aug-2006 09:36:47   

Hi Frans

My earlier message had some questions - maybe you missed them. The answers would be helpful in putting my serialization stuff in a more final, usable state.

Had another thought last night - using tokens for strings to prevent duplication. I thought this might slow things down too much, so a further idea was to have two versions - Fast and Compact.

Tried it this morning and it worked a treat - it even sped things up. Current results are: Original: 13,608,188 bytes; serialize in 92.81 seconds; deserialize in 58.00 seconds

Current Best: 2,590,641 bytes; serialize in 0.29 seconds; deserialize in 1.10 seconds (80.96% reduction in size) (99.68% reduction in time); 320x faster for serialization and 108x faster overall
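The string-token idea can be sketched like this (a hypothetical helper, not the actual implementation): each distinct string is written in full once, and subsequent occurrences are written as a small integer token:

```csharp
using System;
using System.Collections;
using System.IO;

class StringTokenizer
{
    Hashtable _tokens = new Hashtable(); // string -> int token

    // First occurrence: write the literal string and remember its token.
    // Repeats: write only the token.
    public void Write(BinaryWriter w, string s)
    {
        object token = _tokens[s];
        if (token == null)
        {
            _tokens[s] = _tokens.Count;
            w.Write(true);   // marker: literal follows
            w.Write(s);
        }
        else
        {
            w.Write(false);  // marker: token follows
            w.Write((int)token);
        }
    }

    static void Main()
    {
        MemoryStream ms = new MemoryStream();
        BinaryWriter w = new BinaryWriter(ms);
        StringTokenizer t = new StringTokenizer();

        t.Write(w, "United Kingdom"); // literal: marker + length prefix + 14 chars
        long before = ms.Length;
        t.Write(w, "United Kingdom"); // repeat: marker + 4-byte token
        Console.WriteLine(ms.Length - before); // 5
    }
}
```

For reference-table data full of repeated strings, this is where most of the extra size reduction comes from.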

Cheers Simon

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39588
Joined: 17-Aug-2003
# Posted on: 10-Aug-2006 10:13:14   

simmotech wrote:

Hi Frans - I'm still going on this.... simple_smile

Here are my latest figures: Out of Box: 13,608,188 bytes; serialize in 92.81 seconds; deserialize in 58.00 seconds

Previous Best Yet (assumed I could identify the type using existing LLBLGen meta-data): 5,953,338 bytes; serialize in 1.56 seconds; deserialize in 2.64 seconds (56.25% reduction in size) (97.22% reduction in time)

Current Best: 4,002,491 bytes; serialize in 0.92 seconds; deserialize in 1.30 seconds (70.59% reduction in size) (98.53% reduction in time)

That sounds amazing! simple_smile I haven't had time to check out the code you sent me, sorry for that (templatestudio had to be done first, which was yesterday simple_smile ).

Further optimizations to try:- - Can try to optimize for 'common' values such as zero, one, MinValue, MaxValue. Therefore should be able to store in one byte rather than 2 to 4. - Look for sequences of null within an object[]. Each is taking only 1 byte anyway but if there are runs of nulls (up to 255) can store all of them in 2 bytes.

Be aware that this eventually ends up as a re-implementation of LZH or another compression algorithm wink

I have done minimal testing - just enough to prove the concept. I have two sets of data to serialize. One is a simple collection of the same entity (as per figures above) and one is a subset (2000 entities) with each top level entity having a 1->m relationship and that collection containing a single entry (to test for aggregates). I am using XmlSerializer to write the before-serialization and after-serializations to a string and comparing them for equality.

XmlSerializer? You got that class working on the entity objects? I never could get it to work; it always failed on me due to cyclic references, interface-based types etc.

Questions: 1) Do you have any kind of test harness code and/or sample data available for comparing entities/collection? It would be useful if you do so that I am sure that things like inherited classes and other good stuff that I don't have knowledge about are not compromised. (In any event, I still have a 'safe' serializer that uses the flags to reduce data size but lets the BinarySerializer take care of the object graph)

Not in generic format. When I have to compare entities, I do that in the unittest where I need to compare them.

Inherited classes should work OK, as long as the type the data is deserialized into doesn't rely on the entity collection but on the type the entity had when it got serialized

2) I currently have most of the code in nested classes (in EntityBase2 and EntityCollectionBase2). The reason being that access is required to some private fields. If this is something that you would consider putting into a future version would you use nested private classes or make the private fields internal and move the code to an external class?

In general I avoid nested classes when they access their container's member variables. I'd go for internal properties.

3) One of the assumptions the fast serializer makes is that entities and collections by and large do not have circular references. ie A Collection 'owns' all of its entities and they in turn 'own' any collections/entities stored in their properties. Do you see any problems with this?

As said above, cyclic references are very likely to happen: myOrder.Customer = myCustomer; which makes: myCustomer.Orders.Contains(myOrder) == true

Furthermore, m:n relations are cyclic references by type: Customer contains Employees and Employees contains Customers. So your code should be aware of cyclic references and not rely on the absence of them.
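A cycle-safe serializer therefore has to track already-visited objects, which is essentially what BinaryFormatter's internal object-ID table does. A minimal sketch of the idea (the types and names are illustrative):

```csharp
using System;
using System.Collections;

class Node
{
    public string Name;
    public Node Next;
}

class GraphWalker
{
    // Assign each object an id the first time it is seen; on a revisit,
    // emit only a back-reference instead of recursing forever.
    Hashtable _ids = new Hashtable(); // reference-keyed: Node doesn't override Equals

    public void Walk(Node node)
    {
        if (node == null) return;
        if (_ids.Contains(node))
        {
            Console.WriteLine("backref #" + _ids[node]);
            return;
        }
        _ids[node] = _ids.Count;
        Console.WriteLine("object #" + _ids[node] + ": " + node.Name);
        Walk(node.Next);
    }

    static void Main()
    {
        Node customer = new Node(); customer.Name = "Customer";
        Node order = new Node();    order.Name = "Order";
        customer.Next = order;
        order.Next = customer; // cyclic reference, as in myOrder.Customer

        new GraphWalker().Walk(customer);
        // object #0: Customer
        // object #1: Order
        // backref #0
    }
}
```

Simon's 'tree with a single root' design sidesteps this table for parent back-references by simply ignoring them and rebuilding them on deserialization, but any general graph walker needs the visited check.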

4) All of the above leaves EntityBase2 and EntityCollectionBase2 pretty much unmodified. In fact the only non-static addition I think I have made is an object reference so that the safe serializer can do an IDeserializationCallback (and even that could probably be removed and the same work done via an event). Would you be willing to add serialization-specific method(s) (nothing that would affect any existing code of course!) for further optimization? I don't have anything specific in mind at the moment but I ran a profiler and there seems to be a lot of method calls made when recreating an entity collection for example. If it were possible to add a private or internal method that could do exactly the same from a known initial state without checking events/flags/relations etc. I think this would help.

I'm not that fond of events, as they're slow and can lead to memory leaks if an event subscriber that is meant to live for a shorter time than the event holder doesn't unsubscribe from the event.

There's a method for doing post-deserialization fixups, I'm not sure if you need that one, but you can do things in there.

Frans Bouma | Lead developer LLBLGen Pro