.NET string class alternative

Since I am planning an application that will hold a LOT of its data in memory, I would like to have some kind of 'compact' string class, at least one which stores a string in a format no larger than a zero-terminated ASCII version of the string.

Do you know of any such string class implementation? It should have some utility functions like the original string class.

EDIT:

I need to sort the strings and be able to scan through them, to mention just a few of the operations I will use.

Ideally, it would be source-compatible with System.String, so that a basic search-and-replace across the codebase would be enough to reduce the application's memory footprint.

NUMBERS:

I could have 100k records, each record having up to 10 strings of 30-60 characters. So:

100,000 x 10 x 60 = 60,000,000 characters, i.e. about 57 mega-characters. Why not use 60 MB of RAM for that instead of the 120 MB it takes when every character is stored as two bytes? Operations will be faster, everything will be tighter.

Trees will be used for searching, but they won't help with the regex scans I plan to run.


Solution 1:

EDIT: I now have a blog post on this topic which goes into a fair amount more detail.


Going by your numbers:

I could have 100k records, each record having up to 10 strings of 30-60 characters.

Let's start off by adding in the object overhead - a string takes up about 20 bytes (IIRC - possibly more on a 64-bit CLR) plus the actual data, due to the inevitable object header and the length field. Let's do the maths again:

Using string: 1 million objects at 20+120 bytes = 140MB

Using a new class: 1 million objects at 20+60 bytes = 80MB

Still a 60MB difference of course, but proportionally less than you were expecting. You're only saving 42% of the space instead of 50%.
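
If you want to sanity-check those per-object figures on your own CLR rather than take my word for it, a crude measurement along these lines gives a reasonable ballpark (it forces collections around the allocation, and assumes nothing else is allocating at the same time):

using System;

class StringSizeEstimate
{
    static void Main()
    {
        const int count = 1000000;
        string[] strings = new string[count];

        long before = GC.GetTotalMemory(true);
        for (int i = 0; i < count; i++)
        {
            // distinct 60-character instances, not interned literals
            strings[i] = new string('x', 60);
        }
        long after = GC.GetTotalMemory(true);

        Console.WriteLine("Approx bytes per 60-char string: {0}",
                          (after - before) / count);
        GC.KeepAlive(strings);
    }
}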

Now, you talk about things being faster: given that the CLR is natively aware of string, I suspect a third-party class won't be able to match the speed for some of its operations, and you'd have to put a lot of work in to get many of the others to be the same speed. Admittedly you will have better cache coherency, and if you can ignore culture issues, that should save a bit of time too by making all comparisons ordinal.
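
As a trivial example of what I mean by making comparisons ordinal - if the data really is plain ASCII you can sort with StringComparer.Ordinal and skip the culture-sensitive machinery entirely:

string[] names = { "banana", "Apple", "cherry" };
Array.Sort(names, StringComparer.Ordinal); // byte-by-byte ordering, no culture rules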

For the sake of 60MB, I wouldn't bother. That's a tiny difference these days - consider how many more customers you'll have to gain through making this small saving in order to make up for the significant extra cost of working with two different string types.

Having said all of this, I'm quite tempted to implement it myself anyway as a blogging project like Edulinq. Don't expect any results for weeks or months though :)

EDIT: I've just thought of another problem. The numbers we've got above aren't actually right... because the string class is special. It embeds its data directly in the object - unlike any other data type apart from arrays, the size of a string instance isn't fixed; it varies based on the data within it.

Writing your own AsciiString class, you wouldn't be able to do that - you'd have to embed an array reference within the class:

public class AsciiString
{
    private readonly byte[] data;
}

That means you'd need an extra 4 or 8 bytes for the reference (32 or 64-bit CLR) and the additional overhead of an array object (16 bytes, IIRC) per string.

If you designed it like Java, taking a substring could reuse the existing byte array (two strings could share), but then you'd need an extra length and offset within AsciiString. You'd also lose some of the cache coherency benefits.
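
Just to make that design concrete, here's the kind of shape it might take - purely a sketch of mine, not a finished class:

using System;
using System.Text;

public sealed class AsciiString
{
    private readonly byte[] data;
    private readonly int offset;
    private readonly int length;

    public AsciiString(string text)
    {
        data = Encoding.ASCII.GetBytes(text);
        offset = 0;
        length = data.Length;
    }

    private AsciiString(byte[] data, int offset, int length)
    {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    public int Length { get { return length; } }

    // Substring shares the original byte array instead of copying it
    public AsciiString Substring(int startIndex, int count)
    {
        if (startIndex < 0 || count < 0 || startIndex + count > length)
        {
            throw new ArgumentOutOfRangeException();
        }
        return new AsciiString(data, offset + startIndex, count);
    }

    public override string ToString()
    {
        return Encoding.ASCII.GetString(data, offset, length);
    }
}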

You could use just raw byte arrays as the data structure and write a bunch of extension methods to act on them... but that would be horrible, as then you couldn't tell the difference between a normal byte array and one which was meant to represent an ASCII string.
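
To show what I mean by the extension-method approach (and why it's unappealing) - nothing stops you calling these on a byte[] which was never meant to be ASCII text:

using System;
using System.Text;

public static class AsciiArrayExtensions
{
    public static string ToSystemString(this byte[] ascii)
    {
        return Encoding.ASCII.GetString(ascii);
    }

    // ordinal, byte-by-byte comparison suitable for sorting
    public static int CompareOrdinalTo(this byte[] ascii, byte[] other)
    {
        int min = Math.Min(ascii.Length, other.Length);
        for (int i = 0; i < min; i++)
        {
            if (ascii[i] != other[i])
            {
                return ascii[i] - other[i];
            }
        }
        return ascii.Length - other.Length;
    }
}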

Another possibility would be to create a struct like this:

struct AsciiString
{
    private readonly byte[] data;
    ...
}

That would effectively give you strong typing again, but you'd need to think about things like:

AsciiString x = new AsciiString();

which would end up with a null data reference. You could effectively treat this as if x were a null value, but it would be pretty non-idiomatic.
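
One way you might cope with that - again just a sketch of mine - is to surface the "null-ness" explicitly rather than letting callers hit a NullReferenceException later on:

using System.Text;

public struct AsciiString
{
    private readonly byte[] data;

    public AsciiString(string text)
    {
        data = Encoding.ASCII.GetBytes(text);
    }

    // new AsciiString() / default(AsciiString) leaves data null
    public bool IsNull { get { return data == null; } }

    public override string ToString()
    {
        return data == null ? "" : Encoding.ASCII.GetString(data);
    }
}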

Solution 2:

I've actually had a similar problem, but with somewhat different parameters. My application deals with 2 types of strings - relatively short ones of 60-100 chars, and longer ones of 100-1000 bytes (averaging around 300).

My use case also has to support Unicode text, but a relatively small percentage of the strings actually contain non-English characters.

In my implementation I expose each string property as a native String, but the underlying data structure is a byte[] holding UTF-8 encoded bytes.

My use case also requires searching and sorting through these strings, getting substrings and other common string operations. My dataset measures in the millions.

The basic implementation looks something like this:

private byte[] _myProperty;

public String MyProperty
{
    get
    {
        if (_myProperty == null)
            return null;

        return Encoding.UTF8.GetString(_myProperty);
    }
    set
    {
        // mirror the getter's null handling (GetBytes would throw on null)
        _myProperty = value == null ? null : Encoding.UTF8.GetBytes(value);
    }
}

The performance hit of these conversions, even when searching and sorting, was relatively small (about 10-15%).

This was fine for a while, but I wanted to reduce the overhead further. The next step was to create one merged array for all the strings in a given object (an object would hold either 1 short and 1 long string, or 4 short and 1 long string). So there is a single byte[] per object, plus only 1 byte per string to store its length (my lengths are always < 256). Even if your strings can be longer than 256, an int is still cheaper than the 12-16 byte overhead of a separate byte[].

This removed much of the byte[] overhead and added a little complexity, but had no additional impact on performance (the encoding pass is relatively expensive compared with the array copy involved).

This implementation looks something like this:

private byte _property1;   // length of string 0, in bytes
private byte _property2;   // length of string 1, in bytes
private byte _property3;   // length of string 2, in bytes

// the merged UTF-8 data for all the strings in this object
private byte[] _data;
// I actually used an enum to indicate which property, but I am sure you get the idea
private int GetStartIndex(int propertyIndex)
{
    // a property starts right after the data of all the properties before it
    int result = 0;
    switch (propertyIndex)
    {
        // the fall-through is on purpose (C# needs an explicit goto case for it)
        case 2:
            result += _property2;
            goto case 1;
        case 1:
            result += _property1;
            break;
    }

    return result;
}

private int GetLength(int propertyIndex)
{
    switch (propertyIndex)
    {
        case 0:
            return _property1;
        case 1:
            return _property2;
        case 2:
            return _property3;
    }
    return -1;
}

private String GetString(int propertyIndex)
{
    int startIndex = GetStartIndex(propertyIndex);
    int length = GetLength(propertyIndex);
    byte[] result = new byte[length];
    Array.Copy(_data, startIndex, result, 0, length);

    return Encoding.UTF8.GetString(result);
}

So the getter looks like this:

public String Property1
{
    get { return GetString(0); }
}

The setter is in the same spirit - take the data before startIndex and the data after startIndex + length, build a new array out of the three pieces (dataAtStart + newData + dataAtEnd), and store the new length in the appropriate length field, as sketched below.
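
In rough outline it looks something like this (a simplified reconstruction rather than the exact code; SetLength is a helper mirroring GetLength, _data is assumed to be already initialised, and the new value is assumed to still fit in a byte):

private void SetString(int propertyIndex, String value)
{
    byte[] newBytes = Encoding.UTF8.GetBytes(value);

    int startIndex = GetStartIndex(propertyIndex);
    int oldLength = GetLength(propertyIndex);
    int tailStart = startIndex + oldLength;
    int tailLength = _data.Length - tailStart;

    // rebuild the merged array: [data before] + [new value] + [data after]
    byte[] newData = new byte[startIndex + newBytes.Length + tailLength];
    Array.Copy(_data, 0, newData, 0, startIndex);
    Array.Copy(newBytes, 0, newData, startIndex, newBytes.Length);
    Array.Copy(_data, tailStart, newData, startIndex + newBytes.Length, tailLength);

    _data = newData;
    SetLength(propertyIndex, (byte)newBytes.Length);
}

private void SetLength(int propertyIndex, byte length)
{
    switch (propertyIndex)
    {
        case 0: _property1 = length; break;
        case 1: _property2 = length; break;
        case 2: _property3 = length; break;
    }
}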

I was still not happy with the memory saved, or with the very manual labor of implementing this for each property, so I built an in-memory compressed paging system that uses the amazingly fast QuickLZ to compress a full page. This gives a lot of control over the time-memory tradeoff (which is essentially the page size).

The compression ratio for my use case (compared with the already more efficient byte[] store) approaches 50% (!). I used a page size of approximately 10 strings per page and grouped similar properties together (they tend to have similar data). This added an extra 10-20% overhead (on top of the encoding/decoding pass, which is still required). The paging mechanism caches recently accessed pages up to a configurable size. Even without compression, this implementation lets you put a fixed bound on the overhead for each page. The major downside of my current page cache implementation is that with compression it is not thread-safe (without compression there is no such problem).
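
To give a flavour of the structure, here is a stripped-down sketch of the idea (not the actual code - I'm using GZipStream as a stand-in for QuickLZ, a naive FIFO eviction instead of the real cache policy, and the class/member names are made up for the example):

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// each page holds the merged byte[] blobs of a handful of objects;
// only recently used pages are kept in decompressed form
public class CompressedPageStore
{
    private readonly Dictionary<int, byte[]> _compressedPages = new Dictionary<int, byte[]>();
    private readonly Dictionary<int, byte[]> _cache = new Dictionary<int, byte[]>();
    private readonly Queue<int> _cacheOrder = new Queue<int>();
    private readonly int _maxCachedPages;

    public CompressedPageStore(int maxCachedPages)
    {
        _maxCachedPages = maxCachedPages;
    }

    public void StorePage(int pageIndex, byte[] rawPage)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(rawPage, 0, rawPage.Length);
            }
            _compressedPages[pageIndex] = output.ToArray();
        }
    }

    public byte[] LoadPage(int pageIndex)
    {
        byte[] page;
        if (_cache.TryGetValue(pageIndex, out page))
            return page;

        using (var input = new MemoryStream(_compressedPages[pageIndex]))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            page = output.ToArray();
        }

        // simple FIFO eviction; note this cache is not thread-safe
        _cache[pageIndex] = page;
        _cacheOrder.Enqueue(pageIndex);
        if (_cacheOrder.Count > _maxCachedPages)
            _cache.Remove(_cacheOrder.Dequeue());

        return page;
    }
}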

If you're interested in the compressed paging mechanism let me know (I've been looking for an excuse to open source it).