What is a recommended architecture for providing storage for a dynamic logical database schema?

To clarify: Where a system is required to provide storage for a model whose schema may be extended or altered by its users once in production, what are some good technologies, database models or storage engines that will allow this?

A few possibilities to illustrate:

  • Creating/altering database objects via dynamically generated DML
  • Creating tables with large numbers of sparse physical columns and using only those required for the 'overlaid' logical schema
  • Creating a 'long, narrow' table that stores dynamic column values as rows that then need to be pivoted to create a 'short, wide' rowset containing all the values for a specific entity
  • Using a BigTable/SimpleDB PropertyBag type system

Any answers based on real world experience would be greatly appreciated


Solution 1:

What you are proposing is not new. Plenty of people have tried it... most have found that they chase "infinite" flexibility and instead end up with much, much less than that. It's the "roach motel" of database designs -- data goes in, but it's almost impossible to get it out. Try and conceptualize writing the code for ANY sort of constraint and you'll see what I mean.

The end result typically is a system that is MUCH more difficult to debug, maintain, and full of data consistency problems. This is not always the case, but more often than not, that is how it ends up. Mostly because the programmer(s) don't see this train wreck coming and fail to defensively code against it. Also, often ends up the case that the "infinite" flexibility really isn't that necessary; it's a very bad "smell" when the dev team gets a spec that says "Gosh I have no clue what sort of data they are going to put here, so let 'em put WHATEVER"... and the end users are just fine having pre-defined attribute types that they can use (code up a generic phone #, and let them create any # of them -- this is trivial in a nicely normalized system and maintains flexibility and integrity!)

If you have a very good development team and are intimately aware of the problems you'll have to overcome with this design, you can successfully code up a well designed, not terribly buggy system. Most of the time.

Why start out with the odds stacked so much against you, though?

Don't believe me? Google "One True Lookup Table" or "single table design". Some good results: http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056

http://thedailywtf.com/Comments/Tom_Kyte_on_The_Ultimate_Extensibility.aspx?pg=3

http://www.dbazine.com/ofinterest/oi-articles/celko22

http://thedailywtf.com/Comments/The_Inner-Platform_Effect.aspx?pg=2

Solution 2:

A strongly typed xml field in MSSQL has worked for us.

Solution 3:

Like some others have said, don't do this unless you have no other choice. One case where this is required is if you are selling an off-the-shelf product that must allow users to record custom data. My company's product falls into this category.

If you do need to allow your customers to do this, here are a few tips:
- Create a robust administrative tool to perform the schema changes, and do not allow these changes to be made any other way.
- Make it an administrative feature; don't allow normal users to access it.
- Log every detail about every schema change. This will help you debug problems, and it will also give you CYA data if a customer does something stupid.

If you can do those things successfully (especially the first one), then any of the architectures you mentioned will work. My preference is to dynamically change the database objects, because that allows you to take advantage of your DBMS's query features when you access the data stored in the custom fields. The other three options require you load large chunks of data and then do most of your data processing in code.

Solution 4:

I have a similar requirement and decided to use the schema-less MongoDB.

MongoDB (from "humongous") is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language. (Wikipedia)

Highlights:

  • has rich query functionality (maybe the closest to SQL DBs)
  • production ready (foursquare, sourceforge use it)

Lowdarks (stuff you need to understand, so you can use mongo correctly):

  • no transactions (actually it has transactions but only on atomic operations)
  • this stuff here: http://ethangunderson.com/blog/two-reasons-to-not-use-mongodb/
  • durability .. mostly ACID related stuff

Solution 5:

I did it ones in a real project:

The database consisted of one table with one field which was an array of 50. It had a 'word' index set on it. All the data was typeless so the 'word index' worked as expected. Numeric fields were represented as characters and the actual sorting had been done at client side. (It still possible to have several array fields for each data type if needed).

The logical data schema for logical tables was held within the same database with different table row 'type' (the first array element). It also supported simple versioning in copy-on-write style using same 'type' field.

Advantages:

  1. You can rearrange and add/delete your columns dynamically, no need for dump/reload of database. Any new column data may be set to initial value (virtually) in zero time.
  2. Fragmentation is minimal, since all records and tables are same size, sometimes it gives better performance.
  3. All table schema is virtual. Any logical schema stucture is possible (even recursive, or object-oriented).
  4. It is good for "write-once, read-mostly, no-delete/mark-as-deleted" data (most Web apps actually are like that).

Disadvantages:

  1. Indexing only by full words, no abbreviation,
  2. Complex queries are possible, but with slight performance degradation.
  3. Depends on whether your preferred database system supports arrays and word indexes (it was inplemented in PROGRESS RDBMS).
  4. Relational model is only in programmer's mind (i.e. only at run-time).

And now I'm thinking the next step could be - to implement such a database on the file system level. That might be relatively easy.