Life without JOINs... understanding and common practices

Lots of "BAW"s (big ass-websites) are using data storage and retrieval techniques that rely on huge tables with indexes, and using queries that won't/can't use JOINs in their queries (BigTable, HQL, etc) to deal with scalability and sharding databases. How does that work when you have lots and lots of data that is very related?

I can only speculate that much of this joining has to be done on the application side of things, but doesn't that start to get expensive? What if you have to make several queries to several different tables to get the information you need to compile? Doesn't hitting the database that many times get more expensive than just using joins in the first place? I guess it depends on how much data you've got?
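
To make that concrete, here's roughly what I picture "joining in application code" to look like -- a hypothetical sketch where the orders/customers tables, the column names, and the `db` connection (any DB-API connection, e.g. sqlite3) are all just made up for illustration:

```python
# Hypothetical sketch of an "application-side join": instead of one SQL
# JOIN, the application issues one query per table and stitches the
# results together in memory.

def get_orders_with_customers(db):
    # Query 1: fetch the orders.
    orders = db.execute("SELECT id, customer_id, total FROM orders").fetchall()
    if not orders:
        return []

    # Query 2: fetch only the customers those orders refer to.
    customer_ids = {o[1] for o in orders}
    placeholders = ",".join("?" for _ in customer_ids)
    customers = db.execute(
        f"SELECT id, name FROM customers WHERE id IN ({placeholders})",
        tuple(customer_ids),
    ).fetchall()
    names_by_id = {c[0]: c[1] for c in customers}

    # The "join" happens here, in application code.
    return [
        {"order_id": o[0], "total": o[2], "customer": names_by_id.get(o[1])}
        for o in orders
    ]
```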

And for commonly available ORMs, how do they tend to deal with the inability to use joins? Is there support for this in ORMs that are in heavy use today? Or do most projects that have to deal with data at this level tend to roll their own anyway?

So this is not applicable to any current project I'm doing, but it's something that's been in my head for several months now, and I can only speculate as to what the "best practices" are. I've never had a need to address this in any of my projects because they've never reached a scale where it's required. Hopefully this question helps other people as well.

As someone said below, ORMs "don't work" without joins. Are there other data access layers that are already available to developers working with data on this level?

EDIT: For some clarification, Vinko Vrsalovic said:

"I believe snicker is wants to talk about NO-SQL, where transactional data is denormalized and used in Hadoop or BigTable or Cassandra schemes."

This is indeed what I'm talking about.

Bonus points for those who catch the xkcd reference.


Solution 1:

The way I look at it, a relational database is a general-purpose tool to hedge your bets. Modern computers are fast enough, and RDBMSs are well-optimized enough, that you can grow to quite a respectable size on a single box. By choosing an RDBMS you are giving yourself very flexible access to your data, and the ability to have powerful correctness constraints that make it much easier to code against the data. However, the RDBMS is not going to be a good optimization for any particular problem; it just gives you the flexibility to change problems easily.

If you start growing rapidly and realize you are going to have to scale beyond the size of a single DB server, you suddenly have much harder choices to make. You will need to start identifying bottlenecks and removing them. The RDBMS is going to be one nasty snarled knot of codependency that you'll have to tease apart. The more interconnected your data, the more work you'll have to do, but maybe you won't have to completely disentangle the whole thing. If you're read-heavy, maybe you can get by with simple replication. If you're saturating your market and growth is leveling off, maybe you can partially denormalize and shard to a fixed number of DB servers. Maybe you just have a handful of problem tables that can be moved to a more scalable data store. Maybe your usage profile is very cache-friendly and you can just migrate the load to a giant memcached cluster.
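
As a rough illustration of the "shard to a fixed number of DB servers" option, the routing side is often as simple as hashing a shard key. This is a hedged sketch, not any particular product's API; the host names and the `shard_for` helper are invented:

```python
import hashlib

# Hypothetical sketch: route each user to one of a fixed set of database
# servers by hashing a shard key. Single-user queries hit exactly one
# shard; queries (and JOINs) that span users become much harder.

SHARDS = ["db0.example.com", "db1.example.com",
          "db2.example.com", "db3.example.com"]

def shard_for(user_id: str) -> str:
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))  # always the same shard for this user
```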

Where scalable key-value stores like BigTable come in is when none of the above can work, and you have so much data of a single type that even when it's denormalized a single table is too much for one server. At this point you need to be able to partition it arbitrarily and still have a clean API to access it. Naturally, when the data is spread out across so many machines you can't have algorithms that require these machines to talk to each other much, which many of the standard relational algorithms would require. As you suggest, these distributed querying algorithms have the potential to require more total processing power than the equivalent JOIN in a properly indexed relational database, but because they are parallelized, the real-time performance is orders of magnitude better than any single machine could do (assuming a machine that could hold the entire index even exists).
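
To make the "denormalized and partitioned by key" idea concrete, here is a hedged sketch of what such a record might look like: everything you would otherwise JOIN for is embedded in one value and fetched by one key. The `store` dict stands in for a real system like BigTable or Cassandra, whose data models are of course richer than this:

```python
import json

# Hypothetical sketch of a denormalized key-value record: instead of
# normalized users/addresses/orders tables joined at read time, the data
# a page needs is pre-assembled into one blob keyed by user id, so a
# read is a single lookup on a single partition.

store = {}  # stand-in for a distributed key-value store

def save_user_view(user_id, profile, recent_orders):
    record = {
        "profile": profile,              # embedded, not a foreign key
        "recent_orders": recent_orders,  # duplicated from the order data
    }
    store[f"user_view:{user_id}"] = json.dumps(record)

def load_user_view(user_id):
    return json.loads(store[f"user_view:{user_id}"])

save_user_view("42", {"name": "Ada"}, [{"order_id": "o1", "total": 12.5}])
print(load_user_view("42")["profile"]["name"])  # -> Ada
```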

Now, once you can scale your massive data set horizontally (by just plugging in more servers), the hard part of scalability is done. Well, I shouldn't say done, because ongoing operations and development at this scale are a lot harder than for a single-server app, but the point is that application servers are typically trivial to scale via a shared-nothing architecture as long as they can get the data they need in a timely fashion.

To answer your question about how commonly used ORMs handle the inability to use JOINs, the short answer is that they don't. ORM stands for Object-Relational Mapping, and most of the job of an ORM is translating the powerful relational paradigm of predicate logic into simple object-oriented data structures. Most of the value they give you is simply not going to be possible with a key-value store. In practice you will probably need to build and maintain your own data-access layer suited to your particular needs, because data profiles at these scales vary dramatically, and I believe there are too many tradeoffs for a general-purpose tool to emerge and become dominant the way RDBMSs have. In short, you'll always have to do more legwork at this scale.
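
To give a feel for what such a hand-rolled data-access layer might look like (a sketch under my own assumptions, not an existing library; the `UserStore` class and key naming are invented), it typically wraps the store's get/put primitives behind the few access paths the application actually needs, maintaining its own secondary "indexes" on write:

```python
# Hypothetical sketch of a hand-rolled data-access layer over a key-value
# store: no generic query language, just the specific access paths the
# application needs, with secondary "indexes" maintained by hand on write.

class UserStore:
    def __init__(self, kv):
        self.kv = kv  # anything dict-like: get/set by key

    def put_user(self, user_id, name, email):
        user = {"id": user_id, "name": name, "email": email}
        self.kv[f"user:{user_id}"] = user
        # The store won't index by email for us, so we store a pointer.
        self.kv[f"user_by_email:{email}"] = user_id

    def get_user(self, user_id):
        return self.kv.get(f"user:{user_id}")

    def get_user_by_email(self, email):
        user_id = self.kv.get(f"user_by_email:{email}")
        return self.get_user(user_id) if user_id is not None else None

users = UserStore({})
users.put_user("1", "Ada", "ada@example.com")
print(users.get_user_by_email("ada@example.com")["name"])  # -> Ada
```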

That said, it will definitely be interesting to see what kind of relational or other aggregate functionality can be built on top of key-value store primitives. I don't really have enough experience here to comment specifically, but there is a lot of knowledge about this in enterprise computing going back many years (e.g. Oracle), a lot of untapped theoretical knowledge in academia, and a lot of practical knowledge at Google, Amazon, Facebook, et al. However, the knowledge that has filtered out into the wider development community is still fairly limited.

However, now that a lot of applications are moving to the web, and more and more of the world's population is online, inevitably more and more applications will have to scale, and best practices will begin to crystallize. The knowledge gap will be whittled down from both sides by cloud services like AppEngine and EC2, as well as open-source databases like Cassandra. In some sense this goes hand in hand with parallel and asynchronous computation, which is also in its infancy. Definitely a fascinating time to be a programmer.

Solution 2:

You're starting from a faulty assumption.

Data warehousing does not normalize data the same way that a transactional application does. There are not "lots" of joins. There are relatively few.

In particular, second and third normal form violations are not a "problem", since data warehouses are rarely updated. And when they are updated, it's generally only a status-flag change to mark a dimension row as "current" vs. "not current".

Since you don't have to worry about updates, you don't have to decompose things down to the 2NF/3NF level where an update can't lead to anomalous relationships. No updates means no anomalies; no anomalies means no decomposition, and no decomposition means no joins. You can pre-join everything.
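
A hedged sketch of that "status flag change instead of an update" pattern, with invented table and column names (SQLite is used only to keep the example self-contained):

```python
import sqlite3
from datetime import date

# Hypothetical sketch of "insert instead of update": when a customer's
# city changes, the current dimension row only has its status flag
# flipped, and a new versioned row is inserted.

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key used by fact rows
    customer_id  TEXT,                 -- business key
    city         TEXT,
    valid_from   TEXT,
    is_current   INTEGER)""")

def record_city_change(customer_id, new_city):
    # The only "update" the warehouse sees: retire the current version...
    conn.execute("UPDATE customer_dim SET is_current = 0 "
                 "WHERE customer_id = ? AND is_current = 1", (customer_id,))
    # ...then insert a new current version. Old fact rows keep pointing
    # at the old surrogate key, so history is preserved.
    conn.execute("INSERT INTO customer_dim "
                 "(customer_id, city, valid_from, is_current) VALUES (?, ?, ?, 1)",
                 (customer_id, new_city, date.today().isoformat()))
```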

Generally, DW data is organized according to a star schema. This guides you to decompose the data into numeric "fact" tables that contain the measures -- numbers with units -- and foreign key references to the dimensions.

A dimension (or "business entity") is best thought of as a real-world thing with attributes. Often this includes things like geography, time, product, customer, etc. These things often have complex hierarchies. The hierarchies are usually arbitrary, defined by various business reporting needs, and not modeled as separate tables, but simply as columns in the dimension table that are used for aggregation.
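
As a minimal, hypothetical illustration of that decomposition (table and column names are invented; SQLite is used only to keep it self-contained), the fact table holds measures plus foreign keys, the dimensions stay flat, and a typical report needs only a small, fixed set of joins:

```python
import sqlite3

# Hypothetical minimal star schema: a fact table holding only measures
# plus foreign keys to flat dimension tables, whose hierarchy levels
# (day -> month -> year, store -> region) are plain columns used for
# grouping rather than separately normalized tables.

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key   INTEGER PRIMARY KEY,
    full_date  TEXT,
    month      TEXT,
    quarter    TEXT,
    year       INTEGER
);
CREATE TABLE store_dim (
    store_key  INTEGER PRIMARY KEY,
    store_name TEXT,
    city       TEXT,
    region     TEXT
);
CREATE TABLE sales_fact (
    date_key    INTEGER REFERENCES date_dim(date_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    units_sold  INTEGER,  -- measure
    revenue_usd REAL      -- measure: a number with units
);
""")

# A typical report is one aggregation over the fact table, with only a
# small, fixed number of joins out to the dimensions:
REPORT_SQL = """
SELECT d.year, s.region, SUM(f.revenue_usd)
FROM sales_fact f
JOIN date_dim  d ON d.date_key  = f.date_key
JOIN store_dim s ON s.store_key = f.store_key
GROUP BY d.year, s.region
"""
```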


To address some of your questions.

"this joining has to be done on the application side of things". Kind of. The data is "pre-joined" prior to being loaded. The dimension data is often a join of relevant source data about that dimension. It's joined and loaded as a relatively flat structure.

It isn't updated. Instead of updates, additional historical records are inserted.

"but doesn't that start to get expensive?". Kind of. It takes some care to get the data loaded. However, there aren't a lot of reporting/analysis joins. The data is pre-joined.

The ORM issues are largely moot, since the data is pre-joined. Your ORM maps to the fact or dimension table as appropriate. Except in special cases, dimensions tend to be small-ish and fit entirely in memory. The exception is when you're in finance (banking or insurance) or public utilities and have massive customer databases. Those customer dimensions rarely fit in memory.
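
To show what "the dimension fits entirely in memory" buys you, a hedged sketch with invented names: load the dimension once into a dict and enrich fact rows with constant-time lookups rather than a join at query time.

```python
# Hypothetical sketch: a small dimension is loaded once into a dict keyed
# by its surrogate key, and fact rows are "joined" to it with constant-time
# lookups in application code instead of a join at query time.

def load_dimension(rows):
    # rows: iterable of (store_key, store_name, region) tuples
    return {key: {"store_name": name, "region": region}
            for key, name, region in rows}

def enrich_facts(fact_rows, store_dim):
    # fact_rows: iterable of (store_key, revenue) tuples
    for store_key, revenue in fact_rows:
        dim = store_dim[store_key]
        yield {"store": dim["store_name"],
               "region": dim["region"],
               "revenue": revenue}

store_dim = load_dimension([(1, "Downtown", "West"), (2, "Airport", "East")])
print(list(enrich_facts([(1, 100.0), (2, 250.0)], store_dim)))
```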