Advanced data structures in practice

In the 10 years I've been programming, I can count the number of data structures I've used on one hand: arrays, linked lists (I'm lumping stacks and queues in with this), and dictionaries. This isn't really surprising given that nearly all of the applications I've written fall into the forms-over-data / CRUD category.

I've never needed to use red-black trees, skip lists, double-ended queues, circularly linked lists, priority queues, heaps, graphs, or any of the dozens of exotic data structures that have been researched in the past 50 years. I feel like I'm missing out.

This is an open-ended question, but where are these "exotic" data structures used in practice? Does anyone have any real-world experience using these data structures to solve a particular problem?


Some examples. They're vague because they were work done for employers:

  • A heap to get the top N results in a Google-style search. (Starting from candidates in an index, go through them all linearly, sifting them through a min-heap of max size N; there's a sketch of this after the list.) This was for an image-search prototype.

  • Bloom filters cut the size of certain data about what millions of users had seen down to an amount that'd fit in existing servers (it all had to be in RAM for speed); the original design would have needed many new servers just for that database.

  • A triangular array representation halved the size of a dense symmetric matrix for a recommendation engine (RAM again, for the same reason): since entry (i, j) equals entry (j, i), you only store the entries with i >= j, indexed at i*(i+1)/2 + j.

  • Users had to be grouped according to certain associations; union-find made this easy, quick, and exact instead of slow, hacky, and approximate. (A sketch follows the list.)

  • An app for choosing retail sites according to drive time for people in the neighborhood used Dijkstra's shortest-path algorithm with priority queues. Other GIS work took advantage of quadtrees and Morton indexes.
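
For the heap case, the whole trick fits in a few lines. This is a rough sketch, not the original code; the scored_candidates stream and the (score, doc_id) pairs are made up for illustration:

    import heapq

    def top_n(scored_candidates, n):
        # Keep the n best items seen so far in a min-heap, so the root is
        # always the worst of the current best. One linear pass, O(log n)
        # work per candidate, O(n) memory no matter how many candidates.
        heap = []
        for score, doc_id in scored_candidates:
            if len(heap) < n:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                # Better than the worst of the best: evict and replace.
                heapq.heapreplace(heap, (score, doc_id))
        return sorted(heap, reverse=True)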

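And the union-find grouping is about as small. A sketch with path compression and union by size, which together make each operation effectively constant time:

    class DisjointSet:
        def __init__(self):
            self.parent = {}
            self.size = {}

        def find(self, x):
            # Create singleton sets lazily for unseen items.
            if x not in self.parent:
                self.parent[x] = x
                self.size[x] = 1
            # Path halving: point nodes at their grandparents as we walk up.
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            # Union by size: hang the smaller tree under the larger one.
            if self.size[ra] < self.size[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra
            self.size[ra] += self.size[rb]

Grouping users is then one union(a, b) call per association pair, followed by bucketing on find(user).
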
Knowing what's out there in data-structures-land comes in handy -- "weeks in the lab can save you hours in the library". The bloom-filter case was only worthwhile because of the scale: if the problem had come up at a startup instead of Yahoo, I'd have used a plain old hashtable. The other examples I think are reasonable anywhere (though nowadays you're less likely to code them yourself).
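
For reference, the core of a Bloom filter is also tiny. A minimal sketch, with made-up sizing constants; a real deployment derives the bit count and hash count from the expected number of items and the target false-positive rate:

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1 << 20, k_hashes=5):  # made-up defaults
            self.m = m_bits
            self.k = k_hashes
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item):
            # Double hashing: derive k bit positions from two 64-bit halves
            # of one SHA-256 digest. Assumes string items.
            digest = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            for i in range(self.k):
                yield (h1 + i * h2) % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            # May report a false positive, never a false negative.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))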


B-trees are the workhorse of databases: nearly every relational database (and many filesystems) stores its indexes as B-trees or B+-trees, because their high fan-out keeps data sorted while minimizing disk accesses.

R-trees are for geographic searches (e.g. given 10,000 shapes, each with a bounding box, scattered around a 2-D plane: which of these shapes intersect an arbitrary query box B?). They answer that efficiently by grouping nearby boxes under nested bounding boxes, so whole subtrees can be skipped during a search.
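
If you want to try this from Python, here's a sketch assuming the third-party rtree package (a wrapper around libspatialindex); the shape ids and coordinates are invented:

    from rtree import index

    idx = index.Index()
    shapes = {0: (0, 0, 10, 10), 1: (5, 5, 15, 15), 2: (100, 100, 110, 110)}
    for shape_id, bbox in shapes.items():
        idx.insert(shape_id, bbox)  # bbox = (minx, miny, maxx, maxy)

    # Which bounding boxes intersect the query box? Here: ids 0 and 1.
    hits = list(idx.intersection((8, 8, 12, 12)))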

Deques of the form in the C++ STL are growable array-like containers: more memory-efficient than linked lists, with constant-time insertion/removal at both ends and constant-time access to ("peeking" at) arbitrary elements in the middle. As far as I can remember, I've never used a deque to its full extent (insert/delete at both ends), but it's general enough that you can use it as a stack (insert/delete at one end) or a queue (insert at one end, delete from the other) and still have high-performance access to arbitrary elements in the middle.
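
In Python, collections.deque covers the stack and queue uses directly, with one caveat worth flagging:

    from collections import deque

    d = deque([1, 2, 3])
    d.append(4)          # push on the right  (stack push / queue enqueue)
    d.appendleft(0)      # push on the left
    d.pop()              # -> 4, from the right (stack pop)
    d.popleft()          # -> 0, from the left  (queue dequeue)

    # Indexing the middle works, but unlike C++'s std::deque it costs
    # O(n) here: Python's deque is a linked list of blocks, not an
    # indexable array of blocks.
    middle = d[len(d) // 2]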

I've just finished reading Java Generics and Collections -- the "generics" part hurts my head, but the collections part was useful, and they point out some of the differences between skip lists and trees (both can implement maps/sets): skip lists give you built-in constant-time iteration from one element to the next (trees are O(log n) per step), and skip lists are much simpler for implementing lock-free algorithms in multithreaded situations.
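
To make the iteration point concrete, here's a minimal single-threaded skip list sketch (MAX_LEVEL and the 0.5 coin-flip probability are conventional textbook choices; a lock-free version would replace the pointer updates with compare-and-swap):

    import random

    class SkipNode:
        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * level  # forward[i]: next node at level i

    class SkipList:
        MAX_LEVEL = 16

        def __init__(self):
            self.head = SkipNode(None, self.MAX_LEVEL)
            self.level = 1

        def _random_level(self):
            lvl = 1
            while random.random() < 0.5 and lvl < self.MAX_LEVEL:
                lvl += 1
            return lvl

        def insert(self, key):
            # Find the rightmost node before `key` on every level.
            update = [self.head] * self.MAX_LEVEL
            node = self.head
            for i in range(self.level - 1, -1, -1):
                while node.forward[i] and node.forward[i].key < key:
                    node = node.forward[i]
                update[i] = node
            lvl = self._random_level()
            self.level = max(self.level, lvl)
            new = SkipNode(key, lvl)
            for i in range(lvl):
                new.forward[i] = update[i].forward[i]
                update[i].forward[i] = new

        def __iter__(self):
            # The book's claim in action: stepping to the next element
            # is a single pointer follow on the bottom level.
            node = self.head.forward[0]
            while node:
                yield node.key
                node = node.forward[0]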

Priority queues are used for scheduling, event-driven simulation, and graph algorithms like Dijkstra's, among other things; heaps are usually used to implement them. I've also found that heapsort is (for me at least) the easiest of the O(n log n) sorts to understand and implement; there's a sketch below.
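
Here's the textbook in-place heapsort, as a sketch: build a max-heap bottom-up, then repeatedly swap the root (the max) to the end of the shrinking unsorted region:

    def sift_down(a, start, end):
        # Restore the max-heap property for the subtree rooted at `start`,
        # considering only indexes up to `end`.
        root = start
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child + 1 <= end and a[child] < a[child + 1]:
                child += 1  # pick the larger child
            if a[root] < a[child]:
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    def heapsort(a):
        n = len(a)
        # Heapify in place, bottom-up: O(n).
        for start in range(n // 2 - 1, -1, -1):
            sift_down(a, start, n - 1)
        # Pull the max off the heap n times: O(n log n).
        for end in range(n - 1, 0, -1):
            a[0], a[end] = a[end], a[0]
            sift_down(a, 0, end - 1)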


They are often used behind the scenes in libraries. For example, an ordered dictionary data structure (i.e. an associative array that allows sorted traversal by keys) is as likely as not to be implemented using a red-black tree: Java's TreeMap is one documented example, and C++'s std::map is typically implemented the same way.

Many data structures (splay trees come to mind) are interesting for their near-optimal behaviour under particular access patterns (temporal locality of reference, in the case of splay trees), so they are mainly relevant in those situations. In most cases the real benefit of a working knowledge of these data structures is being able to pick the right one for the circumstances, with a reasonable understanding of how it will behave.

Take sorting, for example:

  • Quicksort, or a modified quicksort that drops to another method (typically insertion sort) when the individual segments get small enough, is typically the fastest sorting algorithm for general-purpose use. However, a naive quicksort (e.g. one that always picks the first element as the pivot) degrades to O(n^2) on nearly-sorted data. (A sketch of the hybrid follows this list.)

  • The main advantage of heapsort is that it runs in situ with only constant intermediate storage, which makes it quite good for memory-constrained systems. While it is slower on average (though still O(n log n)), it does not suffer from quicksort's poor worst-case performance.

  • A third example is merge sort, which reads and writes its data strictly sequentially, making it the best choice for sorting data sets much larger than main memory. Sorting this way is also called an 'external sort', meaning you can use external storage (disk or tape) for intermediate results. (A second sketch follows the list.)
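
Here's a sketch of the hybrid quicksort from the first bullet. The cutoff of 16 is a made-up placeholder (real libraries tune it empirically), and the random pivot is one common guard against the nearly-sorted worst case:

    import random

    CUTOFF = 16  # placeholder; below this size, insertion sort wins

    def _insertion_sort(a, lo, hi):
        for i in range(lo + 1, hi + 1):
            x = a[i]
            j = i - 1
            while j >= lo and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x

    def hybrid_quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        while lo < hi:
            if hi - lo < CUTOFF:
                _insertion_sort(a, lo, hi)
                return
            pivot = a[random.randint(lo, hi)]
            i, j = lo, hi
            while i <= j:  # Hoare-style partition around the pivot value
                while a[i] < pivot:
                    i += 1
                while a[j] > pivot:
                    j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1
                    j -= 1
            # Recurse into the smaller side and loop on the larger one,
            # keeping the stack depth O(log n).
            if j - lo < hi - i:
                hybrid_quicksort(a, lo, j)
                lo = i
            else:
                hybrid_quicksort(a, i, hi)
                hi = j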

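And a minimal external sort in the same spirit, assuming line-oriented text input that ends with a newline (the chunk size and paths are placeholders): sort chunks in RAM, spill each sorted run to a temporary file, then k-way merge the runs.

    import heapq
    import itertools
    import tempfile

    def external_sort(input_path, output_path, chunk_lines=100_000):
        # Phase 1: sort chunks that fit in RAM, spill each to a temp file.
        runs = []
        with open(input_path) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        # Phase 2: k-way merge. heapq.merge streams its inputs lazily,
        # so only one line per run is in memory at any moment.
        with open(output_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))
        for run in runs:
            run.close()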

It depends on the level of abstraction that you work at.

My experience is similar to yours: at the level of abstraction where most software development happens today, Dictionary and List are the main data structures we use.

I think if you look at lower-level code you will see more of the "exotic" data structures.