How to scale out OpenStreetMap data efficiently
For over a year now, I have been running an in-house PostGIS server filled with OSM data, used for both Mapnik-based tile generation and Nominatim-based geocoding, and updated with daily replication diffs. This works pretty well.
However, as usage is growing exponentially, I would like to achieve better reliability and performance by adding more PostgreSQL servers. And I'm kind of lost.
Since PostgreSQL doesn't seem to handle replication by itself, I would think about using a piece of middleware like PgPool-II to keep the servers in sync. But I'm afraid it would be overkill for this usage: a very high read-to-write ratio, where all writes are done at the same exact time every day.
My questions are simple: what would you do to keep these servers in sync? And what is done for this at the OpenStreetMap Foundation, MapQuest, Mapbox or CloudMade?
Thanks.
Solution 1:
Firstly, Postgres 9 and later do have built-in replication support, and that is what we (the OpenStreetMap operations team) expect to look at using now that we are (as of about 90 minutes ago) running on Postgres 9.1 for the main database.
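Once that replication is configured, its status can be read from any client. Here is a minimal Python sketch using psycopg2 (the host names, database name and user are made-up placeholders, not our production setup); pg_stat_replication and pg_is_in_recovery() are the relevant system view and function in 9.1:

    import psycopg2

    # On the master: list connected standbys and how far each has replayed.
    # pg_stat_replication is available from Postgres 9.1 onwards.
    master = psycopg2.connect(host="db-master", dbname="osm", user="osm")
    cur = master.cursor()
    cur.execute("SELECT client_addr, state, replay_location"
                " FROM pg_stat_replication;")
    for client_addr, state, replay_location in cur.fetchall():
        print(client_addr, state, replay_location)

    # On a standby: pg_is_in_recovery() stays true while it replays WAL.
    standby = psycopg2.connect(host="db-standby", dbname="osm", user="osm")
    cur = standby.cursor()
    cur.execute("SELECT pg_is_in_recovery();")
    print("standby in recovery:", cur.fetchone()[0])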
To answer your question as to what OSM does: to date we don't do any kind of replication. There is one large Postgres database acting as the master database behind the web site and holding the raw data. There are then secondary Postgres+PostGIS databases for rendering and for geocoding, both of which are populated using the planet dumps and kept current with the diffs.
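To illustrate the dumps-plus-diffs workflow, a secondary rendering database can be kept up to date along the following lines. This is only a sketch of the general approach, not our exact tooling; the working directory and database name are assumptions, and it presumes the osmosis replication state has already been initialised:

    import subprocess

    WORKDIR = "/var/lib/replication"       # osmosis replication state (assumed)
    CHANGES = WORKDIR + "/changes.osc.gz"  # merged change file for this run

    # Fetch all pending diffs and merge them into a single change file.
    subprocess.check_call([
        "osmosis",
        "--read-replication-interval", "workingDirectory=" + WORKDIR,
        "--simplify-change",
        "--write-xml-change", "file=" + CHANGES,
    ])

    # Apply the changes to the existing PostGIS database: --append updates
    # in place, --slim keeps the intermediate tables that updates need.
    subprocess.check_call([
        "osm2pgsql", "--append", "--slim", "--database", "gis", CHANGES,
    ])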
Solution 2:
I can see a few options here:
1. Database pooling and replication (http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling). I would go for PGCluster, as it is synchronous and permits load balancing. You can also go with a master-slaves architecture, without connection pooling and asynchronous; it all depends on the architecture and the replication speed you want to achieve. More info about pgpool-II in this small FAQ: http://www.dalibo.org/_media/pgpool.pdf. A minimal read/write-splitting sketch follows this list.
2. Web optimisation. As far as I know, MapQuest and the OSM Foundation use web caching at the server level plus connection pooling, which means far fewer requests ever reach Postgres/PostGIS, since some data is served from the cache.
3. PostGIS optimisations: pre-caching queries at the DB level, indexing, and database normalization (a basic spatial-index example also follows this list). I would go for "PostGIS in Action" (http://manning.com/obe/) to learn more about PostGIS data types and how to adapt your server to them.
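To make the first point concrete, read/write splitting is essentially what a pooler such as pgpool-II automates for you: the daily bulk writes go to the master, everything else is spread over the slaves. A minimal Python sketch with psycopg2 follows; the host names, database name and user are made up for illustration:

    import itertools
    import psycopg2

    MASTER = {"host": "db-master", "dbname": "gis", "user": "osm"}
    SLAVES = [
        {"host": "db-slave-1", "dbname": "gis", "user": "osm"},
        {"host": "db-slave-2", "dbname": "gis", "user": "osm"},
    ]

    # Round-robin over the read-only slaves.
    _next_slave = itertools.cycle(SLAVES)

    def read_conn():
        # SELECTs can go to any sufficiently fresh slave.
        return psycopg2.connect(**next(_next_slave))

    def write_conn():
        # The daily bulk import must hit the master.
        return psycopg2.connect(**MASTER)

    # Example read served by a slave (table name from the osm2pgsql schema).
    conn = read_conn()
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM planet_osm_line;")
    print(cur.fetchone()[0])
    conn.close()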
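And for the third point, the single most useful index on rendering tables is a GiST index on the geometry column. osm2pgsql normally creates these for you, so this sketch only applies to geometry tables that lack one; the table, column and index names follow osm2pgsql's default schema and are otherwise illustrative:

    import psycopg2

    conn = psycopg2.connect(host="db-master", dbname="gis", user="osm")
    cur = conn.cursor()
    # GiST spatial index on the geometry column ("way" in the default
    # osm2pgsql rendering schema); skip this if the index already exists.
    cur.execute("CREATE INDEX planet_osm_line_way_gist"
                " ON planet_osm_line USING GIST (way);")
    conn.commit()
    conn.close()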