How well does PostgreSQL perform with a large number of databases?
We have a web application whose architecture requires that every registered user (a company, actually) be isolated from the others, i.e., I'll run the same web app with the same data model, but with a different data set for every customer.
So we thought about creating a separate database in Postgres for every customer. Can this solution scale to, say, 10-20K databases? How well?
Does anyone have a better solution for this?
Thanks in advance.
Solution 1:
At the low end, it basically boils down to "can you absolutely say that you have no shared data?" Unlike MySQL, the database is an absolute boundary in PostgreSQL. If you go with separate databases, you cannot `SELECT zip_code FROM common.city_zip WHERE city=...` (at least not without `dblink`).
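For reference, if you did split into per-client databases and still needed a shared lookup, a `dblink` query looks roughly like this (a sketch only; the `shared` database name, the `common.city_zip` table, and the column types are assumptions):

```sql
-- Reach into a hypothetical "shared" database for lookup data via dblink.
CREATE EXTENSION IF NOT EXISTS dblink;

SELECT t.zip_code
FROM dblink('dbname=shared',
            'SELECT city, zip_code FROM common.city_zip')
       AS t(city text, zip_code text)
WHERE t.city = 'Springfield';
```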
If you have any shared data at all, PostgreSQL's "schema" is similar to what MySQL calls a "database". You can `CREATE SCHEMA clienta; CREATE TABLE clienta.customer (...);`. You would create a schema for each client, that client's user would have their schema first in their search path, and permissions would be granted so that Client A's user would have access to the `clienta` and `public` schemas (and their tables).
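A rough sketch of that layout, assuming one login role per client (the role, table, and column names here are illustrative, not from the original answer):

```sql
-- One role and one schema per client; the schema is owned by that role.
CREATE ROLE clienta_user LOGIN PASSWORD 'change_me';
CREATE SCHEMA clienta AUTHORIZATION clienta_user;

CREATE TABLE clienta.customer (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

-- The client's own schema comes first in their search_path,
-- followed by the shared public schema.
ALTER ROLE clienta_user SET search_path = clienta, public;

-- Read access to the shared data kept in public.
GRANT USAGE ON SCHEMA public TO clienta_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO clienta_user;
```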
Your issue is going to be that at the high end of the number of clients, each table is stored as a file, so whether you go with one database per client, one schema per client, or table names like `${client}_customer`, you will likely run into file descriptor limits with 10k clients even if you only had one table per client (plus one file descriptor per connection). Of course, you can adjust the kernel's maximum number of file descriptors on the fly using sysctl, but the per-process limit (ulimit) will require restarting PostgreSQL if you set it too low the first time around.
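To get a rough sense of how many files a server would end up juggling, you can count the on-disk relations per database and compare that with PostgreSQL's per-backend cap (an estimate only; `pg_class` is database-local, so run it in each database):

```sql
-- Tables, indexes, TOAST tables and materialized views each need at least
-- one file on disk in the current database.
SELECT count(*) AS relations_with_storage
FROM pg_class
WHERE relkind IN ('r', 'i', 't', 'm');

-- Cap on the number of files a single backend keeps open at once.
SHOW max_files_per_process;
```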
The alternative is to have "one big table" with a client column that identifies which client each row belongs to (ideally by username, if you have one database user per client; this makes what follows a LOT easier). By not granting the clients any access at all to this table, you can create client-specific views (or use `session_user` to identify the current client). Updates can't be done directly through a view, though, so you would need to define functions to insert/update/delete on the table (one set of functions per client, or else using `session_user`), with the functions using `SECURITY DEFINER` to execute as a special user with permission to insert/update/delete on the tables. (Note: `session_user` is used because `user` and `current_user` are based on the current context, and within a `SECURITY DEFINER` function these would always be the user who defined the function.)
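A minimal sketch of that arrangement, assuming one login role per client and illustrative table, view, and column names (none of these identifiers come from the original answer):

```sql
-- Shared table; clients get no direct access to it.
CREATE TABLE customer (
    client    text NOT NULL,      -- login role name of the owning client
    id        serial,
    cust_name text NOT NULL,
    PRIMARY KEY (client, id)
);
REVOKE ALL ON customer FROM PUBLIC;

-- Read access: a view that filters on the connected client's login role.
CREATE VIEW my_customer AS
    SELECT id, cust_name
    FROM customer
    WHERE client = session_user;
GRANT SELECT ON my_customer TO clienta_user;

-- Write access: a SECURITY DEFINER function owned by a privileged role,
-- stamping every new row with session_user rather than trusting the caller.
CREATE FUNCTION insert_customer(p_name text) RETURNS void AS $$
    INSERT INTO customer (client, cust_name)
    VALUES (session_user, p_name);
$$ LANGUAGE sql SECURITY DEFINER;
GRANT EXECUTE ON FUNCTION insert_customer(text) TO clienta_user;
```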
Performance-wise, beyond the fd issue, I honestly don't know what would happen with 10,000 databases in PostgreSQL versus having one large table with 10,000 clients' worth of data in it. Proper index design should keep the large table from being slow to query.
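Concretely, that mostly means leading every index with the client column, so each client's queries only touch their own slice of the index (following the hypothetical names from the sketch above):

```sql
-- Composite index so per-client lookups don't scan other clients' rows.
CREATE INDEX customer_client_name_idx ON customer (client, cust_name);
```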
I will say that I went with separate databases for each client here (we add servers to keep the system usable, shifting client databases to new servers as needed, so we will never get to 10k databases on one server). I've had to restore individual clients' data from backups for debugging or due to user error on a regular basis, something that would be an absolute nightmare with the "one big table" design. Also, if you intend to sell customization of your product to your clients, the "one big table" design might end up hobbling your ability to customize the data model.
Solution 2:
Without more details about your application, it's difficult to say that you will get any additional security from this setup. If each client connects to the web app and there is a shared user from the web app to the database, then you haven't isolated your data in a way that is any different from using a single monolithic database. Accessing your data through properly parameterized stored procedures will provide the level of isolation you're looking for without the administrative headache of managing 10,000+ databases across any number of servers.
I have personally run a similar setup on a single database server using nothing more than parameterized stored procedures hitting a single database. If you can guarantee that the only access to the database is through stored procedures, there is no danger of data co-mingling in the results.
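As a hypothetical illustration (not the actual setup described above), such a function takes the client identifier as a parameter and applies it in every WHERE clause, so a caller can never reach another client's rows; the table and column names here are assumptions:

```sql
-- Every query inside the function is scoped to the client id passed in.
CREATE FUNCTION get_orders(p_client_id integer)
RETURNS TABLE (order_id integer, total numeric) AS $$
    SELECT o.order_id, o.total
    FROM orders o
    WHERE o.client_id = p_client_id;
$$ LANGUAGE sql STABLE;

-- The web app binds the authenticated client's id when calling it:
-- SELECT * FROM get_orders(42);
```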
If you do want to go forward with your design, here are my primary concerns:
- running out of open file descriptors (`ulimit -n`) on your host OS
- tuning 10,000+ databases for different querying patterns
- administering 10,000+ databases with different security concerns (backups and potential restores; do you really want to restore 10,000+ databases if there is a server failure?)
- rolling out changes across 10,000+ databases