29 May 2013

Multi-tenanting Ruby on Rails Applications on Heroku - Part III

Schema-based vs Row-based Methods
My last two articles introduced the topic and outlined common multi-tenanting methods. This article examines the pros & cons of both methods, especially in regard to apps using Postgres on Heroku. Most of the content came from an email exchange with Daniel (a Postgres wizard genius) at Heroku while I was migrating an existing bamboo-stack Rails 2.1 app to the cedar stack and Rails 3.1. I had already expected that I would have to change my method of multi-tenanting: my original implementation for the Rails 2.1 app involved monkey-patching ActiveRecord and embedding both the tenant number and the row id number in each record's id.
The email exchange has been edited, but it leaves in various tangents raised in our discussion of multi-tenanting methods. In the end, I decided to re-implement my multi-tenanting using a row-based methodology and turned it into a gem called milia (Swahili for stripe).

Question: pg_restore is excruciatingly sloooooow. I am starting a new staging area on the cedar stack for my production app. So I did a pg_restore from the production app DB (1.4MB) to the staging app; I started it 3 hours (yes, HOURS) ago, and it still hasn't completed the restore.
Me:
Thanks for looking into it. Currently, the way that I do multi-tenanting is by tablename, not rows. So my DB structure is wide and shallow, not narrow & deep. Example: books__w1, books__w2, etc. I'm running in beta now, so I have 9 tenants x 62 tables = almost 560 tables + 600 indexes, and 560+ sequences. I wonder if that's the reason for the slowness. I'm considering a move to do the multi-tenanting by postgres SCHEMA, and if PG is inefficient in handling a plethora of tables, that would be a great incentive to push the change. What do you think?
Daniel:
We have seen the scalability limits of pg_dump in particular (the postgres server itself was still fine) when there were tens or hundreds of thousands of database objects (tables, sequences, et al). It also tends to make the backups take a very long time, and now that you mention the sheer number of things, that seems like a totally plausible contributing cause. The planner also becomes more expensive at resolving identifiers when there are so many symbols -- one can amortize this by using prepared statements to avoid re-running the planner.
The main advantages of schema-oriented multi-tenancy:
  • Each tenant can be cleanly lifted out of the database and moved around by dumping its schema
  • Unlike row-based multi-tenancy, you do not repeat tenant information all the time in every record, so overall compactness is greater
  • Locality is often better, since data for one customer that is likely to be accessed frequently is physically adjacent. Copying out a single tenant for analysis can be much, much faster.
  • More fine-grained locking: a double-edged sword, but typically for the best; lock contention is often reduced when using more schema objects that do not have contending queries. This affects indexes in particular.
The downsides:
  • The planner becomes more expensive when there are a huge number of identifiers to pick through.
  • Some tools (like \dt in psql, or pg_dump) can have a hard time coping with, say, millions of schema objects, whereas millions of rows is an extremely common use-case.
Generally I prefer row-based multi-tenancy because it's much more common and tools are designed around handling a lot of records in a table.
Schema-based multi-tenancy is still compelling because it is in some ways very convenient (just load the application in a different schema context, use pg_dump with a schema name to back up exactly one customer, upgrade customers in a rolling-fashion independently), and really *ought* to work well, but is less well-trodden.
I think both can be made to work, but when doing schema-based multi-tenancy, once you cross a certain number of schemas you will have to know a lot more about the vagaries of a database's implementation rather than thinking about the logic of the problem. However, as above, it is not without advantages, too.
Me:
Thank you for this excellent response; it is exactly the kind of comparative discussion for which I've been searching unsuccessfully. It is worthy of being an article in the Heroku docs!
A couple of follow-up clarifying questions:
1. The postgres planner variables seem fairly complex .. but it doesn't appear that's what you're suggesting. You say "one can amortize this by using prepared statements to avoid re-running the planner" .. I am not familiar with how to do that .. are you suggesting I use a VIEW for that (which I have used for complex queries)?
2. Wouldn't use of the Postgres SCHEMA capability be efficient? http://www.postgresql.org/docs/8.4/static/ddl-schemas.html I could set the search_path to just the tenant currently in session; then it shouldn't be any more expensive than using the default public schema for the 60 tables I have, rather than my current method of putting a suffix onto the tablename. That method involved monkey-patching ActiveRecord, and I don't think it will work in the Rails 3.x way of doing things. I've been worried about scaling my app after I end beta .. what happens with 1000 tenants and 100,000 tables/indexes/etc.? It sounds like you share that concern. But how would a schema-based tenancy compare?
3. I originally considered many of the pros/cons you mentioned. It seems to me that row-based tenancy is less secure unless it relies on an underlying DBMS failsafe mechanism of some type, whereas the schema-based (or tablename-based, which I currently employ) method both locks the tenant code into the record id and only hands the DB a query scoped to a single tenant's space, potentially limiting damage in case of an error or intrusion.
Daniel:
A demonstration of prepared statements: Here I prepare a statement, causing it to be planned and the plan to be bound to a name:
  fdr=# PREPARE find_post_by_name AS SELECT * FROM posts WHERE id = $1;
  PREPARE
Later, I execute it:
  fdr=# EXECUTE find_post_by_name(1);

   id | user_id | contents
  ----+---------+------------------------------
    1 |       4 | This is a post about movies
  (1 row)
I think support for this in an ORM layer has been added in Rails 3.1, which is super-handy. There does seem to be a bug with the way it supports schemas (precisely because it does *not* re-resolve identifiers); this user on this mailing list is doing something that sounds a lot like what you are doing:
http://groups.google.com/group/rubyonrails-core/browse_thread/thread/af654e29cc94cabb
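As an aside from my end, the same thing can be driven from Ruby by going straight through the pg driver rather than the ORM; the statement name and table here are just illustrative:

  # prepare once per connection, then execute with only the parameters;
  # raw_connection hands back the underlying PG connection object
  raw = ActiveRecord::Base.connection.raw_connection

  raw.prepare("find_post", "SELECT * FROM posts WHERE id = $1")
  result = raw.exec_prepared("find_post", [1])
  result.each { |row| puts row["contents"] }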
In general, people do not stress-test postgres as often with a truly huge number of schema objects. There have definitely been missing indexes on certain catalog tables in the past when people have tried (these usually do get added in the next release), but even then, there's only so much one can do: larger data structures just require picking through larger caches and more memory. With a very large catalog, the pace of resolving an identifier is constrained by O(log(n)) to poke through the index, just as is the case with finding tuples.
Also, dumps and restores of the entire database will be slow, in part because underlying file systems are just not that fast at dealing with hundreds of thousands of files. These weaknesses are neither helped nor hurt by the existence of schemas.
I do think schema-based multi-tenancy is probably preferable to name mangling, should you decide to use some kind of schema-object partitioning for users. The caveat here is that SQL doesn't support recursive namespacing: one level of schema is all one gets (there are no nested schemas), so if you want to use schemas for another purpose as well, you are out of luck: you are back to name mangling.
I would seriously look at patching your A/R (again...) or using subclassing or something to always apply a WHERE clause to *every* query issued against the database, making it impossible for most of the program to even consider data that doesn't belong to the relevant tenant. A convention for tagging every row with tenant information would accompany this approach. Lots of paranoid assertions at multiple levels may also be advised.
There are numerous other tricks, but at the end of the day, I'd look at the general topic of "row level security". There may even exist attempts at solving this problem, although I am not privy to how mature they are.
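To illustrate what Daniel is suggesting here, a rough sketch (model, module, and column names are mine, not a finished implementation) of pinning every query to the current tenant with a Rails 3.1-style default_scope:

  # Sketch: any model declaring acts_as_tenant gets an automatic
  # "WHERE tenant_id = ?" on every query. Thread.current[:tenant_id] is assumed
  # to be set at the start of each request (e.g. in a before_filter).
  module ActsAsTenant
    def acts_as_tenant
      default_scope { where(:tenant_id => Thread.current[:tenant_id]) }

      # paranoid assertion: refuse to save a row tagged for the wrong tenant
      before_save do |record|
        record.tenant_id ||= Thread.current[:tenant_id]
        raise "tenant mismatch" unless record.tenant_id == Thread.current[:tenant_id]
      end
    end
  end

  ActiveRecord::Base.extend(ActsAsTenant)

  class Book < ActiveRecord::Base
    acts_as_tenant
  end

The paranoid before_save is the sort of multi-level assertion Daniel mentions: even if some code path escapes the scope, the row cannot be written against the wrong tenant.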
Me:
Thank you so much for the informative responses; I appreciate your expertise with Postgres. I noticed several methods dealing with schemas in the Rails postgres connection adapter, so I will be exploring using them. I don't think the lack of nested schemas will be an issue. I think O(log(n)) .. similar to index searches .. is reasonable.
My current method uses a singleton to control the multi-tenant nature of the app; other than placing an "acts_as_tenant" in the models, all other CRUD usage looks the same as in any other Rails app.
When I set up my internal environment at the start of each session, I set the tenant for the given user. acts_as_tenant essentially redefines AR::Base#table_name to mangle the table name based on the given tenant.
In AR 2.3.x, I intercept at the beginning of find: any IDs specified in a select query are validated against the allowable tenants for the user. The IDs used are 64-bit, where the upper 32 bits are the tenant_id and the lower 32 are the actual record id. When a table is created, the min/max & sequence ID for that table in that tenant are established.
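The arithmetic behind those composite IDs looks roughly like this (a sketch of the scheme just described, not the actual 2.3.x code):

  # upper 32 bits = tenant_id, lower 32 bits = per-tenant record id
  def composite_id(tenant_id, record_id)
    (tenant_id << 32) | record_id
  end

  def split_composite_id(id)
    [id >> 32, id & 0xFFFFFFFF]    # => [tenant_id, record_id]
  end

  composite_id(9, 1480)            # => 38654707144
  split_composite_id(38654707144)  # => [9, 1480]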
If I re-do it, I think I'd like to work at a lower level, in the postgres connection itself. That would allow the IDs to be verified there, etc.
My original thought, to use search_path, won't work: it's not reentrant (the Postgres DB is a single instance .. I might have numerous dynos accessing it simultaneously). But PG does allow schema_name.table_name designations .. and the Rails pg connection appears to support that as well. The trick will be to make it transparent to the app itself.
I don't like the existing multi-tenant gems; they are too intrusive in the code. What happens if a programmer forgets to wrap a DB access in the code/control blocks those gems require? That's a bug I don't want to have to hunt for.
Daniel:
My recollection is that search_path is a per-backend setting; when you want to switch tenants, your code can rebind the search path, paying mind to the existing caveat (really a bug) in Rails 3.1, if that mailing list post is correct. If you have threads sharing a connection then that does get messy (but then again, so do many features, including transactions), and qualifying names is a much better idea. I'm not sure I follow what you mean by "the Postgres DB is a single instance". In many ways, outside the data itself, each backend/connection is an "instance" and can have many of its own settings.
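As an aside from my end, a minimal sketch of that per-connection rebinding in a Rails controller (the one-schema-per-tenant naming convention and the current_tenant_id helper are hypothetical):

  # Sketch only: assumes one Postgres schema per tenant, named like "tenant_42".
  # search_path is a per-connection (per-backend) setting, so each dyno's
  # connection can point at a different tenant without affecting the others.
  class ApplicationController < ActionController::Base
    before_filter :scope_to_tenant

    private

    def scope_to_tenant
      tenant_schema = "tenant_#{current_tenant_id}"   # hypothetical helper
      ActiveRecord::Base.connection.schema_search_path = "#{tenant_schema}, public"
    end
  end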
The big risk in having huge catalogs is that if there is a case Postgres upstream did not foresee, then nominally you will be stuck waiting one whole release cycle for the index addition to the catalog. Plus, you have to upgrade to get that index. Basically, you lose the ability to put out your own fires, and you will have to be on a more stringent upgrade treadmill for Postgres in the event of an issue.
On the other hand, It Has Been Done for tens or hundreds of thousands of database objects. I will caution you that at this moment (on Heroku) PGBackups (and pg_dump) support is going to be an issue, and at this time we cannot support databases that simply overwhelm pg_dump.
Heroku's continuous archiving is so far able to deal with such database layouts, so the fork/follow features will still work.
There have been various approaches to row-level security, so I think that's what you want to investigate. There is this one somewhat crazy project called "Veil" that is *not* supported on Heroku, although I am intrigued by it. It's even more involved than an average extension, requiring a preload-hook at database startup.
All in all, I'd say your outcome with row-level security is much more predictable, but that's not to say there are not nice properties to schema isolation.
Me:
You've really got me reconsidering my M/T strategy (which I first developed in 2008, back in the Ruby 1.8 days, I think). I'm actually leaning towards a row-based method after thinking about your concerns.
Guy Naor also has a good discussion of different M/T strategies with Rails examples: http://aac2009.confreaks.com/06-feb-2009-14-30-writing-multi-tenant-applications-in-rails-guy-naor.html. He also confirmed that the Postgres search_path is CONNECTION-based, so it is feasible.
But Postgres has nice updatable Views & Rules which make it possible to lock down all table row accesses and force a tenant check on each row (perhaps tied in to a Rails scope as well?). I looked at the Veil project, but I think it's overkill for what I need, and implementing it in a Heroku environment looks prohibitive. I can force all tables to have a tenant_id column, treat a nil tenant as universal (user, world, product tables), etc.
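As a rough sketch of that idea (not something I have running), a view can expose just the current tenant's rows plus the universal rows. It assumes the app issues SET app.tenant_id = '<n>' on each connection, that the Postgres version accepts such ad-hoc custom settings, and that making the view writable on this vintage of Postgres would still need RULEs or triggers:

  # Sketch: expose only the current tenant's rows (plus universal rows).
  # Assumes "SET app.tenant_id = '<n>'" has been issued on this connection.
  ActiveRecord::Base.connection.execute(<<-SQL)
    CREATE OR REPLACE VIEW tenant_books AS
      SELECT * FROM books
       WHERE tenant_id = current_setting('app.tenant_id')::integer
          OR tenant_id IS NULL;
  SQL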
Row-based will be much easier to implement than schema (or its table_name mangling equivalent).
Current data analysis and future projections
Below is a table of some data based on my 4 most active beta clients after a year of usage, and extrapolations for each of three possible growth scenarios: PAYS RENT, LIFESTYLE, VENTURE.
scenario    tenants    users     table-A     table-B      tables  pg_dump
                  n        n     rows/yr     rows/yr           n      hrs
--------------------------------------------------------------------------
ave               1       91       2,019       1,480         100     0.04
beta              4      363       8,076       5,922         400     0.2
pays rent       100    9,075     201,908     148,042      10,000     4
lifestyle     1,000   90,750   2,019,083   1,480,417     100,000    42
venture      10,000  907,500  20,190,833  14,804,167   1,000,000   417

Assumptions used for projections
I'm assuming that pg_dump time will be linear. I've included row counts for my two biggest growth tables. The app does automatic duty-roster scheduling for collaborative organizations (such as non-profits). Tables A & B both show the approx ANNUAL row growth rate. All other tables are more or less dependent on the number of users, which tends to reach a steady state per subscribing client (the organization).
The tables column refers to the number of tables (and their primary-id sequences, but not including indexes).
I rely on pg_dump for backups, refreshing the staging app, refreshing the local development app, etc. Having it grow to the point where a dump requires an hour or more is not practical.
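For what it's worth, the pg_dump column above is simple linear scaling; the measured averages work out to roughly 2.5 minutes of dump time per tenant (my back-of-the-envelope figure, not a benchmark):

  MINUTES_PER_TENANT = 2.5   # rough per-tenant pg_dump cost inferred from the beta data

  def projected_pg_dump_hours(tenants)
    (tenants * MINUTES_PER_TENANT) / 60.0
  end

  projected_pg_dump_hours(100)     # => ~4.2  (pays rent)
  projected_pg_dump_hours(1_000)   # => ~42   (lifestyle)
  projected_pg_dump_hours(10_000)  # => ~417  (venture)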
Summary of ROW-BASED Pros & Cons
Pros:
  • simpler implementation
  • ability to use existing Rails migration tools
  • use of Postgres Views/Rules & Rails 3 scopes to enforce tenant isolation
  • faster pg_dump; the DB parser/planner is optimized for the typical use case of SELECT queries
  • no need to jerry-rig IDs
  • no monkey-patching of Rails A/R (maybe just the connection)
Cons:
  • might break down at 20 to 50M records in a single table?
  • difficulty in partitioning?
Daniel:
I think row-based multi-tenancy is the more well-tested model simply because lots of software assumes there will be many millions of records in a table. Consider using CREATE FUNCTION with LANGUAGE SQL or LANGUAGE plpgsql (the procedural variant) instead, to handle data modification, should you opt for database-side constructs. I think it's fine to do multi-tenancy at either the application or database level; the only advantage to the latter is perhaps making those multi-tenant-aware manipulations available to other applications, or gaining one more level of defense.
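As an aside from my end, a database-side construct along those lines might look like the following migration sketch (the table, columns, and function name are made up):

  # Sketch: push the tenant check into the database itself, so a write that
  # omits the tenant simply fails regardless of which application issues it.
  class AddTenantedInsertFunction < ActiveRecord::Migration
    def up
      execute <<-SQL
        CREATE OR REPLACE FUNCTION insert_book(p_tenant integer, p_title text)
        RETURNS void AS $$
        BEGIN
          IF p_tenant IS NULL THEN
            RAISE EXCEPTION 'tenant_id is required';
          END IF;
          INSERT INTO books (tenant_id, title) VALUES (p_tenant, p_title);
        END;
        $$ LANGUAGE plpgsql;
      SQL
    end

    def down
      execute "DROP FUNCTION IF EXISTS insert_book(integer, text);"
    end
  end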
The main reason to partition a table is to be able to drop and scan entire partitions in one shot (physically organizing data to be adjacent for these reasons). There are definitely (many) databases out there with 50 million records in a single table that meet requirements well, but it basically depends on workload and how big (wide) the records are. For example, indexing a text column is many times more expensive than an integer one. If you are particularly worried about not painting yourself into a corner, you will have to experiment and write simulators. Remember that Heroku databases are charged per hour or even more granular than that, so if you have a simulation you like you can provision => run => deprovision right away.
Me:
I've finally completed the migration of my app to Rails 3.1.3, including the total re-write of the multi-tenanting. I chose the row-based approach that you recommended, and thought you'd like some metrics for future reference.
  • Migrating the existing database: I had to write a combination of a script (18 steps), rake tasks, and a small amount of Ruby code (about 200 lines) to completely transform the database from schema-based to row-based. I wanted it to be automatic (it was able to take the existing production DB and turn it into the new DB in under 8 minutes) and I wanted it to be pristine (ALL indexes, sequences, and tables as though created fresh). I chose to do the ID field transformations (64-bit down to 32-bit) locally so that I could use Postgres functions.
  • The resulting DB is 50% the size of the schema-based one.
  • pg_restore is lightning fast now.
  • DB accesses are also significantly faster; Postgres is more efficient with 32-bit ID fields than with 64-bit ones.
  • Row-based multi-tenanting, using Rails default_scope, was much simpler and totally non-intrusive.
  • I made it into a gem called "milia", which is production-ready (although my tests are a bit behind) and in use now in my production app with Devise & DelayedJob.

The next article will be a tutorial for using my multi-tenanting gem, milia, in a RoR app on Heroku.
