
Re: Managing database change: improvements needed





On Wed, Aug 26, 2015 at 12:13 PM, Erik Huelsmann <..hidden..> wrote:
Hi,

Hi all,

There are a few issues that I have with the way we currently handle database schema changes.

Current situation

============

We have a main schema file for the table definitions and a number of modules with stored procedures grouped by "subject".

Next to that, there's a "Fixes" file, which contains all incremental updates to the schema since some version.

When I want to change the schema as part of our development, I need to change the schema definition files *and* I need to add the schema changes to the Fixes.sql file.
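
As a made-up illustration (the table and column names below are hypothetical, not actual code from the tree), adding a single column today means touching both files:

-- in the main schema file, the column becomes part of the table definition
CREATE TABLE invoice_note (
   id serial PRIMARY KEY,
   invoice_id int NOT NULL,
   note text,
   created_at timestamp NOT NULL DEFAULT now()
);

-- and the same change goes into Fixes.sql again as an incremental statement,
-- which errors on any database that already has the column
ALTER TABLE invoice_note ADD COLUMN created_at timestamp NOT NULL DEFAULT now();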


I agree this is less than ideal.  I think we need to think of a comprehensive tooling approach.  There are a few reasons we haven't done this yet (mainly that getting the tooling right is a lot of effort, and in the past there have been other urgent things that needed that effort first).  However, I think this is growing in urgency as we continue development.

Right. I'm merely summarizing for those who didn't know the current situation and to support my line of reasoning. (I don't think you think that I am, but for the record: I'm not criticizing the current situation.)
 

Problems with the current approach

==========================

The current approach has (IMO) several problems:

* Loading the "Fixes.sql" file, even in a production environment, produces a truckload of ERRORs, because the schema may already contain some or all of the fixes in the file, leading to aborted transactions

* Making adjustments in multiple places can lead to incompleteness on either side

* Fixes.sql is loaded twice: before and after loading the stored procedures, which makes it unsuitable for some kinds of fixes

* In order to change types or parameter lists of stored procedures, they need to be dropped before being (re)created, resulting in DROP TYPE and DROP FUNCTION all over the place.

The high number of ERRORs in the log file when creating a new database is especially disconcerting to new users. I myself am running into the DROP TYPE and DROP FUNCTION problem more and more often as my refactorings for 1.5 and 1.4-mc reach deeper into the system.
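
To give an idea of the churn this causes, a change to a function's parameter list currently ends up looking roughly like this (the names below are hypothetical, not actual code from the tree):

DROP FUNCTION IF EXISTS invoice__save(int, numeric);
DROP TYPE IF EXISTS invoice_result CASCADE;

CREATE TYPE invoice_result AS (id int, posted bool);

-- the new signature; CREATE OR REPLACE alone cannot change argument
-- or return types, hence the DROPs above
CREATE FUNCTION invoice__save(in_id int, in_amount numeric, in_notes text)
RETURNS invoice_result LANGUAGE sql AS
$$ SELECT in_id, true; $$;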

Additional requirements

=================

* Next to solving the problems above, some users have expressed the desire to extend LedgerSMB locally. However, the way we currently upgrade the system removes all customizations (at least from 1.2->1.3->1.4; maybe not 1.4->1.5). If our changed approach could help solve this requirement, all the better.

* The solution chosen preferably works with a strategy of branching and merging as we do in the development cycle.

Agreed on all the above. 

Good. Then we have good reasons to change :-) 
 

Possible solutions

=============

One thing that I think we should do is separate development from production when loading the database. That is to say: creating a new database is a different step from upgrading a production database, which is in turn very different from working on a development database. Our processes should provide the best fit for each.


I am not sure I understand what you mean by this.  Do you mean that we need an extra build step?  I would not be opposed to that.  But if that isn't what you mean, could you be more specific?

Well, other than the standard Perl steps "copy to blib" and "manify POD", we don't really have build steps of our own. I'm indeed proposing that we have a build step. I envision it to be something like this:

* Create PostgreSQL database
* Play Sqitch scripts
* Dump schema from the database

and then consider the dumped schema a build artifact which doesn't need version control. This step would be executed before creating a distribution archive.

The pre-distribution step should - if we go with Sqitch - also include compilation of a Sqitch bundle, so we don't distribute development progress, branches, or whatever else might be in the Sqitch repository (which is - as far as I understand it - stored in the development tree).

I was assuming we would ship the schema.  So for a new db, no need for the end user to build from sqitch.  We could also use sqitch to generate migration files etc.  That way *we* need sqitch to get things going, but users don't need it.

Oh, one downside to sqitch, reading the docs, is that it requires a union merge on the sqitch.plan file, so we may want to wait a little for this to be supported in Mercurial.  That would also provide an easily checkable intermediary step.


Proposed solution to the duplicate change problem

=====================================

In the past we have been talking about adding a "build" step to LedgerSMB. This step could be used to produce a single schema file for quickly setting up a new database (company). This schema file would be a build artifact and no longer be version controlled itself. It'd be the result of a full schema creation step, including running the Fixes.sql file.

Additionally, this step could be used to deliver an up-to-date doc/database/ directory with current database documentation.

I like that idea.

While this step may feel inhibiting for development, one thing I'm thinking is that we may not need to require this step to be executed on a development system, except for when testing the production deployment.

And testing the build system. 

Exactly. I think that the Travis-CI tests (c/sh)ould include the generation of the schema and testing off of that.
 

Proposed solution to the slew of errors from Fixes.sql

=======================================

There are actually a number of solutions here, as I see it, all of them revolving around the idea that every schema change should be applied once. Basically, I see 2 categories of solutions:

1. Do it ourselves

2. Re-use the work of others

The benefit of (1) is that we get full control and no further dependencies for development or production. However, the downside is that we get to do all the thinking and problem solving as well.

In both categories I see similar solutions available:

a. Numbered changes

b. Named changes

and in (1) I see a solution that's not available in category (2):

c. Use PostgreSQL EXTENSIONs
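
To make (1) a bit more concrete: a do-it-ourselves "named changes" approach could be as small as a bookkeeping table plus a guard around each change. A rough sketch, with hypothetical names, not a worked-out proposal:

-- records which named changes have already been applied
CREATE TABLE IF NOT EXISTS db_patches (
   patch_name text PRIMARY KEY,
   applied_at timestamp NOT NULL DEFAULT now()
);

-- each change script checks the bookkeeping table before doing its work
DO $$
BEGIN
   IF NOT EXISTS (SELECT 1 FROM db_patches WHERE patch_name = 'add-created-at') THEN
      ALTER TABLE invoice_note ADD COLUMN created_at timestamp NOT NULL DEFAULT now();
      INSERT INTO db_patches (patch_name) VALUES ('add-created-at');
   END IF;
END
$$;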

As for category (2), I haven't looked too far around yet, but I did find sqitch (http://sqitch.org/); Sqitch offers a command line tool for the development workflow. Additionally, it provides facilities for deploying only the necessary changes with releases *and* it provides an API which we can use to upgrade the database from one (patch) release to another.
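
For a feel of what that looks like: a Sqitch "change" is a named trio of deploy/revert/verify scripts listed in the plan file, and the deploy script is plain SQL. Roughly (hypothetical change name, and this is my reading of the Sqitch docs rather than something I've run):

-- deploy/add_created_at.sql
-- Sqitch records in its registry that this change has been deployed,
-- so it only ever runs once per database.
BEGIN;

ALTER TABLE invoice_note ADD COLUMN created_at timestamp NOT NULL DEFAULT now();

COMMIT;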

PG Extensions are definitely in the "reuse work of others" category but the system solves a different problem than sqitch.

This is an interesting difference in views and we should explore it a bit.
 
Indeed the two together could be a very powerful combination.

PG Extensions would allow us to ship db components (whether table sets, etc) in a versioned way.  There is likely to be a learning curve here and it adds a major build dependency for people installing from our source packages.  But it provides some really nice features  including versioning and dependency tracking/enforcement.

What PG Extensions would allow us to do would be to effectively ship versioned pieces of our db schema and have an orderly way of upgrading them.

Ok. So far, I'm not sure I see the difference with Sqitch, apart from the fact that Sqitch is something we'd have to depend on while extensions are built into PostgreSQL. One thing where I can see a difference - but my current estimate is that we wouldn't be using it - is managing *external* dependencies. But with Sqitch, we can manage internal dependencies, so, if that's what we're using extensions for, what would be your envisioned difference?

There are a few differences.  First, reloading the db stored procedures can be as simple as "drop extension... create extension...." across each of the processing modules.  The second thing is we can track internal dependencies at install time without pushing another versioning system onto users (they need the pg build tools installed, but those will be present for any source build and are only a small package away for Debian users).

Also, we can make our version tracking useful for external modules.
 
  For things like stored procedure sections, we could always drop the extension and load a new version.  But it would make it very clean.  And we could even break up parts of the db into modules that could be independently versioned. 

Getting this tooling to work right is not a minor project, however.  It definitely cannot be done in a minor release because of the additional dependencies, and I think we'd need a policy against adding extensions to core outside of major versions.  However, it would also provide a nice platform for third-party integration and reuse.

Each of our modules could be an extension and separately versioned.
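
As a sketch of what that could look like (extension names and version numbers are hypothetical), PostgreSQL would then do the version bookkeeping for us:

-- initial install of a hypothetical stored-procedure module
CREATE EXTENSION ledgersmb_invoice VERSION '1.4.0';

-- later upgrade; PostgreSQL runs the packaged update script
-- (ledgersmb_invoice--1.4.0--1.4.1.sql) to get from one version to the next
ALTER EXTENSION ledgersmb_invoice UPDATE TO '1.4.1';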

Ok. I like the technical side of that, but I'm not sure we need that at this point. I *do* see value for this in terms of optional modules. Say the way we used to have Budgetting as an extension in 1.3 (although it solves only 20% of the problem, because hooking up Budgetting into the web-application was actually a bigger problem than getting the tables installed).

Right.  This is in part for the benefit of optional modules and third party applications.

For branching and merging, the extensions would have a number specific to development head and could be re-numbered during the build process.  For modules all we really need to do is drop extension/create extension if the version number has changed.

We'd probably want to reserve a number series for "next in version series."  I am thinking that using 999 would be a good one, so 1.4.999 would be "whatever next comes in 1.4" and this would allow us to install and run out of the development tree.

Hmmm. Sqitch - like version control systems - does this by itself (in this case by generating a hash, like Git and other DVCSes).

Right.  I am proposing a difference between what we *ship* and what we develop for storage schemas.  I am also saying I don't think that is needed for the stored proc modules.
 

As for sqitch, I have no experience with it yet. It's supposed to work well with branching and merging. One thing it *does* do is integrate with version control tools, and it is meant to integrate with the project's repository. Ideally, when merging and branching, no additional processing is required to integrate the changes from branches. However, I'm not exactly sure that's what happens (given https://groups.google.com/forum/#!searchin/sqitch-users/merge/sqitch-users/GXqgt7nJ_1k/Vvg-r1HOEqMJ), but I think the referenced link is about reordering changes which have already been partially deployed. What I like about Sqitch is that it integrates with the VC system but is VC system agnostic, so some of us can use it with Git while others can keep using Hg-git as they currently do.

Ok so if we ship our storage schema as an extension (or group of extensions) we still need to have ordered versioning going on.

Well, it seems to me that that's what Sqitch does, *unless* you have a database which is running some development version. From tag to tag (which would be from release to release), ordering doesn't seem to matter. The problem the page seems to refer to - but I should probably subscribe to the mailing list and ask this question - is a database which is left in an intermediate state.

I am thinking that sqitch may allow us to do the grunt work of the schema change scripts as a part of the build phase (we have to strip out begin/commit statements since create/alter extension adds those), but it seems like it would be very helpful in managing that side.

Yea. Maybe this too is a question to ask Sqitch's mailing list. 

What I *am* sure about is that (1) will be completely change-oriented. *Maybe* we can use sqitch with the current module-based source code organization.

If we go with extensions we don't need to.  We just bump the version number on each change of a module and keep a catalog of what is required (and tests to verify, of course).  We can then drop/create extensions as needed for the procedural modules, and we can generate migration scripts for the storage extension(s).  The big difference is that instead of one db version we would have a bunch of versioned components, and likely a requirement that these match on login.

Hmm. Somehow that doesn't sound like a really attractive outlook to me: more things that (apparently) can go wrong.

Well, there's no reason we cannot easily have a config file that sets required versions.  Since these are tracked by pg internally, it would be very easy to check. 
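
A minimal sketch of such a check (the extension name pattern is hypothetical): the application could compare what the release expects against what PostgreSQL itself reports as installed, e.g.

-- the versions PostgreSQL tracks internally for installed extensions
SELECT extname, extversion
  FROM pg_extension
 WHERE extname LIKE 'ledgersmb_%';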

If we're going to go with Sqitch, I don't think it's a good idea to switch just then and there, but test it to build up some experience and then choose a well defined point in time to start using it.

Agreed.  Maybe start on plugins?

Actually, I was thinking of starting with 1.5 *after* it's been upgraded from 1.4? However, that would mean that *everybody* would need to migrate through 1.5.0, because that's the moment we'll move to Sqitch? (Or we could choose 1.6 or ...?)

That places a lot of responsibility on a .0 release.  I would suggest if we go that route, we still  maintain migration tools for 1.4 + 1.3 and we ship db schemas rather than sqitch plans.  Maybe after we have experience with it, we can then switch to sqitch plans in a future migration branch?
 

So, there are 2 proposals here. What do you say?

I don't see how we can get away from doing some things ourselves, so I guess the question is what and where to reuse other work. 

So my preference would be sqitch and pg extensions.  In my mind this would require a two-step build process for dbs:

1.  Build the schema extensions using sqitch
2.  Install all pg extensions using make.

Ok. Is this on releases only, or with our day-to-day work on the tree?

For db storage schema changes, on release only.  We can always play sqitch on dev databases. 

My view on using Sqitch for everything in our schema would be to replace the current schema loading with a call to Sqitch's API. Of course we have to do that ourselves, but it's not in the category that I'd call "do something ourselves", because all we have to do is create the code to delegate to "the other project's code". But figuring out *how* to do that, yes, that's our work.

BTW, another similar tool that may be of use:

http://search.cpan.org/dist/App-Dest/lib/App/Dest.pm

That may be more general as a build/deployment tool also.
 

The first is a developer-only task (sort of like running YACC is when building PostgreSQL -- not even required when installing from source packages) so there isn't a user-visible dependency there.

The second though does impose some user dependencies in that pg_config and other pg development tools must be installed before  the build can be done.  Of course for those installing debs, these would be done when building debs rather than when installing them.

If we use sqitch only, we also need it as a dependency, both for developers and for end users. However, we probably don't need to teach our users about it if we simply use it as a library.

True, but we have to support it.  And we have to depend on it, make sure it is installed, etc.  

Also, there are a bunch of areas I am not 100% sure about in this regard (which is why I think extensions are cleaner).

For example we support a bunch of different versions of PostgreSQL.  Some, but not all, will throw an error if you do something like:

create or replace function foo (bar int) returns bool language sql as $$ SELECT true; $$;
create or replace function foo (baz int) returns bool language sql as $$ SELECT true; $$;

The reason is that more recent versions of Pg tightened up the handling of named arguments so now the change in argument name requires a drop function before the new one can be created.  I don't know how sophisticated sqitch is in this regard, and how easy it will be to mess things up so a rebuild ends up with dropped functions that are never reloaded.
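
For the example above, on those versions the rebuild has to turn into something like this (sketch only):

-- the changed argument name (bar -> baz) requires an explicit drop first;
-- CREATE OR REPLACE alone is rejected on newer versions of Pg
drop function if exists foo(int);
create function foo (baz int) returns bool language sql as $$ SELECT true; $$;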

As we have it right now, storage tables are separate from data processing routines so with the extension system rebuilding is as simple as:

BEGIN TRANSACTION;
-- drop the current stored-procedure extensions
DROP EXTENSION...
DROP EXTENSION ....
...
-- recreate them at the new version
CREATE EXTENSION ....
CREATE EXTENSION ....
....
COMMIT;

If something goes wrong, the extensions are exactly where they were before the failed attempt.  In other words, our entire db rebuild/upgrade process would be entirely transactional (while right now it is one transaction per module), and it would be more predictable, clearer, simpler, and probably require less code.

So that is my preference and why.

Thanks for your response!

With my reaction above, I'm wondering how to get to a next step. We probably need to be practical in some sense and just start *somewhere*. But the question would be *where* and *with what* -- in the sense that the *what* question seems to be a choice between (as we now know them) (a) Sqitch or (b) Sqitch+Extensions.


Looking forward to further discussion,

--
Bye,

Erik.

http://efficito.com -- Hosted accounting and ERP.
Robust and Flexible. No vendor lock-in.










--
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.
http://www.efficito.com/learn_more
------------------------------------------------------------------------------
_______________________________________________
Ledger-smb-devel mailing list
..hidden..
https://lists.sourceforge.net/lists/listinfo/ledger-smb-devel