Re: Database schema changes && related data migrations

On Sun, Oct 23, 2016 at 2:50 AM, David G <..hidden..> wrote:

As of 1.5, we have a great database schema maintenance tool in place (sql/changes/) which remembers which changes have already been applied to the schema and which still have to be applied. This new tool has gotten rid of the truckload of warnings and errors that the repeated loading of Fixes.sql gave us on 1.3 and 1.4.

Preparing to merge my MC branch to master for 1.6, I'm thinking that the sql/changes/ infrastructure works perfectly to handle schema changes.

However, if such a schema change would be accompanied with the need to run a data migration, the sql/changes/ infrastructure suddenly won't work too great anymore, because I'm expecting that we'll want to improve on data migrations as we gain knowledge of the various data sets out there.

Regardless of the technical limitations of the sql/changes/ infrastructure, I'm also envisaging that we'll want to do "pre-migration checks" and allow users to correct data before the data migration (to a changed schema) is executed, if they have data that's non-compliant. While the pre-migration checks are definitely an easy extension to the sql/changes/ infrastructure, it's hard to provide feedback from the checks as it is. The other requirement (fixing data) can't be resolved at all.

My current thinking is that schema improvement becomes a continuously incremental thing instead of these one-off huge jumps such as we had on 1.3 and 1.4, *if* we find a way to address these requirements:

Just to clarify, the following points are all targeted at checks and changes to DATA to ensure it's consistent as per a new schema.
For example, there could be a case where a date that should always be set historically hasn't been sometimes.
A new schema may now have a constraint requiring it to be set, but the dataset needs to have the missing dates added.

That, or in the case of the MC branch, the original data migration is likely to overlook some scenarios for data to be migrated. My expectation is that we'll need two types of actions here:

a. Run an adjusted data migration on all new migrations

b. Run (once and only once) a migration-adjustment on migrations run prior to the improvements

What I'm looking for is a structure / design which satisfies these requirements so we can start coding the migrations.

1. Run pre-checks on schema changes

1a. generate a "pre-check" report

While I generally agree, we currently run all changes in a single call to the database schema manager which applies all changes, until it hits the first unsuccessful one.

2. Allow users to adjust / add / delete data in order for the pre-checks to be satisfied

2a. generate a "post-corrections" report

Our current changes are fully coded in SQL code, with a small Perl driver to manage the order in which the changes are being applied in the right order and only once (the last bit is the improvement over Fixes.sql).

From the 1.3/1.4 days, we have some pretty extensive SQL included in our LedgerSMB::Upgrade_Tests module. Those get tested every time an upgrade is being run and are not directly related to a specific change being applied. I'd rather not use a similar approach with the sql/changes/ infrastructure, because for any change that we ran in the past, we don't need to run the pre-migration checks any more.

3. Run post-schema-change data migrations

4. Allow data migration for the same change to be tweaked (i.e. different data migrations == different sha-sums)

5. Allow additional data migrations to be run *iff* (if-and-only-if) a specific data migration has been run in the past (my idea here is to allow the data to be "upgraded" according to progressive insight)

6. Ask the user to save the two reports and contact the team on #ledgersmb to discuss what needed manual adjustment.
The Logs can be provided *IF* it's required.

Ok. That works (for those users who don't mind sharing part of their data). However, you skipped the complicating factor that we may want to run "follow-ups to (3)" in later migrations, if a specific version of the migration script has been run before. Or am I making things too complex now?

It may be possible to come up with a way to anonymise the logs and provide an optional step to do so.

I would really love to hear your thoughts on the topic!

Other than my above insertions I think it's a sound plan that not only improves our ease of data migration, but also should drastically simplify imports from other software such as SQLLedger (which we have had quite a few migrations in the last 12 months) and hopefully others such as Quickbooks, MYOB

:-) Well, the plan is there, but now I'd like us to come up with a structure that might actually *do* this. One thing I can think of is that we add a file next to the change file which holds one or more queries to do the pre-migration checks. These queries would be executed and when returning zero rows, the schema change is good to go.

This idea doesn't address the supposedly required adjustment screens nor the post-migration scripts for migrating the data in the various envisioned versions. One step at a time though.

I'll assume that everybody is ok with the idea of having a second file with pre-checks. Then the next step will be: what to do if one of the queries *does* return a number of (failing) rows? Then we need:

- A way to tell the change-caller about the failing pre-checks

- A way for the caller to send feedback to the user

I'm formulating these as abstractnesses, because I don't think we should do as older code does - assume that we're in a web-app context and simply send a response to a presumed request. The caller is responsible for handling user feedback. The question then becomes: what information does the user need to be able to resolve the data issues?

My thought here is that the user needs:

- An explanation of what the change itself is trying to do

- An explanation of the condition that failed

- An explanation of the available corrective actions as the development team sees them

- For every technical field in the query's rows, a mapping to the associated description

- A list of failing records, with a clear indication of the column with the failed values

It's important to sort this column in a way that the failure (such as duplicates) is immediately apparent

- Help filling out the correct value (e.g. a list to select from if the values are restricted)

Optionally, the user should have the opportunity to delete records or add records in the target table, if that's marked as a suitable resolution strategy.

Once more assuming you agree with me here, I'm going to the next step and think how to implement the above. While I can envision the change-applying code to take a callback which receives the data required to provide the user with the indicated feedback, no provisions for it have been taken in the "check queries" proposal so far.

I'm wondering: if we name check files as follows: <change>.check.<n>.sql, and include one verification query in each file, then we can introduce a header which provides:

- An explanation of the condition that failed

- An explanation of the available corrective actions

A query which provides the failing records is already provided as that was the idea to introduce check queries in the first place.

Putting the description of what the change is achieving in each check file doesn't seem like a good idea, yet including it in the change file doesn't seem like a good idea either, because that changes the checksum and we don't want to *ever* change the checksum after a change has been created to prevent double application.

Better ideas?

Proceeding to the next topic:
As we intend to get the user feedback in a web-app situation, I think the callback needs to be able to:

- cancel the changes from being applied

- indicate problem resolution in these three variations:

* delete a row

* provide an alternate value

* insert a new row with values provided

I haven't thought about this in depth yet, but it seems we have enough to think about with the above.

Let's continue exchanging views to get the best solution possible.

BTW, this discussion isn't triggered by my desire to move MC to master, but rather my desire to add a few NOT NULL constraints to our database where I *know* there to exist conflicting data.

--

Bye,

Erik.

http://efficito.com -- Hosted accounting and ERP.

Robust and Flexible. No vendor lock-in.