[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: An Introduction to SODA in LedgerSMB



On Nov 25, 2007 8:09 PM, Joshua D. Drake <..hidden..> wrote:

> Although a correct view it is not a practical view :), which is my
> point. Normalizing out past (really 3rd) 4th form becomes not only a
> serious conceptual problem but an even more serious application problem
> in terms of performance and management.

Practicality depends on a number of things.

First, I see very few if any cases where our own designs are not BCNF
already.  And most of our conventions mean we shouldn't run into
problems with 4nf either.  The issue of semantically atomic fields is
actually a larger concern both with old and new code.

However, here is the basic tradeoff and why we need to approach it as such:

Higher normal forms, used within reasonable restrictions, buy us
(practically-speaking) a more flexible mathematical structure for the
data, so that as business needs change, the structure of the database
will not under any circumstances have to be reorganized (things can be
extended, new tables added, etc, but this is comparatively painless).
For example, there are all sorts of hypothetical business needs which
could force further decomposition of our database, especially where
natural data is concerned.  Also some types of queries are likely to
run faster (including a substantial portion of our reports).

The costs, especially when used in forced ways, include higher
conceptual complexity, potential performance issues for some types of
queries (especially inserts/updates touching large numbers of tables),
and so forth,

In our specific database, I think that higher normalization would mean
faster reporting (in general, some exceptions), but performance costs
on data entry.

TBH, most cases where I have seen normalization problems have not been
due to following mathematical methods, but rather various recipes
relating to modeling philosophies.  For example, with key/value
modelling it is quite easy to conform to high normal forms' math
definitions, but in most cases, key/value modelling is both useless
and dangerous (it is useful in 2 cases in LedgerSMB and both of them
are for storing program variables which have no intrinsic meaning).
In these cases, I would call the database "misnormalized."

Here are rules that I think can avoid most problems:
1)  If a row in a database relation does not contain sufficient
information to qualify as a statement of fact, the database is
misnormalized.

2)  If further normalization does not create the ability to use valid
data constraint practices to better enforce the data integrity, it is
misnormalized.

3)  Duplication of data is not the issue.  One should not argue for
further normalization on that premise alone.  All databases include
duplicate data, even if it is in foreign keys.

I will provide two examples (neither ones are proposals at the moment)
and address them according to the above statements:
1:  Suppose someone argues that the cities and states/provinces should
be tracked according to country in separate tables from street
addresses.  The argument someone might make might be that they might
want to constrain states in at least some countries to those which are
known, and then track cities separately for similar reasons.
a)  each row in the database contains enough information to qualify a
a statement of fact (i.e. there is a city named .... in the state of
..., and the state of ... is in the country of ....
b)  this does provide opportunities to further define valid data.

In short that might be a valid request and we could discuss it in
terms of performance, etc (I am not saying we should now, just saying
that I would probably not dismiss such a request out of hand).

Oh the other hand, suppose someone suggests that some towns
(particularly in places like Utah) have similar address numbering
schemes.  Hence, someone might suggest that we should break out known
street address strings into a separate table.  The argument might be
that this creates a multivalued dependency (and in terms of values and
domains they might be right).

However, a) the rows of the database do not contain valid statements
of fact, IMO, and b) they provide no opportunities to further define
valid data.  I would hope we could dismiss this without much
discussion :-).

BTW, in case you are afraid I am picking on the contact management
side, it is the major part of the database we have gotten to which
stores natural data in it (I consider accounting data to be
artificial) and we have not gotten to things like parts yet.
>
> >
> > I think that the key is to suggest that BCNF through 5NF ought to be
> > acceptable normal forms for this project, and we ought to, where
> > appropriate, push individual relations as far along that progression
> > as makes semantic sense for the data involved.
>
> I do not agree. As someone who deals with normalization issues everyday,
> I am saying, loudly :) that although BCNF through 5NF is *neat* it is
> not something we should "strive" for unless the data model explicitly
> requires it and most do not.

Under what cases do you think that BCNF makes management of data more
difficult than 3NF?  I might be missing something but it seems to me
that if there really are functional dependencies against portions of a
key, then those should be broken out into a separate table anyway to
the extent possible because they represent independent facts.  If it
is not really possible to do so without creating tables which don't
meet my above rules, I would question whether there really is a
functional dependency on a part of the key.

Furthermore, I am thinking very, very hard and I cannot think of any
tables in the new architecture which are not BCNF.


>
> Wait I am confused soda.location[] is an array not a compound type...

It is an array of a compound type.  My mistake :-(  This will be added
in 8,3.,,,,,,


My idea is quite simple:  Define an object data model and interface
model in the database.  I.e. create an object-oriented development
structure which applications of any language can write code generators
to access.  However with this being new functionality in 8.3, I think
it is premature to continue this discussion ;-)

Best Wishes,
Chris Travers