
Re: Proposal for file attachment API



On Mon, 4 Jul 2011, Chris Travers wrote:

On Mon, Jul 4, 2011 at 12:33 PM, Luke <..hidden..> wrote:

However, if we combine our two ways of looking at this, I think we have
the solution.

If you store files by ID, and internally reference them by ID at all
times, they can all be called "foobar.pdf" and it doesn't matter.

Ok, so which foobar.pdf attached to a sales order is which one?  I'd
rather have some sort of unique constraint on the file name even if it
isn't global.  My current thinking is to enforce uniqueness between the
primary attachment point and the file name.  So you can only have one
"contract.doc" attached to a given sales order.  However, if you have

Exactly.  Enforce name uniqueness at the document level.  "there is already
a file of that name attached to this document."

Ok, so here's the problem as I see it.

We can't enforce this in the Data Definition Language (DDL) without the
link being part of the file table.  Consequently we have to check it
procedurally.  I would prefer to put whatever logic can be expressed in
DDL into DDL, because that provides more pervasive enforcement.
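For illustration only (the table and column names here are placeholders,
not a proposal), the sort of procedural check I mean would be roughly a
trigger along these lines:

CREATE OR REPLACE FUNCTION check_attachment_name() RETURNS trigger AS $$
BEGIN
    -- The file's name lives in one table and the document link in
    -- another, which is exactly why a plain UNIQUE constraint can't
    -- express "one name per document".
    IF EXISTS (
        SELECT 1 FROM file_attachment fa
          JOIN file f ON f.id = fa.file_id
         WHERE fa.document_id = NEW.document_id
           AND f.file_name = (SELECT file_name FROM file
                               WHERE id = NEW.file_id)
    ) THEN
        RAISE EXCEPTION 'There is already a file of that name attached to this document.';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER attachment_name_check
    BEFORE INSERT ON file_attachment
    FOR EACH ROW EXECUTE PROCEDURE check_attachment_name();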

There's another side here too:  Suppose we have three orders each with
a file attached to them called "Contract.doc."  We consolidate them.

Hadn't thought of that.

How do we want to handle the file attachments?  If it were up to me,
I'd be displaying something like:

Contract.doc  Attached from order 42
Contract.doc  Attached from order 43
Contract.doc  Attached from order 44

Seems reasonable enough.  So instead of uniqueness on the filename alone, you'd have uniqueness on the filename plus the source attachment point?

The problem I see there is: what happens if order 42 is deleted after the consolidation?  To prevent that, you probably have to snapshot the data, which is not elegant at all.

That way all three can co-exist on an order without folks being
confused as to what they are or which is which.

It seems to me that to support order consolidation we'd still have to
do the same thing (an order-attachment-to-order table) or else run
into the name-conflict constraint, right?

It is seeming like that.

Either that, or during a consolidation or addition operation, you do automatic pre-extension numbering.

I.e., the file from order 42 becomes contract1.doc, the one from order 43 becomes contract2.doc, and the one from order 44 becomes contract3.doc.

Or OrdNumber-contract.doc, so: 42-contract.doc, 43-contract.doc, 44-contract.doc, which is probably better.

The user can possibly rename them before saving?
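Roughly, in SQL terms (order_attachment here is just a placeholder for
whatever table ends up holding the name and the order link, and 4242
stands for the consolidated order's id):

-- During consolidation, prefix each attachment's name with its source
-- order number and re-point it at the new order.  The right-hand side
-- of SET still sees the old column values.
UPDATE order_attachment
   SET file_name = order_id || '-' || file_name,   -- 42-contract.doc, etc.
       order_id  = 4242
 WHERE order_id IN (42, 43, 44);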

You're talking about doing it in the database, which takes us back to some
structure involving a table for each document class.  The way I was thinking
about it (assuming the routine logic would take care of managing it) is
probably too similar to how 1.2 would have done it, but your suggestion
below *seems* at first look to get the job done.

I think we are looking at this from two different angles.  I think you
are asking how to maximally decompose the data for greatest
flexibility.  That's not a bad question but I am asking how to enforce
constraints using the data definition language so that there is less
of a question of enforcement or confusion later.  If I understand the
discussion correctly it's an argument between normalization (in higher
normal forms) and constraint dependencies.

Yes.

To be clear, the tradeoff between the two approaches is that my
approach is narrower in the sorts of data it accepts up front and its
DDL is quite a bit more complex, and this is both a good and a bad
thing.

I have no objection to the DDL complexity, if the parts I want flexible can be made so. :)  I still envision most of this being black-boxed into API calls or database routines, from the third-party access perspective.

A simpler DDL (in the sense of not partitioning the file table as per
your proposal) provides greater flexibility.  It also reduces the
representation of the data to the bare minimum therefore allowing more
things to be done with the data on an application level without
changing the schema.  The tradeoff is that we cannot define semantic
constraints on the data through DDL, meaning that the tables cannot be
considered to be a safe API in the future.  If we go with a maximum
decomposition model like yours, I would prefer to change things as
follows:

CREATE TABLE file_storage (
    content bytea,
    mime_type int,
    id serial not null unique,
    PRIMARY KEY (content, mime_type)

Really?  A primary key on a blob?  I didn't even know you could do that.
But why?  Why key on the content, instead of generating a checksum or
similar up front and keying on that?

You dropped that from my version, but didn't say why.
Have I missed something there?

);

CREATE TABLE file_attach_base (
    file_name text not null,
    file_class int references file_class(id),
    ref_key int,
    file_id int not null references file_storage(id),
    attach_id serial not null unique,
    primary key(file_name, ref_key)
);

And then inheriting out the file_attach_base table, and creating
secondary linking tables as I previously suggested.  This would allow
breaking out the content into a separate table, but it would probably
make inserts pretty expensive because the unique index on the content
would have to be updated on each insert.  I doubt it would make it
more costly than the method you suggest, and the positive side would
be that hard links would be supported out of the box because file
names would be separate from contents.
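Something along these lines, say (sketch only; the child table, the class
id, and the referenced document table are just examples):

-- One child table per document class.  Keys and indexes aren't
-- inherited in PostgreSQL, so they are restated on the child.
CREATE TABLE file_order (
    PRIMARY KEY (file_name, ref_key),
    UNIQUE (attach_id),
    CHECK (file_class = 1),                  -- whatever id means "order"
    FOREIGN KEY (ref_key) REFERENCES oe(id)  -- the order table
) INHERITS (file_attach_base);

-- A secondary linking table, so the same attachment can also be shown
-- against, say, an invoice it was carried over to.
CREATE TABLE file_order_to_invoice (
    attach_id int REFERENCES file_order(attach_id),
    invoice_id int NOT NULL,                 -- would reference the invoice table
    PRIMARY KEY (attach_id, invoice_id)
);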

Yes, I like that, and I like the design so far, I think, except for that
primary key on the content.  Please clarify why we have to do that.

Having a checksum run on uploads, and then indexing that, is going to be slower, you think?

Using the cksum utility on a 100 MB file of random data (dd'ed from urandom), on a 1.6 GHz laptop:
real 5.62
user 0.47
sys 1.95

Given post limits and such, most uploads are going to be 1-2% of that size.

Is it just that indexing the content absolutely ensures no duplicates, whereas checksumming it only probably does?
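To make the checksum idea concrete, what I have in mind is roughly this
variant of your file_storage (the digest column and index name are just
placeholders):

CREATE TABLE file_storage (
    id serial PRIMARY KEY,
    content bytea NOT NULL,
    mime_type int,
    content_md5 text NOT NULL            -- e.g. md5(content), set on insert
);
CREATE INDEX file_storage_md5_idx ON file_storage (content_md5);

-- On upload, look up by digest first and byte-compare only the handful
-- of rows (usually zero or one) that share it:
SELECT id FROM file_storage
 WHERE content_md5 = md5($1) AND content = $1;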

It would still require a table inherited from file_attach_base for
handling unattached files, but that isn't hard to do.  Something
like:

CREATE TABLE file_import_source (
  id serial not null unique,
  label text primary key
);

insert into file_import_source(label) values ('email');
insert into file_import_source(label) values ('ledgersmb: unattached file addon');

CREATE TABLE file_unattached (
    CHECK (file_class = 4), -- or whatever
    FOREIGN KEY (ref_key) REFERENCES file_import_source(id)
) INHERITS (file_attach_base);

What would happen if a file becomes unattached by virtue of document deletion, or intentional dis-association?
Triggers?
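If triggers, I'd imagine something vaguely like this (purely a sketch,
with placeholder names, assuming the inherited child tables):

CREATE OR REPLACE FUNCTION detach_order_files() RETURNS trigger AS $$
BEGIN
    -- When an order is deleted, re-file its attachments under the
    -- "unattached" class rather than losing them.
    INSERT INTO file_unattached (file_name, file_class, ref_key, file_id)
    SELECT file_name,
           4,                               -- the "unattached" file_class
           (SELECT id FROM file_import_source
             WHERE label = 'ledgersmb: unattached file addon'),  -- or a better-fitting source
           file_id
      FROM file_order                       -- hypothetical child table
     WHERE ref_key = OLD.id;
    DELETE FROM file_order WHERE ref_key = OLD.id;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_file_detach
    BEFORE DELETE ON oe
    FOR EACH ROW EXECUTE PROCEDURE detach_order_files();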

If we go this route, the tradeoff is flexibility vs performance.  To
insert a file we have to check against the content and MIME type of
all existing files.  With a large number of files, I think that index
would get very, very large.

Agreed, which is why I suggest checksums.

a secondary link there (to be implemented in the future) to the sales
invoice, the user could see "contract.doc" linked from "sales order
12345."  There could only be one "contract.doc" linked from that sales
order to that invoice though there could be additional contract.doc's
from other sales orders onto the same invoice.

When a new file is uploaded, compare its CRC/checksum to the index of
stored files.  If there's no match, it's a new file: it gets a new ID and
we save it.  If the checksum matches, compare the contents to the N files
with the match.  If one of them matches, create a link to the existing
copy, by inserting a referential entry in whatever table is tracking
where files are attached.  If none of them matches, it's a new file.

I am still worried about handling naming collisions in a way that is
user-transparent.

Here's what I'm envisioning...

There are two public-facing database routines which insert files, and one
which associates them with documents.

The first inserts a file and associates it with a document:
* If the name of the uploaded file is the same as one associated with this
document, throw an error.
* It does the checksumish comparison that I spoke of above, without
reference to name.  If there's a match, create a link instead of storing; if
not, store the file.
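In rough, sketchy terms (the signature and names are only placeholders,
and I'm assuming the md5 column from earlier in this mail):

CREATE OR REPLACE FUNCTION file_attach_or_link(
    in_file_name text, in_content bytea, in_mime_type int,
    in_file_class int, in_ref_key int
) RETURNS int AS $$
DECLARE
    t_file_id int;
BEGIN
    -- 1. Is there already a file of this name on this document?
    PERFORM 1 FROM file_attach_base
      WHERE file_name = in_file_name
        AND file_class = in_file_class
        AND ref_key = in_ref_key;
    IF FOUND THEN
        RAISE EXCEPTION 'There is already a file of that name attached to this document.';
    END IF;

    -- 2. The checksum comparison, without reference to name.
    SELECT id INTO t_file_id FROM file_storage
     WHERE content_md5 = md5(in_content) AND content = in_content;

    -- 3. Store only if genuinely new, then link either way (in practice
    --    the link row would go into the appropriate child table).
    IF t_file_id IS NULL THEN
        INSERT INTO file_storage (content, mime_type, content_md5)
        VALUES (in_content, in_mime_type, md5(in_content))
        RETURNING id INTO t_file_id;
    END IF;

    INSERT INTO file_attach_base (file_name, file_class, ref_key, file_id)
    VALUES (in_file_name, in_file_class, in_ref_key, t_file_id);

    RETURN t_file_id;
END;
$$ LANGUAGE plpgsql;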

If we go this route and enforce everything in the procedures, we'd
have to say no third party access to the underlying tables, right?

Yes. I was assuming that all file access would be handled through the procedures, or by views.

So it seems like we have three options here:

1)  Semantically incomplete DDL constraints, no third party access to
the underlying tables
2)  Semantically complete DDL constraints, maximally decomposed
tables, performance hit, but third party access to underlying tables

Isn't your arrangement above, and in the previous message (secondary linkages), leading us toward option 2, but with option 1's private table access?

3)  Semantically complete DDL constraints, no enforced uniqueness on
file data, third party access to underlying tables

#2 seems best to me, if it can be done. I would like to preserve uniqueness of files as much as possible for several reasons which I have discussed at length. That means that files should be able to have multiple names, and should be linkable to multiple documents of various types. If we can do all that in the database without relying on procedures to pull it all together, then splendid.

Long run I am not really sure what the operational tradeoffs of such
decomposition are, regarding remapping on the fly and enforcing
uniqueness of data.  My personal preference is to give access to
tables where possible.

I agree with that, but if we have to... Do we have other tables in 1.3 where that's the case?

Luke