
Re: Proposal for file attachment API



On Mon, Jul 4, 2011 at 12:33 PM, Luke <..hidden..> wrote:

>>> However, if we combine our two ways of looking at this, I think we have
>>> the solution.
>>>
>>> If you store files by ID, and internally reference them by ID at all
>>> times, they can all be called "foobar.pdf" and it doesn't matter.
>>
>> Ok, so which foobar.pdf attached to a sales order is which one?  I'd
>> rather have some sort of unique constraint on the file name even if it
>> isn't global.  My current thinking is a uniqueness between the primary
>> attachment point and the file name.  So you can only have one
>> "contract.doc" attached to a given sales order.  However, if you have
>
> Exactly.  Enforce name uniqueness at the document level.  "there is already
> a file of that name attached to this document."

Ok, so here's the problem as I see it.

We can't enforce this in Data Definition Language (DDL) without the
link being part of the file table, so we would have to check it
procedurally.  I would prefer to put whatever logic we can into DDL,
because DDL provides more pervasive enforcement.
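To illustrate, with names anticipating the schema below: the rule we
want is essentially

    UNIQUE (file_name, ref_key)

and a unique constraint can only span columns of a single table, so
the document link (ref_key) has to live in the same table as the file
name for DDL to enforce it.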

There's another side here too:  Suppose we have three orders, each
with an attached file called "Contract.doc", and we consolidate them.
How do we want to handle the file attachments?  If it were up to me,
I'd be displaying something like:

Contract.doc  Attached from order 42
Contract.doc  Attached from order 43
Contract.doc  Attached from order 44

That way all three can co-exist on an order without folks being
confused as to what they are or which is which.

It seems to me that to support order consolidation we'd still have to
do the same thing (an order-attachment-to-order link table) or else
run into the name-uniqueness constraint, right?
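To make that concrete, here's a rough sketch of such a link table
(table and column names are mine, purely for illustration):

CREATE TABLE order_file_link (
    order_id int not null,      -- the order the file now appears on
    file_id int not null,       -- the stored file's id
    file_name text not null,
    src_order_id int not null,  -- order it was attached to before
                                -- consolidation; equals order_id for
                                -- direct attachments
    PRIMARY KEY (order_id, file_name, src_order_id)
);

With the source order in the key, the three copies of "Contract.doc"
can co-exist on the consolidated order, and the display above falls
straight out of the table.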
>
> You're talking about doing it in the database, which takes us back to some
> structure involving a table for each document class.  The way I was thinking
> about (assuming the routine logic would take care of managing it) is
> probably too similar to how 1.2 would have done it, but your suggestion
> below *seems* at first look to get the job done.

I think we are looking at this from two different angles.  I think you
are asking how to maximally decompose the data for greatest
flexibility.  That's not a bad question, but I am asking how to enforce
constraints using the data definition language so that there is less
question of enforcement or confusion later.  If I understand the
discussion correctly, it's a tradeoff between normalization (to higher
normal forms) and declarative constraint enforcement.

To be clear, the tradeoff between the two approaches is that my
approach is narrower in the sorts of data it accepts up front and its
DDL is quite a bit more complex, which is both a good and a bad thing.

A simpler DDL (in the sense of not partitioning the file table as per
your proposal) provides greater flexibility.  It also reduces the
representation of the data to the bare minimum, allowing more
things to be done with the data at the application level without
changing the schema.  The tradeoff is that we cannot define semantic
constraints on the data through DDL, meaning the tables cannot be
considered a safe API in the future.  If we go with a maximum
decomposition model like yours, I would prefer to change things as
follows:

CREATE TABLE file_storage (
    content bytea,
    mime_type int,
    id serial not null unique,
    PRIMARY KEY (content, mime_type)
);

CREATE TABLE file_attach_base (
    file_name text not null,
    file_class int references file_class(id),
    ref_key int,
    file_id int not null references file_storage(id),
    attach_id serial not null unique,
    PRIMARY KEY (file_name, ref_key)
);

And then inheriting out the file_attach_base table, and creating
secondary linking tables as I previously suggested.  This would allow
breaking the content out into a separate table, but it would probably
make inserts fairly expensive because the unique index on the content
would have to be updated on each insert.  I doubt it would be more
costly than the method you suggest, and the positive side is that
hard links would be supported out of the box because file names would
be separate from contents.
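For example (ids and class values purely illustrative), two orders
could then share one stored copy under the same name:

-- suppose the stored contract came back with file_storage id 7
INSERT INTO file_attach_base (file_name, file_class, ref_key, file_id)
VALUES ('contract.doc', 1, 42, 7);
INSERT INTO file_attach_base (file_name, file_class, ref_key, file_id)
VALUES ('contract.doc', 1, 43, 7);

(In practice these would go into the inherited order table, but the
base table shows the idea.)  Both rows are just names pointing at the
same file_storage row, which is the hard-link behavior.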

It would still require a table inherited from file_attach_base for
handling unattached files, but that isn't hard to do.  Something
like:

CREATE TABLE file_import_source (
   id serial not null unique,
   label text primary key
);

INSERT INTO file_import_source (label) VALUES ('email');
INSERT INTO file_import_source (label)
VALUES ('ledgersmb: unattached file addon');

CREATE TABLE file_unattached (
    FOREIGN KEY (ref_key) REFERENCES file_import_source (id),
    CHECK (file_class = 4) -- or whatever
) INHERITS (file_attach_base);

If we go this route, the tradeoff is flexibility vs performance.  To
insert a file we have to check against the content and MIME type of
all existing files.  With a large number of files, I think that index
would get very, very large.
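As a sketch of that insert path (untested plpgsql; the function name
is mine), the content check would look something like:

CREATE OR REPLACE FUNCTION file_store (in_content bytea, in_mime_type int)
RETURNS int AS
$$
DECLARE t_id int;
BEGIN
    -- the probe against the big (content, mime_type) index
    SELECT id INTO t_id
      FROM file_storage
     WHERE content = in_content AND mime_type = in_mime_type;
    IF NOT FOUND THEN
        INSERT INTO file_storage (content, mime_type)
        VALUES (in_content, in_mime_type)
        RETURNING id INTO t_id;
    END IF;
    RETURN t_id; -- caller attaches this id to the document
END;
$$ LANGUAGE plpgsql;

Every upload pays for that lookup plus, on a miss, the index update.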
>
>> a secondary link there (to be implemented in the future) to the sales
>> invoice, the user could see "contract.doc" linked from "sales order
>> 12345."  There could only be one "contract.doc" linked from that sales
>> order to that invoice though there could be additional contract.doc's
>> from other sales orders onto the same invoice.
>>
>>> When a new file is uploaded, compare its CRC/checksum to the index of
>>> stored files.  If there's no match, it's a new file, gets a new ID, and
>>> we
>>> save it. If the checksum matches, compare the contents to the N files
>>> with
>>> the match.  If one of them matches, create a link to the existing copy,
>>> by
>>> inserting a referential entry in whatever table is tracking where files
>>> are attached. If none of them matches, it's a new file.
>>
>> I am still worried about handling naming collisions in a way that is
>> user-transparent.
>
> Here's what I'm envisioning...
>
> There are two public facing database routines which insert files, and one
> which associates them with documents.
>
> The first, inserts a file, and associates it with a document:
> * If the name of the uploaded file is the same as one associated with this
> document, throw an error.
> * It does the checksumish comparison that I spoke of above, without
> reference to name.  If there's a match, create a link instead of storing; if
> not, store the file.

If we go this route and enforce everything in the procedures, we'd
have to say no third party access to the underlying tables, right?
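For what it's worth, a rough sketch of that first routine, assuming
the DDL above and the file_store() helper I sketched earlier
(untested plpgsql, names hypothetical):

CREATE OR REPLACE FUNCTION file_attach
(in_file_name text, in_content bytea, in_mime_type int,
 in_ref_key int, in_file_class int)
RETURNS int AS
$$
DECLARE t_attach_id int;
BEGIN
    -- rule 1: no duplicate names on the same document
    PERFORM 1 FROM file_attach_base
      WHERE file_name = in_file_name AND ref_key = in_ref_key;
    IF FOUND THEN
        RAISE EXCEPTION '% is already attached to this document',
              in_file_name;
    END IF;
    -- rule 2: store or link by content, then record the attachment
    INSERT INTO file_attach_base
           (file_name, file_class, ref_key, file_id)
    VALUES (in_file_name, in_file_class, in_ref_key,
            file_store(in_content, in_mime_type))
    RETURNING attach_id INTO t_attach_id;
    RETURN t_attach_id;
END;
$$ LANGUAGE plpgsql;

If that's the only supported write path, the tables stop being a
public interface.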

So it seems like we have three options here:

1)  Semantically incomplete DDL constraints, no third party access to
the underlying tables
2)  Semantically complete DDL constraints, maximally decomposed
tables, performance hit, but third party access to underlying tables
3)  Semantically complete DDL constraints, no enforced uniqueness on
file data, third party access to underlying tables

In the long run I am not really sure what the operational tradeoffs
of such decomposition are, regarding on-the-fly remapping and
enforcing uniqueness of data.  My personal preference is to give
access to tables where possible.

Best wishes,
Chris Travers