Re: Proposal for file attachment API
- Subject: Re: Proposal for file attachment API
- From: Chris Travers <..hidden..>
- Date: Tue, 5 Jul 2011 00:04:17 -0700
On Mon, Jul 4, 2011 at 9:05 PM, Luke <..hidden..> wrote:
> Seems reasonable enough. So instead of being unique to the filename, you'd
> be unique to the filename and the source attachment point?
>
> The problem I see there, is what happens if order 42 is deleted after the
> consolidation?
> To prevent that, you probably have to snapshot the data, which is not
> elegant at all..
Or say "order can be closed but not deleted." which would be the
result of what I am suggesting at the moment.
>
>> That way all three can co-exist on an order without folks being
>> confused as to what they are or which is which.
>>
>> It seems to me that to support order consolidation we'd still have to
>> do the same thing (an order-attachment to order) table or else run
>> into the conflict of names constraint, right?
>
> It is seeming like that.
>
> Either that, or during a consolidation or addition operation, you do
> automatic pre-extension numbering.
>
> I.E. The file from order 42 is contract1.doc, the one from order 43 becomes
> contract2.doc, and the one from order 44 becomes contract3.doc.
>
> Or OrdNumber-contract.doc, so: 42-contract.doc, 43-contract.doc,
> 44-contract.doc, which is probably better.
>
> The user can possibly rename them before saving?
If we are going to be doing this, I think that allowing deletion of
objects to which files are originally attached is a bad idea. If the
original order still exists in the system, even if it isn't open, one
can still go back and review what was entered there. Otherwise you
essentially have no audit trail.
Indeed, I think the way we should be moving is that: 1) no referenced
data should ever be deleted, and 2) no approved financial data should
ever be deleted.
In practical terms this means:
1) One cannot delete an order with a file that is secondarily mapped
elsewhere, and
2) One cannot delete a posted invoice under any circumstances.
3) If a pending reconciliation report (including an attached bank
statement) is deleted, the sproc can handle that deletion, or the DDL
can be set up such that the deletion occurs automatically. However, an
approved reconciliation report must NEVER be deleted under any
circumstances.
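To make (3) concrete, here is a minimal trigger sketch of the "approved
reports can never be deleted" rule. The table and column names
(cr_report, approved) are placeholders, not settled schema:

CREATE OR REPLACE FUNCTION cr_report_block_delete() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF OLD.approved THEN
        RAISE EXCEPTION 'Cannot delete an approved reconciliation report';
    END IF;
    RETURN OLD;  -- pending reports may still be deleted
END;
$$;

CREATE TRIGGER cr_report_no_delete
    BEFORE DELETE ON cr_report
    FOR EACH ROW EXECUTE PROCEDURE cr_report_block_delete();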
> Yes.
>
>> To be clear the tradeoff between the two approaches is that my
>> approach is narrower in the sorts of data it accepts up front, and the
>> DDL is quite a bit more complex, and this is both a good and a bad
>> thing.
>
> I have no objection to the DDL complexity, if the parts I want flexible can
> be made so.:)
> I still envision most of this being black-boxed into API calls or database
> routines, from the third party access perspective.
Ok, then let's talk about the specific, non-standard PostgreSQL
feature I am proposing to rely on: table inheritance.
PostgreSQL allows inheritance on tables and is actually pretty good at
optimizing queries across inheritance trees. What this means is that
a table can be logically partitioned for performance reasons or (as we
do in 1.3) to provide a consistent API across related portions of the
application by allowing different partitions to have different foreign
key constraints. The biggest disadvantage is that primary keys and
unique constraints cannot be enforced across the entire inheritance
tree.
Here's a simple example:
CREATE TABLE city (
    id serial,
    city_name text,
    state text,
    country_id int,
    PRIMARY KEY (city_name, state, country_id)
);

CREATE TABLE national_capital (UNIQUE (country_id)) INHERITS (city);

INSERT INTO city (city_name, state, country_id)
VALUES ('Detroit', 'MI', 1), ('Seattle', 'WA', 1), ('Jogjakarta', 'Central Java', 2);

INSERT INTO national_capital (city_name, state, country_id)
VALUES ('Jakarta', 'West Java', 2);
test=# select * from city;
id | city_name | state | country_id
----+------------+--------------+------------
1 | Detroit | MI | 1
2 | Seattle | WA | 1
3 | Jogjakarta | Central Java | 2
4 | Jakarta | West Java | 2
(4 rows)
You don't have to search the entire inheritance tree, though. You can
turn off the recursion as follows:
test=# select * from only city;
id | city_name | state | country_id
----+------------+--------------+------------
1 | Detroit | MI | 1
2 | Seattle | WA | 1
3 | Jogjakarta | Central Java | 2
(3 rows)
But if you also insert Jakarta into the base table, the record appears
twice:
INSERT INTO city (city_name, state, country_id)
VALUES ('Jakarta', 'West Java', 2);
test=# select * from city;
id | city_name | state | country_id
----+------------+--------------+------------
1 | Detroit | MI | 1
2 | Seattle | WA | 1
3 | Jogjakarta | Central Java | 2
5 | Jakarta | West Java | 2
4 | Jakarta | West Java | 2
(5 rows)
The other limitation of doing things in this way is that foreign keys
cannot reach down the inheritance tree. This typically means that
tables which must reference an inherited tree of tables must either
reference a specific table in that tree or else must be partitioned
themselves through the same process.
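To make that concrete with the city example above (weather_report is
purely illustrative):

CREATE TABLE weather_report (
    city_name text,
    state text,
    country_id int,
    report text,
    FOREIGN KEY (city_name, state, country_id)
        REFERENCES city (city_name, state, country_id)
);

A weather_report row for a capital that exists only in national_capital
would be rejected, because the foreign key check looks at the parent
table alone and never descends into the inheriting tables.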
Requiring that the base table is never inserted into, that only the
sub-tables hold data, and that the data in those sub-tables is mutually
exclusive avoids this problem, though.
Consequently, in my initial proposal, the file base table can never
have inserts on it. Hence no write access to that table. The table
itself functions as a view which is essentially dynamically extended
through the use of inheriting tables. We do this elsewhere with an
abstract notes table (which never has records in it) and inherited
tables which store, for example, notes on companies and credit
accounts.
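In concrete terms, "no write access" is mostly a matter of what gets
granted. Roughly, with illustrative role and table names (file_order
standing in for one of the inheriting tables):

GRANT SELECT ON file_attach_base TO file_read_role;
-- write access is granted only on the inheriting tables:
GRANT SELECT, INSERT ON file_order TO file_write_role;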
Now what this means is that any other file mappings can be added as
needed simply by inheriting from the base tables and adding the
relevant file classes. This makes no difference to the API, and it
allows a single field in the base table to reference ids in any number
of other tables while having referential integrity properly enforced.
Consequently, if an add-on (maybe for job costing?) wants to add:
CREATE TABLE file_orderitem (
    FOREIGN KEY (ref_key) REFERENCES orderitems (id),
    CHECK (file_class = 10)
) INHERITS (file_attach_base); -- or whatever we end up calling the base table
This gets automatically added to the view, and no changes to the API
are required to retrieve the file or list the files attached to the
line item of the order. In essence the table gets plugged in.
Inheritance is extremely flexible in this way. Where it is not quite as
good is when it comes to the secondary mappings. There's complexity
there which there is no denying, but that's in part because it gives
us a fair bit of flexibility in what we can enforce (for example, can
a file uploaded to a reconciliation report be attached to an invoice
or an order? We can say yes or no by supplying, or not supplying, the
inheriting tables).
>
>> A simpler DDL (in the sense of not partitioning the file table as per
>> your proposal) provides greater flexibility. It also reduces the
>> representation of the data to the bare minimum therefore allowing more
>> things to be done with the data on an application level without
>> changing the schema. The tradeoff is that we cannot define semantic
>> constraints on the data through DDL, meaning that the tables cannot be
>> considered to be a safe API in the future. If we go with a maximum
>> decomposition model like yours, I would prefer to change things as
>> follows:
>>
>> CREATE TABLE file_storage (
>> content bytea,
>> mime_type int,
>> id serial not null unique,
>> PRIMARY KEY (content, mime_type),
>
> Really? Primary key on a blob? Didn't even know you could do that.
> However, why? Why do it on content, instead of generating a checksum or
> similar up front, and keying on that?
Checksums are not guaranteed to be unique. But you can index ANYTHING
in PostgreSQL, it turns out... and if content is supposed to be
unique, it is a natural primary key.
It is the only real way to enforce content uniqueness in DDL.
Otherwise we are talking about triggers which first check uniqueness
against md5() (since that's what's supplied in PostgreSQL by default)
and then check that the content doesn't match on the selected rows.
One can always go that route, but I think the DDL data constraints are
clearer, and if one has to move away from them and simulate them in
triggers, it is usually helpful to have a sketch in DDL first.
> Yes, I like that, and I like the design so far I think, except for that
> contents primary key.
> Please clarify why we have to do that.
>
> Having a checksum run on uploads, and then indexing that, is going to be
> slower you think?
It means you have to have a trigger as a custom procedure to enforce
uniqueness on content. If we go that route, it's best to sketch out
what is valid data purely in DDL first.
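To sketch what that trigger route might look like (illustrative names,
against the file_storage draft above with content no longer in the
primary key, and assuming a PostgreSQL new enough to have md5() on
bytea):

-- a non-unique index on the digest so the trigger's lookup stays cheap
CREATE INDEX file_storage_md5_idx ON file_storage (md5(content));

CREATE OR REPLACE FUNCTION file_storage_check_dup() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- compare full content only where the digest already matches
    -- (ignoring concurrent inserts for the purposes of the sketch)
    IF EXISTS (SELECT 1
                 FROM file_storage
                WHERE md5(content) = md5(NEW.content)
                  AND content = NEW.content
                  AND mime_type = NEW.mime_type) THEN
        RAISE EXCEPTION 'Duplicate file content';
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER file_storage_no_dup
    BEFORE INSERT ON file_storage
    FOR EACH ROW EXECUTE PROCEDURE file_storage_check_dup();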
>
> Using the cksum utility on a 100 MB file of random data (dd'ed from
> urandom), on a 1.6GHZ laptop:
> real 5.62
> user 0.47
> sys 1.95
>
> Given post limits and such, most uploads are going to be 1-2% of that size.
>
> Is it just that indexing the content absolutely ensures no duplicates,
> whereas checksumming it only probably does?
If you uniquely index on the checksum, you do indeed ensure no
duplicates. But you also ensure nothing with a colliding checksum can
be uploaded.
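In DDL terms the checksum-only route is just one functional unique
index (same md5-on-bytea caveat as the trigger sketch above):

-- duplicates are rejected, but so is any file whose md5 happens to
-- collide with an existing one
CREATE UNIQUE INDEX file_storage_content_md5_key
    ON file_storage (md5(content), mime_type);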
>
>> It would still require a table inherited from file_attach_base for
>> handling unattached invoices but that isn't hard to do. Something
>> like:
>>
>> CREATE TABLE file_import_source (
>> id serial not null unique,
>> label text primary key
>> );
>>
>> insert into file_import_source (label) values ('email');
>> insert into file_import_source (label) values ('ledgersmb:
>> unattached file addon');
>>
>> CREATE TABLE file_unattached (
>>     FOREIGN KEY (ref_key) REFERENCES file_import_source (id),
>>     CHECK (file_class = 4) -- or whatever
>> ) INHERITS (file_attach_base);
>
> What would happen if a file becomes unattached by virtue of document
> deletion, or intentional dis-association?
> Triggers?
By default, unless we specify otherwise, the delete would fail. We
CAN specify that the mapping goes away too, but I'd prefer not to
allow deletion of referenced data unless we discuss the implications
first.
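For reference, that's just the default foreign key action versus an
explicit ON DELETE clause. Table and column names here are illustrative
only, and you would pick one of the two:

-- default (NO ACTION): deleting the referenced document fails
ALTER TABLE file_secondary_attachment
    ADD FOREIGN KEY (ref_key) REFERENCES oe (id);

-- the opt-in alternative: the mapping row goes away with the document
ALTER TABLE file_secondary_attachment
    ADD FOREIGN KEY (ref_key) REFERENCES oe (id) ON DELETE CASCADE;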
>
>> If we go this route, the tradeoff is flexibility vs performance. To
>> insert a file we have to check against the content and MIME type of
>> all existing files. With a large number of files, I think that index
>> would get very, very large.
>
> Agreed, which is why I suggest checksums.
>
>>>> a secondary link there (to be implemented in the future) to the sales
>>>> invoice, the user could see "contract.doc" linked from "sales order
>>>> 12345." There could only be one "contract.doc" linked from that sales
>>>> order to that invoice though there could be additional contract.doc's
>>>> from other sales orders onto the same invoice.
>>>>
>>>>> When a new file is uploaded, compare its CRC/checksum to the index of
>>>>> stored files. If there's no match, it's a new file, gets a new ID, and
>>>>> we
>>>>> save it. If the checksum matches, compare the contents to the N files
>>>>> with
>>>>> the match. If one of them matches, create a link to the existing copy,
>>>>> by
>>>>> inserting a referential entry in whatever table is tracking where files
>>>>> are attached. If none of them matches, it's a new file.
>>>>
>>>> I am still worried about handling naming collisions in a way that is
>>>> user-transparent.
>>>
>>> Here's what I'm envisioning...
>>>
>>> There are two public facing database routines which insert files, and one
>>> which associates them with documents.
>>>
>>> The first, inserts a file, and associates it with a document:
>>> * If the name of the uploaded file is the same as one associated with
>>> this
>>> document, throw an error.
>>> * It does the checksumish comparison that I spoke of above, without
>>> reference to name. If there's a match, create a link instead of storing;
>>> if
>>> not, store the file.
>>
>> If we go this route and enforce everything in the procedures, we'd
>> have to say no third party access to the underlying tables, right?
>
> Yes. I was assuming that all file access would be handled through the
> procedures, or by views.
>
>> So it seems like we have three options here:
>>
>> 1) Semantically incomplete DDL constraints, no third party access to
>> the underlying tables
>> 2) Semantically complete DDL constraints, maximally decomposed
>> tables, performance hit, but third party access to underlying tables
>
> Isn't your arrangement above, and in the previous message (secondary
> linkages), leading us toward option 2, but with option 1's private table
> access?
>
>> 3) Semantically complete DDL constraints, no enforced uniqueness on
>> file data, third party access to underlying tables
>
> #2 seems best to me, if it can be done. I would like to preserve uniqueness
> of files as much as possible for several reasons which I have discussed at
> length. That means that files should be able to have multiple names, and
> should be linkable to multiple documents of various types.
> If we can do all that in the database without relying on procedures to pull
> it all together, then splendid.
I guess there is one thing that's bothering me about this discussion,
and it's worth bringing up. I am not aware of a single filesystem
which attempts to enforce uniqueness on file data. I would think if
it was a significant problem, it would have been tackled there first.
Instead there is usually some uniqueness constraint on, if you will,
attachment point (path as currently mounted, for example). I am
wondering whether the relational model is really well suited to this
problem. After all, we are talking about a huge natural primary key,
so however we go about enforcing it, there will be a substantial
performance cost.
>
>> Long run I am not really sure what the operational tradeoffs of such
>> decomposition are, regarding remapping on the fly and enforcing
>> uniqueness of data. My personal preference is to give access to
>> tables where possible.
>
> I agree with that, but if we have to... Do we have other tables in 1.3
> where that's the case?
Yes.
note (abstract table, no access)
entity_note (notes for companies and people)
eca_note (notes for customer/vendor credit agreements).
All could conceivably be queried together with "select * from note",
but to insert you must insert into the appropriate subclassed table.
There is no attempt to enforce uniqueness of note contents.
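So, for example (column names abbreviated for illustration):

-- inserts go to the subclassed tables
INSERT INTO entity_note (ref_key, note) VALUES (123, 'called about the order');
INSERT INTO eca_note (ref_key, note) VALUES (456, 'credit limit raised');

-- but one query reads across the whole tree
SELECT note FROM note WHERE ref_key IN (123, 456);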
Best Wishes,
Chris Travers