Re: Proposal for file attachment API



Hi Luke:

I feel this discussion is getting somewhere.


> My reasoning was this:
>
> If content is not unique, or at least probably unique, you will probably end
> up with many copies of a single document being attached to, for example, a
> quote, an order, an invoice, possibly a purchase order, and maybe a payment
> (just for a most extended case).
>
> I thought that was bad, initially, from a data storage perspective. Storing
> files in a database used to be thought a generally bad idea, but if you're
> going to do it, it seems likely to be a good thing if you don't use the
> storage wastefully: it will have performance effects, possibly might have
> effects when doing cleanup/data recovery/repairs, etc..
>
> The second reason for jumping to uniqueness of contents under the multiple
> links system was also in part so that a user could upload a file without
> knowing if it was already on-system.  Instead of being stored, a link would
> be created, and the upload thrown away, if the contents were already in the
> repository.
> As a user myself, I can attest that users are lazy.  It may be easier to
> re-upload ten times than to go hunting for the already uploaded copy ten
> times.
> That doesn't mean the software should have to maintain ten copies.
>
> The third reason why I don't like multiple copies of the same document,
> although this is probably more of an argument for the linking system, is the
> case of contract law.
>
> Hypothetical:
> Manufacturing makes an agreement with a customer, and attaches the contract
> to a quote.
> They email it to Joe in accounting for his approval.
> He suggests some changes by altering the document and sending it back, but
> they tell him to go ahead as-is.
> He puts through an order and invoice to the customer, attaching the file.
> Only what he attaches, accidentally, is his munged version.
> That'll probably become the legally binding version of the agreement.


Ok, so let's break this up into three sets of problems here:

1)  Storage/recovery/cleanup.
2)  Upload of non-unique files-- raises questions; see below.
3)  Contract law questions

I am going to go out on a limb here and say that if we try to solve
the third here, we are going to be here forever.  This is the role of
content management and review systems, and quite frankly, if that is a
concern, the way to deal with it is to use the x-url functionality I
was suggesting and use Drupal + the workflow module (or another CMS/DMS
solution) to ensure that the contract is properly reviewed before
becoming official.  This is a really big area, and I am not at all
sure we want to venture into it, and certainly not at this time.
This is a case where linking to external systems is really a good
idea.  I suspect this is why most high-end systems force users to use
external document management systems for managing files of this sort.

As for storage/recovery/cleanup....  It's worth noting that in both
of these solutions we might suggest (in the future, as a nice-to-have
feature) an ability to delete all files uploaded before a specified
date, perhaps with admin-specified retention rules.  If people need to
store documents for 3 years, we should have a database schema which
supports that kind of selective deletion.

One of the constraints we are dealing with here is PostgreSQL's
handling of antijoins (NOT IN, NOT EXISTS).  Right now performance
there isn't where I would like it to be.  This places an important
constraint on unattached files.  We'd have to be doing everything as
outer joins, and we'd be looking through a lot of data when looking
for orphaned files.  If we don't have a clear way of marking a file as
unattached, looking for unattached files is going to be performance
intensive.  If, OTOH, we want to allow unattached files and delete
them after a week, putting them in a separate, inheriting table means
we are only looking through a week's worth of data.
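
To sketch what I mean (table and column names here are placeholders,
not part of the proposal):

-- Unattached uploads go into their own inheriting child table,
-- stamped with an upload time (file_incoming and uploaded_at are
-- hypothetical names):
CREATE TABLE file_incoming (
    uploaded_at timestamp not null default now()
) INHERITS (file_base);

-- Cleanup is then a cheap scan over at most a week of rows, with no
-- NOT IN / NOT EXISTS against the attachment tables:
DELETE FROM file_incoming
 WHERE uploaded_at < now() - interval '1 week';

An uploaded_at column on the other file tables would support the
retention-rule deletion mentioned above equally cheaply.
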
>
> I imagine iterations on that kind of case, and my thought for how to limit
> it in software, is to let the order Joe creates, link back to the original
> contract, attached to the quote.
>
> That in and of itself does not require unique contents, but it does require
> the linking scheme.  If you're going to do the linking scheme, it seems a
> small step to make it global, which leads to probably unique contents.

Just to be clear, either way we are talking about linking schemes.  We
have to have a linking scheme.  100% agreed there.
>
> (I suppose if you really want to limit that case, we should have file name
> uniqueness at the customer or vendor level, but...oy.)

Actually, that could easily be handled in my proposal, although it would
quickly become a pain for users if any given customer/vendor ended up
with a lot of documents.....

Really if you want to avoid the "wrong contract attached" problem in
software, use an external content management system with proper
review/workflow modules and link in via URL (something we'd support in
both cases).  The caveat here, however, is that the software will only
be as good as the initial business processes it is backing up.
>
> But, let's consider no unique contents, and no linking.

Linking is required, we agree.  The question of uniqueness, however, is
still a subject of discussion.
>
> You may end up with at least three copies of each file on the system.  With
> lots of files, the storage requirements for that are going to get absurd.
>
> I, at least, run LSMB in virtual machines.  I don't always grant them a huge
> amount of space, or a huge amount of memory.
> I conserve where I can, and if I don't have to duplicate every 1-5 meg
> proposal or file, 2-3 times per customer, I don't want to.

It's actually worse than that.  Even where space is not an issue, a
lack of linking poses other problems.  Consider, for example, several
POs consolidated, each one a PO that was printed and shipped to the
company....  You really want to have an audit trail here back to the
original purchase order as it exists in the DB AND the printed PO.  So
I think from an
audit perspective we'd want to know where exactly a document entered
the system.  From there it can be linked to within reason (not sure
there is a valid reason to link a CSV file imported to a
Reconciliation report to an invoice, but this is a business logic
issue which can be addressed one way or another in either approach).
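
Just to illustrate the sort of audit question this should make cheap
to answer -- assuming, per the sketches further down, that each file
row records its original entry point in ref_key/file_class (file_name
and the fc.class label column are illustrative guesses, not settled
names):

-- Where did file 12345 enter the system?
SELECT f.file_name, f.ref_key, fc.class
  FROM file_base f
  JOIN file_class fc ON fc.id = f.file_class
 WHERE f.id = 12345;
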
>
> Consider a company which attaches a standard contract to every order, or a
> standard SomethingOrOther.
> This probably shouldn't be used for that, but there's a good chance it will
> be.

Ok.  So suppose as an add-on at some point we allow a standard
document module, and allow these to be linked to customer/vendor
facing objects (i.e. transactions, orders, and maybe eventually
customer/vendor records).  Here, you could set up categories of
standard documents, and provide a user interface for pulling them up
and linking them in all without the user uploading the file again.

So in this case, we'd have to define another couple of tables:

CREATE TABLE standard_document_category (
    id serial not null unique,  -- surrogate key for foreign keys
    label text primary key      -- human-readable category name
);

insert into standard_document_category
values (1, 'service contract'), (2, 'nda'), (3, 'privacy policy');
insert into file_class values (24, 'Standard documents');

CREATE TABLE file_standard (
    foreign key (ref_key) references standard_document_category(id),
    check (file_class = 24)  -- rows in this subclass use the new class
) INHERITS (file_base);

CREATE TABLE file_standard_to_tx (
    foreign key (ref_key) references transactions(id)
    -- other constraints
) INHERITS (file_secondary_attachment);

-- same for orders and for entity_credit_accounts
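
For reference, the above assumes parent tables along roughly these
lines.  Only ref_key and file_class are pinned down by the child
definitions; the other columns are placeholders, not settled:

CREATE TABLE file_base (
    id serial not null unique,
    content bytea not null,       -- placeholder storage column
    mime_type text not null,      -- e.g. 'application/pdf', 'text/x-url'
    file_name text not null,      -- placeholder
    uploaded_at timestamp not null default now(),  -- placeholder
    ref_key int not null,         -- primary attachment point
    file_class int not null       -- what kind of object ref_key names
);

CREATE TABLE file_secondary_attachment (
    file_id int not null,         -- the file being re-linked (placeholder)
    ref_key int not null,         -- the object it is linked to
    file_class int not null
);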

Now these files can be made available to the standard API with no
changes, and the only remaining work is UI changes (allowing for the
searches, etc.).

We could even give standard_document_category a self-join and thus
provide something like file paths.
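
For instance (parent_id and the query below are illustrative only, and
WITH RECURSIVE needs PostgreSQL 8.4 or later):

ALTER TABLE standard_document_category
  ADD COLUMN parent_id int references standard_document_category(id);

-- Build '/contracts/nda'-style paths from the self-join:
WITH RECURSIVE category_path AS (
    SELECT id, label, '/' || label AS path
      FROM standard_document_category
     WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.label, p.path || '/' || c.label
      FROM standard_document_category c
      JOIN category_path p ON c.parent_id = p.id
)
SELECT id, path FROM category_path;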

If I may, it seems the only real difference between our proposals at
present (as amended to include secondary links) is whether file
content is a) presumed to be unique and b) separated from its primary
attachment point.  As far as linking capabilities, I don't see a major
difference in capabilities there.
>
> (Realistically, if that was the system in use, I would avoid passing the
> attachment along the accounting chain like that--I'd put it in the first
> document (quote or order), and refer back to it when I had to.  But I'm not
> everyone.)
>
> N.B. File systems do not require this kind of uniqueness, but the ones which
> assume a level of intelligence in their users do make it possible, via
> various kinds of links.

I am suggesting two kinds of links:
1)  Secondary mappings of files from one object to another, and
2)  Linking to external systems (or even internet resources) via
text/x-url handling.
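
For (2), an external link would just be a tiny file whose content is
the URL.  In the sketch below, file_transaction is a hypothetical
per-class child table, and the URL and ids are made up:

INSERT INTO file_transaction
       (content, mime_type, file_name, ref_key, file_class)
VALUES ('http://dms.example.com/contract-1234',
        'text/x-url', 'contract-1234.url', 10001, 1);

The UI would then follow the stored URL rather than serving bytes.
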
>
> If you're really very uncomfortable with it, I'm certainly not going to
> insist upon it, but I do think it makes for a better system in the long run,
> if we try to minimize the number of copies of files as much as possible, but
> maximize the number of documents they can be attached to.
> It's the virtual names (i.e. multiple linking) I most wanted, and the
> uniqueness of contents was a by-product idea that seemed good in retrospect
> from a storage perspective.

100% agreed on the need for linking.  Maybe I should send a second
proposal for review and discussion.
>
>  I do like your source document reference plan though.
>
> I suppose I am viewing files as their own documents in all this, attachable
> to anything that supports it.
>
>> wondering if the relational model is really well suited for this
>
> To my mind, it's the only model that's perfect for it, perhaps sans the
> primary key issue.
>
> The first virtual FS I ever worked with was a PostgreSQL-backed one,
> although it used non-DB storage.
> In it, any number of paths and names could point to each file.
>
>> problem.  After all we are talking about a huge natural primary key so
>> however we go about enforcing that, there will be a substantial
>> performance cost.
>
> I really wish we could find a way not to use that primary key, or to derive
> a unique short form, so we don't have that problem.
> You have a point about checksums, but there ought to be a way to fingerprint
> a file and do comparisons on that basis.

Thinking about information loss here....  If you have a short form
which contains less information than the original file, then you will
necessarily have false positives.
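
That said, the usual compromise is to use the fingerprint only to find
candidates and then confirm byte-for-byte.  md5() is built into
PostgreSQL; the index and query below are just a sketch against the
assumed file_base columns:

-- Index the fingerprint, not the whole content:
CREATE INDEX file_base_md5_idx ON file_base (md5(content));

-- On upload, check for an existing copy ($1 = the new file's bytes):
SELECT id
  FROM file_base
 WHERE md5(content) = md5($1)   -- cheap, indexed candidate match
   AND content = $1;            -- byte-for-byte confirmation

So a short form can narrow the search; it just can't serve as the
primary key by itself.
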
>
>>> I agree with that, but if we have to...  Do we have other tables in 1.3
>>> where that's the case?
>>
>> Yes.
>>
>> note (abstract table, no access)
>> entity_note (notes for companies and people)
>> eca_note (notes for customer/vendor credit agreements).
>>
>> All could conceivably be queried together with:
>> select * from note, but to insert you must insert into the appropriate
>> subclassed table.
>>
>> There is no attempt to enforce uniqueness of note contents.
>
> Haha!
>
> (Although, if notes were likely to be several hundred K to a few meg each,
> someone would probably suggest it.:))

I think there are two fundamental desires which are necessarily in tension:

1)  storage is more efficient with greater normalization, and
presentation is more flexible;
2)  the richness of possible data constraints goes down with greater
normalization, because constraining across multiple relations is
generally very costly (see the sketch below).
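
To make (2) concrete: per-object file name uniqueness is a one-line
constraint while the name and the attachment point share a relation
(e.g. unique (ref_key, file_name, file_class)), but once secondary
links live in their own relation, the same rule takes trigger code.
The sketch below is illustrative, not proposed:

CREATE OR REPLACE FUNCTION no_duplicate_link_names() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Reject a new link if the target object already has a linked
    -- file with the same name:
    PERFORM 1
       FROM file_secondary_attachment l
       JOIN file_base f ON f.id = l.file_id
      WHERE l.ref_key = NEW.ref_key
        AND f.file_name = (SELECT file_name FROM file_base
                            WHERE id = NEW.file_id);
    IF FOUND THEN
        RAISE EXCEPTION 'duplicate file name for this object';
    END IF;
    RETURN NEW;
END;
$$;

-- With table inheritance the trigger has to be repeated on each child
-- table, since parent-table triggers do not fire for child inserts:
CREATE TRIGGER no_duplicate_link_names
BEFORE INSERT ON file_standard_to_tx
FOR EACH ROW EXECUTE PROCEDURE no_duplicate_link_names();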

My sense is that this discussion has moved far enough from the
original proposal that I would like to submit a new one.  I will send
an amended proposal to the list shortly.

Best Wishes,
Chris Travers