
Re: Proposal for file attachment API



On Tue, 5 Jul 2011, Chris Travers wrote:

I guess there is one thing that's bothering me about this discussion,
The thing that bothers me (slightly) is that you and I seem to be the only 
people with opinions on this.
and it's worth bringing up.  I am not aware of a single filesystem
which attempts to enforce uniqueness on file data.  I would think if
it was a significant problem, it would have been tackled there first.
My reasoning was this:

If content is not unique, or at least probably unique, you will probably end up with many copies of a single document being attached to, for example, a quote, an order, an invoice, possibly a purchase order, and maybe a payment (to take the most extended case).
I thought that was bad, initially, from a data storage perspective. 
Storing files in a database used to be thought a generally bad idea, but 
if you're going to do it, it seems likely to be a good thing if you don't 
use the storage wastefully: waste will have performance effects, and 
possibly effects when doing cleanup/data recovery/repairs, etc.
The second reason for jumping to uniqueness of contents under the multiple 
links system was, in part, so that a user could upload a file without 
knowing if it was already on-system.  If the contents were already in the 
repository, a link would be created instead of a new stored copy, and the 
upload thrown away.
As one, I can attest that users are lazy.  It may be easier to re-upload 
ten times, than to go hunting the already uploaded copy ten times.
That doesn't mean the software should have to maintain ten copies.
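
A minimal sketch of that upload path, purely for illustration.  The
AttachmentStore class, its method names, and the use of SHA-256 as the
content key are all assumptions, not anything in LSMB:

```python
import hashlib


class AttachmentStore:
    """Hypothetical in-memory store: one copy per distinct content, many links."""

    def __init__(self):
        self.blobs = {}  # SHA-256 hex digest -> file bytes (stored once)
        self.links = []  # (document_ref, file_name, digest), one per attachment

    def attach(self, document_ref, file_name, data):
        """Link data to a document; return True only if new bytes were stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        # Either way the user gets an attachment; a repeat upload is discarded.
        self.links.append((document_ref, file_name, digest))
        return is_new


store = AttachmentStore()
contract = b"standard contract text"
store.attach("quote-1", "contract.pdf", contract)
store.attach("order-1", "contract-copy.pdf", contract)  # lazy re-upload
store.attach("invoice-1", "contract.pdf", contract)     # and again
print(len(store.blobs), "stored copy,", len(store.links), "links")
```

The user never has to go hunting for the earlier upload; the repeat
uploads simply become links.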

The third reason why I don't like multiple copies of the same document, although this is probably more of an argument for the linking system, is the case of contract law.
Hypothetical:
Manufacturing makes an agreement with a customer, and attaches the contract to a quote.
They email it to Joe in accounting for his approval.
He suggests some changes by altering the document and sending it back, but they tell him to go ahead as-is.
He puts through an order and invoice to the customer, attaching the file.
Only, what he accidentally attaches is his munged version.
That'll probably become the legally binding version of the agreement.

I imagine iterations on that kind of case, and my thought for how to limit it in software is to let the order Joe creates link back to the original contract, attached to the quote.
That in and of itself does not require unique contents, but it does 
require the linking scheme.  If you're going to do the linking scheme, it 
seems a small step to make it global, which leads to probably unique 
contents.
(I suppose if you really want to limit that case, we should have file 
name uniqueness at the customer or vendor level, but...oy.)
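
The linking scheme on its own, without any uniqueness constraint, might be
sketched like this.  The table and column names (file, file_link, doc_type,
doc_id) are hypothetical, and SQLite is used only so the example is
self-contained; it is not the proposed LSMB schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE file (
        id      INTEGER PRIMARY KEY,
        content BLOB NOT NULL
    );
    -- One row per (document, name) pointing at a stored file.
    CREATE TABLE file_link (
        file_id   INTEGER NOT NULL REFERENCES file(id),
        doc_type  TEXT    NOT NULL,
        doc_id    INTEGER NOT NULL,
        file_name TEXT    NOT NULL,
        PRIMARY KEY (doc_type, doc_id, file_name)
    );
""")

# The contract is stored once, when it is attached to the quote.
cur = db.execute("INSERT INTO file (content) VALUES (?)", (b"original contract",))
contract_id = cur.lastrowid
db.execute(
    "INSERT INTO file_link (file_id, doc_type, doc_id, file_name) VALUES (?, ?, ?, ?)",
    (contract_id, "quote", 1, "contract.pdf"),
)

# Joe's order links back to the same stored file instead of attaching his copy.
db.execute(
    "INSERT INTO file_link (file_id, doc_type, doc_id, file_name) VALUES (?, ?, ?, ?)",
    (contract_id, "order", 7, "contract.pdf"),
)

rows = db.execute(
    "SELECT f.content FROM file_link l JOIN file f ON f.id = l.file_id "
    "WHERE l.doc_type = 'order' AND l.doc_id = 7"
).fetchall()
```

Anything fetched through the order is, by construction, the document that
was attached to the quote.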
But, let's consider no unique contents, and no linking.

You may end up with at least three copies of each file on the system. With lots of files, the storage requirements for that are going to get absurd.
I, at least, run LSMB in virtual machines.  I don't always grant them a 
huge amount of space, or a huge amount of memory.
I conserve where I can, and if I don't have to duplicate every 1-5 meg 
proposal or file, 2-3 times per customer, I don't want to.
Consider a company which attaches a standard contract to every order, or a 
standard SomethingOrOther.
This probably shouldn't be used for that, but there's a good chance it 
will be.
(Realistically, if that was the system in use, I would avoid passing 
the attachment along the accounting chain like that--I'd put it in the 
first document (quote or order), and refer back to it when I had to.  But 
I'm not everyone.)
N.B. File systems do not require this kind of uniqueness, but the ones 
which assume a level of intelligence in their users, do make it possible, 
via various kinds of links.
If you're really very uncomfortable with it, I'm certainly not going to 
insist upon it, but I do think it makes for a better system in the long 
run, if we try to minimize the number of copies of files as much as 
possible, but maximize the number of documents they can be attached to.
It's the virtual names (i.e. multiple linking) I most wanted, and the 
uniqueness of contents was a by-product idea that seemed good in 
retrospect from a storage perspective.
I do like your source document reference plan though.

I suppose I am viewing files as their own documents in all this, attachable to anything that supports it.
wondering if the relational model is really well suited for this
To my mind, it's the only model that's perfect for it, perhaps sans the 
primary key issue.
The first virtual FS I ever worked with was a PostgreSQL-backed one, 
although it used non-DB storage.
In it, any number of paths and names could point to each file.

problem.  After all we are talking about a huge natural primary key so
however we go about enforcing that, there will be a substantial
performance cost.
I really wish we could find a way not to use that primary key, or to 
derive a unique short form, so we don't have that problem.
You have a point about checksums, but there ought to be a way to 
fingerprint a file and do comparisons on that basis.
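
One way that could look, as a sketch only: key on a fixed-length digest,
and fall back to a byte-for-byte comparison when two digests match, so a
collision can never merge two different files.  SHA-256 and the function
names here are my assumptions:

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Fixed-length short form of the contents (64 hex characters)."""
    return hashlib.sha256(data).hexdigest()


def same_file(stored: bytes, stored_fp: str, upload: bytes) -> bool:
    """Compare fingerprints first; read the full contents only on a match."""
    if fingerprint(upload) != stored_fp:
        return False  # cheap rejection, the common case
    return stored == upload  # guards against a fingerprint collision


small = b"one page"
large = b"x" * (5 * 1024 * 1024)  # roughly a 5 meg proposal

# The key stays the same size however large the file is.
print(len(fingerprint(small)), len(fingerprint(large)))
```

The unique index would then sit on the short fingerprint column rather
than on the file data itself.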
I agree with that, but if we have to...  Do we have other tables in 1.3
where that's the case?
Yes.

note (abstract table, no access)
entity_note (notes for companies and people)
eca_note (notes for customer/vendor credit agreements).

All could conceivably be queried together with:
select * from note, but to insert you must insert into the appropriate
subclassed table.

There is no attempt to enforce uniqueness of note contents.
Haha!

(Although, if notes were likely to be several hundred K to a few meg each, someone would probably suggest it.:))
Luke