
Re: Proposal for file attachment API

On Tue, 5 Jul 2011, Chris Travers wrote:

I guess there is one thing that's bothering me about this discussion,

The thing that bothers me (slightly) is that you and I seem to be the only people with opinions on this.

and it's worth bringing up.  I am not aware of a single filesystem
which attempts to enforce uniqueness on file data.  I would think if
it was a significant problem, it would have been tackled there first.

My reasoning was this:

If content is not unique, or at least probably unique, you will probably end up with many copies of a single document attached to, for example, a quote, an order, an invoice, possibly a purchase order, and maybe a payment (to take the most extended case).

I thought that was bad, initially, from a data storage perspective. Storing files in a database used to be thought a generally bad idea, but if you're going to do it, it seems a good thing not to use the storage wastefully: waste will have performance effects, and might also have effects when doing cleanup, data recovery, repairs, etc.

The second reason for jumping to uniqueness of contents under the multiple links system was, in part, so that a user could upload a file without knowing whether it was already on the system. If the contents were already in the repository, a link would be created and the upload thrown away instead of being stored. As a user myself, I can attest that users are lazy. It may be easier to re-upload ten times than to go hunting for the already uploaded copy ten times.
That doesn't mean the software should have to maintain ten copies.
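
A rough sketch of that upload behaviour, to make it concrete. Everything here is hypothetical (the `AttachmentStore` class, its method names, the use of SHA-256 as the identity of the contents), and it is an in-memory toy rather than anything from LSMB:

```python
import hashlib


class AttachmentStore:
    """Toy in-memory store: one copy per distinct content, many links."""

    def __init__(self):
        self.blobs = {}  # digest -> file bytes, stored once
        self.links = []  # (document_ref, file_name, digest)

    def attach(self, document_ref, file_name, data):
        """Link data to a document; return True if the bytes were newly stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data  # first upload: keep the contents
        # Re-uploads fall through to here: only a link is recorded.
        self.links.append((document_ref, file_name, digest))
        return is_new


store = AttachmentStore()
contract = b"standard contract text"
first = store.attach("quote-1", "contract.pdf", contract)
second = store.attach("order-1", "contract-copy.pdf", contract)
```

The user re-uploads as often as they like, under whatever name they like, and the store still holds one copy. Treating equal digests as equal contents is an assumption of the sketch.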

The third reason why I don't like multiple copies of the same document, although this is probably more of an argument for the linking system, is the case of contract law.

Manufacturing makes an agreement with a customer, and attaches the contract to a quote.
They email it to Joe in accounting for his approval.
He suggests some changes by altering the document and sending it back, but they tell him to go ahead as-is.
He puts through an order and invoice to the customer, attaching the file.
Except that what he attaches is, accidentally, his munged version.
That'll probably become the legally binding version of the agreement.

I imagine iterations on that kind of case, and my thought for how to limit it in software is to let the order Joe creates link back to the original contract attached to the quote.

That in and of itself does not require unique contents, but it does require the linking scheme. If you're going to do the linking scheme, it seems a small step to make it global, which leads to probably unique contents.
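
For illustration, the linking scheme might look something like the following. This is only a sketch: it uses SQLite from Python as a stand-in for PostgreSQL, and the table and column names (`file_content`, `file_link`, and so on) are invented for the example, not a proposed LSMB schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_content (
        id INTEGER PRIMARY KEY,
        content BLOB NOT NULL
    );
    CREATE TABLE file_link (
        document_type TEXT NOT NULL,   -- e.g. 'quote', 'order', 'invoice'
        document_id INTEGER NOT NULL,
        file_name TEXT NOT NULL,
        file_id INTEGER NOT NULL REFERENCES file_content(id),
        PRIMARY KEY (document_type, document_id, file_name)
    );
""")

# The contract is stored once, when it is attached to the quote.
cur = conn.execute(
    "INSERT INTO file_content (content) VALUES (?)", (b"signed contract",)
)
contract_id = cur.lastrowid
conn.execute(
    "INSERT INTO file_link VALUES ('quote', 1, 'contract.pdf', ?)", (contract_id,)
)
# Joe's order links back to the same stored file instead of re-attaching it.
conn.execute(
    "INSERT INTO file_link VALUES ('order', 7, 'contract.pdf', ?)", (contract_id,)
)

linked = conn.execute(
    "SELECT document_type, document_id FROM file_link "
    "WHERE file_id = ? ORDER BY document_type",
    (contract_id,),
).fetchall()
stored_copies = conn.execute("SELECT COUNT(*) FROM file_content").fetchone()[0]
```

Nothing in this schema forces the contents to be unique; it only makes it possible for several documents to point at one stored file.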

(I suppose if you really want to limit that case, we should have file name uniqueness at the customer or vendor level, but...oy.)

But, let's consider no unique contents, and no linking.

You may end up with at least three copies of each file on the system. With lots of files, the storage requirements for that are going to get absurd.

I, at least, run LSMB in virtual machines. I don't always grant them a huge amount of space, or a huge amount of memory. I conserve where I can, and if I don't have to duplicate every 1-5 meg proposal or file, 2-3 times per customer, I don't want to.

Consider a company which attaches a standard contract to every order, or a standard SomethingOrOther. This probably shouldn't be used for that, but there's a good chance it will be.

(Realistically, if that was the system in use, I would avoid passing the attachment along the accounting chain like that--I'd put it in the first document (quote or order), and refer back to it when I had to. But I'm not everyone.)

N.B. File systems do not require this kind of uniqueness, but the ones which assume a level of intelligence in their users do make it possible, via various kinds of links (hard links and symbolic links, for instance).

If you're really very uncomfortable with it, I'm certainly not going to insist upon it, but I do think it makes for a better system in the long run if we try to minimize the number of copies of files as much as possible, while maximizing the number of documents they can be attached to. It's the virtual names (i.e. multiple linking) I most wanted, and the uniqueness of contents was a by-product idea that seemed good in retrospect from a storage perspective.

I do like your source document reference plan though.

I suppose I am viewing files as their own documents in all this, attachable to anything that supports it.

wondering if the relational model is really well suited for this

To my mind, it's the only model that's perfect for it, perhaps sans the primary key issue.

The first virtual FS I ever worked with was a PostgreSQL-backed one, although it stored the file contents outside the database.
In it, any number of paths and names could point to each file.

problem.  After all we are talking about a huge natural primary key so
however we go about enforcing that, there will be a substantial
performance cost.

I really wish we could find a way not to use that primary key, or to derive a unique short form, so we don't have that problem. You have a point about checksums, but there ought to be a way to fingerprint a file and do comparisons on that basis.
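
One way the fingerprint idea could be sketched: hash the contents down to a short fixed-size key, compare on that key first, and only fall back to a byte-for-byte comparison when the keys match. The function names are made up for the example, and SHA-256 is just one possible choice of digest.

```python
import hashlib
import io


def fingerprint(stream, chunk_size=65536):
    """Return a short fixed-size key for a file, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def same_file(data_a, data_b):
    """Compare fingerprints first; confirm with the full bytes on a match."""
    if fingerprint(io.BytesIO(data_a)) != fingerprint(io.BytesIO(data_b)):
        return False  # different fingerprints: contents certainly differ
    return data_a == data_b  # equal fingerprints: rule out a collision
```

The key to index would then be 64 hex characters regardless of file size, and the expensive full comparison would only run for files whose fingerprints already agree.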

I agree with that, but if we have to...  Do we have other tables in 1.3
where that's the case?


note (abstract table, no access)
entity_note (notes for companies and people)
eca_note (notes for customer/vendor credit agreements).

All could conceivably be queried together with:
select * from note, but to insert you must insert into the appropriate
subclassed table.

There is no attempt to enforce uniqueness of note contents.


(Although, if notes were likely to be several hundred K to a few meg each, someone would probably suggest it. :) )