[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Proposal for file attachment API



On Mon, 4 Jul 2011, Chris Travers wrote:

On Sun, Jul 3, 2011 at 11:28 PM, Luke <..hidden..> wrote:

I was expecting that, yes.  However, I shouldn't.  My recent experience is
with reasonably disciplined corporate users, who either get files from
sources with likely to be unique names (some form of the vendor name and
vendor's ID), create files for customers/vendors with names of the same
type, or are good at storing files with rather long, descriptive, and
accidentally unique names.

Are we only worried about corporate users here?  What about small

Not at all.  I said I was wrong to expect that.:)

However, if we combine our two ways of looking at this, I think we have
the solution.

If you store files by ID, and internally reference them by ID at all
times, they can all be called "foobar.pdf" and it doesn't matter.

Ok, so which foobar.pdf attached to a sales order is which one?  i'd
rather have some sort of unique constraint on the file name even if it
isn't global.  My current thinking is a uniqueness between the primary
attachment point and the file name.  So you can only have one
"contract.doc" attached to a given sales order.  However, if you have

Exactly. Enforce name uniqueness at the document level. "there is already a file of that name attached to this document."

You're talking about doing it in the database, which takes us back to some structure involving a table for each document class. The way I was thinking about (assuming the routine logic would take care of managing it) is probably too similar to how 1.2 would have done it, but your suggestion below *seems* at first look to get the job done.

a secondary link there (to be implemented in the future) to the sales
invoice, the user could see "contract.doc" linked from "sales order
12345."  There could only be one "contract.doc" linked from that sales
order to that invoice though there could be additional contract.doc's
from other sales orders onto the same invoice.

When a new file is uploaded, compare its CRC/checksum to the index of
stored files.  If there's no match, it's a new file, gets a new ID, and we
save it. If the checksum matches, compare the contents to the N files with
the match.  If one of them matches, create a link to the existing copy, by
inserting a referential entry in whatever table is tracking where files
are attached. If none of them matches, it's a new file.

I am still worried about handling naming collisions in a way that is
user-transparent.

Here's what I'm envisioning...

There are two public facing database routines which insert files, and one which associates them with documents.

The first, inserts a file, and associates it with a document:
* If the name of the uploaded file is the same as one associated with this document, throw an error. * It does the checksumish comparison that I spoke of above, without reference to name. If there's a match, create a link instead of storing; if not, store the file.

The second, stores the file with no document:
* It does the checksumish comparison that I spoke of above, without reference to name. If there's a match, create a link instead of storing; if not, store the file.

Association function:
* Verify document exists.
* Verify file exists.
* If the name of the pre-uploaded file is the same as one associated with the passed in document reference, throw an error, or let the user rename that particular link.

Obviously, there's crossover and so private function opportunity in that.
However, it's the basic form of what I was thinking.

I'm pushing this, because I think it's more extendable, and it also leads
directly to what Erik wanted.

If you divorce the storage of files, and the way they are tracked, from
the documents to which they are attached, you get a true virtual
filesystem.

If you are going to do a true virtual filesystem I think you'd have to
store a lot more information about each file to avoid the file
collision issue.  I personally would get cranky if I had to look
through 34 foobar.pdf's to figure out which one was the one I wanted.

If most files are attached to documents [assumption], you won't have to look through 34: look for the one associated with the document associated with your customer of interest, or similar. It becomes a reporting issue.

If you're looking through unattached uploades, maybe you can narrow by upload date, sort in some meaningful way (newest to oldist), or in some other way prioritize to help you find it quickly.

Also, one hopes that the descriptions would help.

Yes, it *could* be a problem, but with reasonable common sense behavior, it probably won't be in most cases.

 > Traditional filesystems get around this problem by requiring unique
paths.  We could do the same here but either the paths end up
meaningless or they end up impossible to enforce against different
database relations.

I don't think any path-like construct will work well.

The only thing I see as needing to be unique, is file contents. Names only matter at the document level. If users don't use good descriptions, they bring confusion upon themselves; and most files attached to documents will probably be self-explanatory anyway (I hope that's not too optimistic).

Hmm, if you did have to use paths, the vendor or customer name might be a good left hand portion, although I still don't like paths.

So the question I have is that if we go this way, how do we handle
that?  Over what domains do we require unique file names, and how do
we expose this to the user?

Well, because of what Erik wants (and I can see value in it), we will eventually need some sort of way to look at all files in overview--a file manager. It will have to display the name, the description, and the associated document(s), minimally.

Thinking about that, I am picturing a files item off the main menu. It would have a generic "add file" option, and a "reports" option.

The report could show the data I spoke of for files selected in some way (by upload date, by size, by type, by customer, by documents attached to, whatever). It would return a list of files, similar to a list of orders, including (usually/by default), the file name, description, and document(s) to which it's attached.

A file object, once drilled down to, could have the ability to assign it to new/other documents, or remove existing assignments, if the user's permissions were high enough.
If it's assigned to nothing, it could be deleted entirely.

Any document can point to any file(s), and any file can be pointed to by
one/some/no documents.

Associations can be re-mapped after file storage (this assumes a file
management UI at some point), which is necessary for Erik's suggestion.

The second addition I was thinking about a different way of attaching
secondary linking.

CREATE TABLE secondary_file_mapping AS (
file_id int not null,
ref_key int not null
attach_class int not null references file_class(id)
);

CREATE TABLE attach_tx_file_to_oe AS (
FOREIGN KEY file_id REFERENCES file_transaction(id),
FOREIGN KEY ref_key REFERENCES oe(id),
CHECK (attach_class = 2)
) INHERITS (secondary_file_mapping);

CREATE TABLE attach_oe_file_to_tx AS (
FOREIGN KEY file_id REFERENCES file_order(id),
FOREIGN KEY ref_key REFERENCES transaction(id),
CHECK(attach_class = 1)
) INHERITS (secondary_file_mapping);

CREATE TABLE attach_oe_file_to_oe AS (
FOREIGN KEY file_id REFERENCES file_order(id),
FOREIGN KEY ref_key REFERENCES oe(id),
CHECK (attach_class = 2)
) INHERITS (secondary_file_mapping); -- for order consolidation, etc

This gets more complicated than your solution, Luke and that's a
legitimate question of tradeoffs.  But I am wondering if a real

Yeah. I see what you're doing. I don't know enough about advanced PostGreSQL to know if it has to be done that way or not. It seems that this assumes some modifications (restriction removal) to your tables from the other day--it might be useful to see the whole file with these added.

virtual filesystem is desired if it would be better to use a mature

I may have thrown around the term a bit too loosely. This is a limited virtual FS--virtual only in the sense that file contents is not linked to name or location [association], and the file can be pointed to by any number of equally valid link points.

 > document management solution instead and use the proposed x-url

I started trying to look at those, when I saw your original proposal to me and Alvin Starr for doing the original form of this. That's why I didn't respond right away: I wanted to see if I could find some document manager which would fit the bill. To date, I haven't. User management is a big sticking point.

Note here you can get all files by "select  * from file_attachment"
and all secondary links by "select * from secondary_file_mapping."

I don't currently see a way around the naming collision issues across
wide ranges of users which strike me as very manageable, but I'd be
interested in hearing suggestions.

The best I have right now is above.

Luke