On Sat, 2 Jul 2011, Chris Travers wrote:
> On Sat, Jul 2, 2011 at 1:16 PM, Luke <..hidden..> wrote:
>> Probably though, as I think about it, this would require globally unique filenames, and a name comparison with new uploads, possibly followed by a content comparison if names match. I'm not sure globally unique filenames are such a bad idea anyway.
>
> There's a fairly nasty case here that you can run into. If globally unique file names are required, then how do you know in advance what sort of names are used? Do we want to expect the users of the system to all come up with naming conventions that avoid collisions?
I was expecting that, yes. However, I shouldn't. My recent experience is with reasonably disciplined corporate users, who get files from sources whose names are likely to be unique (some form of the vendor name and vendor's ID), create files for customers/vendors with names of the same type, or are good at storing files with rather long, descriptive, and accidentally unique names.
However, if we combine our two ways of looking at this, I think we have the solution.
If you store files by ID, and internally reference them by ID at all times, they can all be called "foobar.pdf" and it doesn't matter.
When a new file is uploaded, compare its CRC/checksum to the index of stored files. If there's no match, it's a new file, gets a new ID, and we save it. If the checksum matches, compare its contents to each of the N stored files with that checksum. If one of them matches, create a link to the existing copy, by inserting a referential entry in whatever table is tracking where files are attached. If none of them matches, it's a new file.
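To make the upload steps concrete, here is a rough sketch in Python. It is only an illustration: the FileStore class, its in-memory dicts, and the names document_id and display_name are all made up for this example, standing in for whatever tables we would really use, and CRC32 is just one possible checksum.

```python
import zlib


class FileStore:
    """Hypothetical in-memory stand-in for the file and attachment tables."""

    def __init__(self):
        self.contents = {}     # file_id -> bytes
        self.by_checksum = {}  # checksum -> [file_id, ...]
        self.links = []        # (document_id, file_id, display_name)
        self._next_id = 1

    def upload(self, document_id, display_name, data):
        checksum = zlib.crc32(data)
        # A checksum match only narrows the candidates; confirm by content.
        for file_id in self.by_checksum.get(checksum, []):
            if self.contents[file_id] == data:
                break  # identical file already stored, so reuse it
        else:
            # No checksum match, or a collision with different content:
            # treat it as a new file and give it a new ID.
            file_id = self._next_id
            self._next_id += 1
            self.contents[file_id] = data
            self.by_checksum.setdefault(checksum, []).append(file_id)
        # Either way, record where the file is attached.
        self.links.append((document_id, file_id, display_name))
        return file_id
```

Two documents that both upload an identical "foobar.pdf" end up with two link rows pointing at one stored copy, while a different "foobar.pdf" simply gets its own ID.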
I'm pushing this because I think it's more extensible, and it also leads directly to what Erik wanted.
If you divorce the storage of files, and the way they are tracked, from the documents to which they are attached, you get a true virtual filesystem. Any document can point to any file(s), and any file can be pointed to by one/some/no documents.
Associations can be re-mapped after file storage (this assumes a file management UI at some point), which is necessary for Erik's suggestion.
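As a sketch of what that separation could look like, assuming nothing more than a set of (document, file) pairs: the AttachmentIndex class and its method names below are hypothetical, not a proposal for the actual schema.

```python
class AttachmentIndex:
    """Hypothetical many-to-many mapping between documents and stored files."""

    def __init__(self):
        self.links = set()  # {(document_id, file_id), ...}

    def attach(self, document_id, file_id):
        self.links.add((document_id, file_id))

    def detach(self, document_id, file_id):
        # The stored file itself is untouched; only the association goes away.
        self.links.discard((document_id, file_id))

    def remap(self, file_id, old_document_id, new_document_id):
        # Move an association after the file has already been stored.
        self.detach(old_document_id, file_id)
        self.attach(new_document_id, file_id)

    def files_for(self, document_id):
        return sorted(f for d, f in self.links if d == document_id)

    def documents_for(self, file_id):
        return sorted(d for d, f in self.links if f == file_id)
```

A file with no remaining associations still exists in storage, which is the "no documents" case, and a management UI would only need to call something like attach, detach, and remap.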
Luke