The Adventures of Joshua Judson Rosen
(action man)


Sun, 07 Sep 2008

20:58: Computing Similarity, and Computing Difference

Way back in June or July, I had dinner with Chris and Allli, and I got to do a little demo to show off my VisualIDs-in-Nautilus work as far as it had progressed at that point. Since I'd just started thinking about the problem of identifying similar files in a global context, and was (somewhat stupidly) proud of myself for having come up with a way of seemingly making it easier than I'd initially expected it to be, I raised it in conversation; regrettably, this (along with the talk about my new job) resulted in Chris and me marooning Allli in geekspeak....

When I initially read the essay, this part (like so many other parts) looked great (`on paper', as they say?). When I initially dug in as an implementor, this part (like so many other parts) looked more half-baked: deriving icons in a group from the same source made fine sense, but how were the groups to be formed?

But then I thought about it some more, and it occurred to me that, since I was keeping a cache of VisualIDs and the names to which they belonged, I could just scan through the cache whenever a new VisualID needed to be generated, see if I could find an appropriately-similar base-ID, and then go from there--this would be where the `longer than 3 characters' part of the matching-algorithm came in. All I would need to do in order to guarantee that this actually worked was to ensure that all of the VisualIDs were generated synchronously, which actually turned out to be easy enough in Nautilus--I ended up hooking into the thumbnail-generation subsystem, which was already synchronous anyway.
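A minimal sketch of that cache scan, in Python rather than the actual Nautilus code. It assumes that "similar" means "shares a leading substring longer than 3 characters" and that the cache is a plain mapping from file names to IDs; the names `find_base_name` and `MIN_MATCH` are made up for illustration, and the real matching rule may well differ.

```python
import os

# Assumption: a match only counts if it is longer than 3 characters.
MIN_MATCH = 3


def common_prefix_len(a, b):
    """Length of the leading substring shared by a and b."""
    return len(os.path.commonprefix([a, b]))


def find_base_name(new_name, cache):
    """Return the cached name most similar to new_name, or None.

    `cache` maps file names to previously generated IDs (hypothetical layout).
    """
    best_name, best_len = None, MIN_MATCH
    for cached_name in cache:
        n = common_prefix_len(new_name, cached_name)
        if n > best_len:  # strictly longer than the best match so far
            best_name, best_len = cached_name, n
    return best_name


cache = {"report-2008.txt": 1, "notes.txt": 2}
base = find_base_name("report-2009.txt", cache)
```

If a base name comes back, the new ID would be derived from the cached one; if not, it would be generated from scratch. The scan only gives stable answers if nothing else is adding to the cache at the same time, which is why the synchronous generation matters.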

Chris posited the obvious flaw in this scheme: if one has multiple computers, wouldn't one want the icons to be consistent across all of them? If the consistency breaks down, then doesn't the utility break down?

But, unless we have some way of coordinating between the distinct systems, this looks like a hard problem: we can't just use the `ouija-board navigation' technique, we actually have to come up with some sort of consistent algorithm for gleaning some sort of meaningful structure from free-form file-names. Chris didn't think it'd really be that hard of a problem. I'm not convinced that it's anything like easy.

It looks like I can actually punt, though--I can say:

If you want multiple computers to synchronise their repertoires of VisualIDs, then just synchronise their cache-directories--how you do it is outside the scope of this project; you should be able to use whatever mechanism you use to synchronise other files between computers.

And, for the time being, that's what I'm doing: I just added my .icons/ directory to my Unison configuration, and now it gets synchronised between my laptop and my desktop along with the rest of my home-directory. It works--it actually works really well. And it would work using any of the other zillion synchronisation-systems available; for GNOME, it might make sense to do it via Conduit--it probably works just using whatever generic file-synchronisation mechanism Conduit provides, but it looks like a specific `Synchronise Icons' option would be easy enough to add.

So, every aspect of grouping similar icons together is really pretty easy, at least as far as I can see.

Where it looks like things get difficult is actually in reverse: guaranteeing that different and unrelated things actually look different, and that things don't end up with similar icons just-by-happenstance. How can that be managed? Can we just assume that the PRNG will make it work out that way?
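To make the PRNG question concrete, here is a rough Python sketch, assuming each icon's parameters are drawn from a generator seeded by a hash of the file name. The helpers `seed_for` and `icon_params` are hypothetical and not part of any existing VisualIDs code.

```python
import hashlib
import random


def seed_for(name):
    """Derive a stable integer seed from a file name (hypothetical helper)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def icon_params(name, count=4):
    """Draw `count` pseudo-random parameters in [0, 1) for the name's icon."""
    rng = random.Random(seed_for(name))
    return [rng.random() for _ in range(count)]
```

The same name always yields the same parameters, and unrelated names get unrelated parameters. That only shows the numbers differ, though: whether different numbers reliably produce icons that look different to a person is exactly the part that remains open.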

If anyone has any specific thoughts on either issue, I'd love to hear them....
