Speaking of storing search results permanently and keeping them up to date, we could do the same thing with our concatenated lists of entries. Although that would be a lot bigger and probably less useful.
So once we perform the intersection, what happens to the results?
Obviously we have to store them back to disk.
If we sort the tags, concatenate them, and then hash them, we could get a unique identifier to save the result under.
So our intersection method should accept a separate index to store the results in, and we should support incremental updates to it. All we have to do is check the last item and find it in each of our input arrays. Naturally, when we search for it, we should start at the end. If it's the last item in any of our inputs, then we don't need to do anything at all; it's already up to date.
It seems like this whole algorithm/data structure thing is actually shaping up pretty well, although it's taking longer than I'd like.
Actually, maybe we could get it to work by performing the binary search and then waiting to write the results until the top half of the children are done, then writing our result, then continuing with the bottom half. That sounds pretty practical.
1. We need to iterate through the index in order
2. We need to look up in the index at random
3. We need to append to the index
4. We need to build the index EITHER in order or out of order, depending on algorithm
It's rule 4 that got us this time. The best algorithm for performing an intersection on an array in-order the result out-of-order. The best algorithm for building an intersection in-order searches the array out-of-order.
Okay, so our "unsorted" sorted intersection algorithm is even more complex than we thought.
Assuming our indexes are too big to fit in memory, it's a fair claim that the result will be too. So we need to produce our output direct-to-file. That's even harder than it sounds because our files are append-only and we want to keep it in order.
File/method synchronization between processes
> fs.flock(fd, flags, [callback])
> fs.seek(fd, offset, whence, [callback])
It'd be much better to store the raw binary hashes instead of the base 64 versions. Raw SHA1 hashes are 20 bytes, which contains a lot more information than a 14 byte base 64 value.
If our index files don't contain line breaks they'll hardly be readable without a hex editor anyway.
Now I'm thinking we can just use the "merge" method. Scan through both indexes in order. Whenever we get a pair that aren't the same, scan through the other file until we find a match.
But I think that means O(N^2) quadratic time in the worst case, because whenever a value is missing from the other file, we have to rescan that whole file to check. If there's no overlap between the two sets, we rescan each file over and over.
If we started in the middle of the bigger file, we would have to scan the entire smaller file; but then, assuming we found it, we would only have to scan half of the smaller file for each remaining entry. So a binary search approach seems like it would work.
I'm thinking a tree that just stores one or two characters of each entry hash per node.
If we just store one character per node, that's 64 characters (base 64 hash), that means we can store one 64-bit offset per character to get a 512 byte block per level.
Of course big reads on a HDD are good, so maybe it'd be better to store 2 characters together, which gets us 4096 combinations, meaning 32K per node.
That would mean (up to) 6 reads before a new value could be added.
The problem with all this, and the part where I get confused, is that we want to preserve the order of our entries. Do we need a separate file for that, or what?
My index tag is starting to get pretty big. It still loads super fast, for now, but what should we do?
Remember, the index tag works the same as our other tags, so we need a good system.
Displaying all of the entries for a tag on one page really was nice, since we could search through them, etc.
Maybe the solution is to just stop viewing the index tag all the time.
- Reverse the list
- Add links
- Convert entries to HTML AOT
- Add the new entry form to the tag list page