Algo Hash
Description
The rough idea is to create an index which associates to any given position a list of all the games in which the position occurs. This is particularly useful for full board searches. To make this more efficient, we do not store the position itself (which would take a lot of space), but instead a so called hash code which is a number which is easily computed from the position and which is in practice unique.
Given a search pattern, we just compute the hash code for that pattern, and then look up the list of corresponding games from our list - this is of course a very fast operation in comparison to going through the complete list of games. Depending on how much you trust the uniqueness of the hash codes, one could then check whether these candidates are real hits.
There will be three variants of this algorithm in the final libkombilo: hashing for full board positions, for corner/side patterns and for center patterns.
Full board patterns
Hashing for full board patterns is already implemented. The current method is a straight-forward implementation of the principle explained above. When going through the games during processing, we compute the hash code for every position within every game, and write all these hash codes together with information about the corresponding game, the move number of the occurence, and the continuation to a database.
In this case, there are two ways to use the database. First, as with every hashing algorithm, we can use the database to produce candidates.
Options: In practice, it makes sense to limit this method to positions with at most 70 (or some other 'reasonable' number, actually much lower than 70 is probably enough) stones on the board. For one thing, in practice most full board searches will not have that many stones on the board. On the other hand, if there are very many stones, than the finalpos algorithm is very efficient, and searches are very fast anyway.
Ideas about improvements: (This currently consists of notes to myself rather than anything else) Currently, for a hash search the program takes the pattern list (with up to 16 patterns) and searches for each of them. It would be better to 'symmetrize' everything, so that only two searches (b/o the color reversal) are necessary. Of course, this has to be taken into account already during the processing. Since the current algorithm is fast enough for all practical purposes, this is more a question of the algorithm being most elegant than of speed improvement.
Corner/side patterns
The basic principle applies, but in these cases we will have to check the validity of candidates from the hashing database by a second algorithm in any case.
Note: hashing for corner/side patterns is not implemented yet. What follows is basically a collection of some ideas about this.
It seems reasonable to put 7x7 corner patterns and 4x6 (or even 4x7 or 4x8) side patterns into hash databases. This means that instead of using hash codes, we could just store the complete pattern in 10 bytes (for 7x7) or in 8 bytes (for 4xN, N <= 8). This is preferable to hash codes since with hash codes there is always a slight risk of collisions.
It seems possible to encode many (most?) relevant positions with even less bytes (but with a variable number of bytes) by using some hand-made compression algorithm. If we linearize a rectangle pattern, in practice there will very often be strings of several adjacent empty spots. This can be used to 'compress' the string describing the pattern a little.
As in the full board case, is is desirable to 'symmetrize' everything.
Center patterns
Nothig is implemented yet, and before doing so, I need to do some statistical analysis on center patterns occuring in real life. (The problem is that in this case one has to reduce the amount of data somehow; putting "everything" (as for full board/corner/side patterns) into the database would blow it up too much.) The current plan is roughly to create a list of 250,000 patterns, say, which are most likely to occur in searches, and to create an index which stores in which games and where they occur. This list of patterns would be hard-wired into the program.
