6 October 2009

hashing within a range

It is trivial to define a function which given a range of numbers R and the number of bits in the hash B, produces a hashing algorithm to hash strings which have a length within R into a hash of size B.

Here is such a trivial function, in pseudocode*:

make_hash( R , B ) =
   // the longest possible string as input
   Longest = max( R )
   // if the number of bits is greater or equal to the number of bits in the string
   if B >= Longest * 7 then
      // Compute a hash from the given hash H and the string with first character C, and rest S
      hash( H , C : S) =
         // Shift H 7 bits to the left
         Shifted_H = H << 7
         // recurse, ORing C with the shifted hash
         hash( S , Shifted_H | C )
      hash( H , _ ) = H
      // Use curried application to give an initial hash of 0
      return hash( 0 )

What would be nice would be a hashing function which is optimal like this, for values within the range - but when a string has a length that is not in the range, the algorithm has a chance of producing a non unique hash for it. The further the length of the string is from the range, the lesser the chance of the algorithm producing unique hash, given the string.

I theorize that it is possible for strings of lengths beyond the range to have a unique hash, in some situations, and not in others. To take a simple example, in the range [ 1 .. B / 7 ], any value not in the range cannot have a unique hash - there are not enough bits! However in the range [ 1 .. ( B - 1 ) / 7 ] with there are some bits left over, so it is possible.

* That's some weird pseudo-code. Uppercase variables like Erlang? Bitwise operators like C? Guards like Haskell? Curried function application? Someone implement it!

No comments:

Post a Comment