It is trivial to define a function which given a range of numbers R and the number of bits in the hash B, produces a hashing algorithm to hash strings which have a length within R into a hash of size B.
Here is such a trivial function, in pseudocode*:
make_hash( R , B ) =
// the longest possible string as input
Longest = max( R )
// if the number of bits is greater or equal to the number of bits in the string
if B >= Longest * 7 then
// Compute a hash from the given hash H and the string with first character C, and rest S
hash( H , C : S) =
// Shift H 7 bits to the left
Shifted_H = H << 7
// recurse, ORing C with the shifted hash
hash( S , Shifted_H | C )
hash( H , _ ) = H
// Use curried application to give an initial hash of 0
return hash( 0 )
What would be nice would be a hashing function which is optimal like this, for values within the range - but when a string has a length that is not in the range, the algorithm has a chance of producing a non unique hash for it. The further the length of the string is from the range, the lesser the chance of the algorithm producing unique hash, given the string.
I theorize that it is possible for strings of lengths beyond the range to have a unique hash, in some situations, and not in others. To take a simple example, in the range [ 1 .. B / 7 ], any value not in the range cannot have a unique hash - there are not enough bits! However in the range [ 1 .. ( B - 1 ) / 7 ] with there are some bits left over, so it is possible.
* That's some weird pseudo-code. Uppercase variables like Erlang? Bitwise operators like C? Guards like Haskell? Curried function application? Someone implement it!
6 October 2009
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment