Consider Dan Bernstein's hashing algorithm:
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c;
return hash;
}
(I've yanked this from U of York's CS pages)
I've often wondered about this algorithm. It always striked me as strange as to why it shifts left by 5, rather than 8. I reckon the overlapping allows longer strings to be hashed, at the expense of uniqueness. This would also be why the hash is added to its shifted sibling.
However I randomly generated some input strings and hashed each one with this Haskell program, to see how many of the resultant hashes would collide. The answers surprise me:
import Data.Bits
import Foreign.C.Types
import Data.List
chr :: CULong -> Char
chr = toEnum . fromEnum
ord :: Char -> CULong
ord = toEnum . fromEnum
bhash :: String -> CULong
bhash = bhash' 5381
where
bhash' hs ( c : s ) = ord c + bhash' ((hs `shiftL` 5) + hs) s
bhash' hs _ = hs
collide :: CULong -> IO ( )
collide n =
let l = map bhash $ mk_syms n
cols = length l - ( length $ nub l )
in print $ "There were " ++ show cols ++ " collisions"
mk_syms :: CULong -> [ String ]
mk_syms n = map mk_unique_sym [ 0 .. n ]
mk_unique_sym :: CULong -> String
mk_unique_sym = mk_unique_sym' [ ]
mk_unique_sym' :: String -> CULong -> String
mk_unique_sym' acc w =
if w >= 26
then
mk_unique_sym' ( c8 : acc ) $ ( w `div` 26 ) - 1
else ( chr $ 97 + w ) : acc
where
c8 :: Char
c8 = chr $ ( w `mod` 26 ) + 97
GHCi, version 6.10.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer ... linking ... done.
Loading package base ... linking ... done.
[1 of 1] Compiling Main ( dan_hash.hs, interpreted )
Ok, modules loaded: Main.
*Main> collide 100
"There were 48 collisions"
*Main> collide 1000
"There were 888 collisions"
*Main> collide 10000
"There were 9861 collisions"
*Main> collide 100000
"There were 99769 collisions"
I'm assuming the high number of collisions are a property of my mechanical input, though its not surprising that the number of collisions increase as the input grows.
Tomorrow I think I might feed the words from the aligned hansards into it, to see if human speech fares better.
PS mk_unique_sym is what the Delve compiler uses to make unique names for new variables in intermediate code phase.
Also - the type checker has type checked a couple of expressions. I'm very happy - a full release should be soon :)
And then its time to rewrite the VM.... the work goes on.
5 October 2009
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment