5 October 2009

Hash (algorithms, not the sticky kind)

Consider Dan Bernstein's hashing algorithm:

unsigned long
hash(unsigned char *str)
{
   unsigned long hash = 5381;
   int c;

   while (c = *str++)
   hash = ((hash << 5) + hash) + c;

   return hash;
}

(I've yanked this from U of York's CS pages)

I've often wondered about this algorithm. It always striked me as strange as to why it shifts left by 5, rather than 8. I reckon the overlapping allows longer strings to be hashed, at the expense of uniqueness. This would also be why the hash is added to its shifted sibling.

However I randomly generated some input strings and hashed each one with this Haskell program, to see how many of the resultant hashes would collide. The answers surprise me:

import Data.Bits
import Foreign.C.Types
import Data.List

chr :: CULong -> Char
chr = toEnum . fromEnum

ord :: Char -> CULong
ord = toEnum . fromEnum

bhash :: String -> CULong
bhash = bhash' 5381
   where
   bhash' hs ( c : s ) = ord c + bhash' ((hs `shiftL` 5) + hs) s
   bhash' hs _ = hs

collide :: CULong -> IO ( )
collide n =
   let l = map bhash $ mk_syms n
       cols = length l - ( length $ nub l )
   in  print $ "There were " ++ show cols ++ " collisions"

mk_syms :: CULong -> [ String ]
mk_syms n = map mk_unique_sym [ 0 .. n ]

mk_unique_sym :: CULong -> String
mk_unique_sym = mk_unique_sym' [ ]

mk_unique_sym' :: String -> CULong -> String
mk_unique_sym' acc w =
   if w >= 26
      then
         mk_unique_sym' ( c8 : acc ) $ ( w `div` 26 ) - 1
      else ( chr $ 97 + w ) : acc
   where
   c8 :: Char
   c8 = chr $ ( w `mod` 26 ) + 97

GHCi, version 6.10.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer ... linking ... done.
Loading package base ... linking ... done.
[1 of 1] Compiling Main ( dan_hash.hs, interpreted )
Ok, modules loaded: Main.
*Main> collide 100
"There were 48 collisions"
*Main> collide 1000
"There were 888 collisions"
*Main> collide 10000
"There were 9861 collisions"
*Main> collide 100000
"There were 99769 collisions"

I'm assuming the high number of collisions are a property of my mechanical input, though its not surprising that the number of collisions increase as the input grows.

Tomorrow I think I might feed the words from the aligned hansards into it, to see if human speech fares better.

PS mk_unique_sym is what the Delve compiler uses to make unique names for new variables in intermediate code phase.
Also - the type checker has type checked a couple of expressions. I'm very happy - a full release should be soon :)
And then its time to rewrite the VM.... the work goes on.

No comments:

Post a Comment