Count Min Sketch

Count-min sketch is a sublinear space data structure for summarizing data streams.

Count min sketch helps you to understand how many times certain item has occured in your stream. It's like a database which you can only retrieve a count from, without being able to retrieve a precise value. So you can ask it "how many times have I seen Alex?" but you can never ask it "what items have you seen at all?".

Algorithm

The algorithm itself is quite straightforward:

  • You create a table, filled with zeroes, of a width $w$ and height $d$
  • $d$ is calculated from $d = ⌈ln 1/δ ⌉$
  • $w$ is calculated from $w = ⌈e/ε⌉$
  • You take $d$ linear hash functions

Update

  • You for rows with indexes from 0 to $d-1$ (zero-based index), you calculate a hash of given value
  • Having a current update row denoted as $i$, and $i$ hash-function denoted as $h$, and current value in your table denoted as $old$
  • You update a table at position $[i, h(value)]$ to $old + 1$

Lookup

  • You for rows with indexes from 0 to $d-1$ (zero-based index), you calculate a hash of given value
  • Having a current update row denoted as $i$, and $i$ hash-function denoted as $h$
  • You find a maximum of values at positions $[i, h(value)]$
  • Profit!

Live demo

You can play around with a live-updating implementation of a count min sketch below. You enter a word, hit add button, table gets updated. For each entered word, you will see an estimated occurrence value.

 

Published on May 29, 2014

If you like my content, you might want to follow me on Twitter to subscribe for the future updates!