What is a Bloom Filter, Really?
When are sets not enough?
As we’ve seen, sets provide a powerful and efficient solution to specific problems. They allow you to efficiently determine if something belongs in a certain categorization, or not. Have I seen this song before? Have I seen this computer before? Have I seen this SSL certificate before?
Normal sets also store information about anything that has been inserted  this usually means you can enumerate over the inserted elements (in no particular order, of course).
Sets are fast  but normal sets take up quite a bit of space. If you want to view the contents of the set, you must store each element in its entirety. What if you forwent this ability and tried to store a shortened version of each element? Perhaps you only care if the inserted elements are in the set or not. Maybe the first 5 characters? Maybe a hash of the element? Both techniques suffer from the same issue: abcde5
and abcde6
map to the same element, abcde
, so you would lose information. With hashing, if there’s a collision (say abcd
and efgh
both hash to 1
) you will also lose information. In both cases you’ll have 1 element in your set when you wanted 2.
If you are keeping a set of cities you’ve visited, for example, you must have enough memory to store the string name of every city. Let’s say you’ve visited 100 cities, with names ranging from 5 to 20 characters long. If it takes 8 bits to store each character, you will need around 10,000 bits for your set.
Bloom Filters?
Bloom filters provide a besteffort small setlike data structure. Let’s focus on the operations you perform on a Bloom Filter. There are only two:

insert(element)
: Add the element to the set. 
member?(element)
: Is the element in the set?
You cannot enumerate elements in a bloom filter. You can check if you’ve added something, but there is absolutely no way to see “all you’ve added previously”.
This is because they shorten inserted elements to the smallest possible quantity of information (using hash functions) and accept a given collision rate. Collisions mean there will be some unavoidable falsepositive possibilities when checking for set membership. Let’s clear up the confusion with a deep dive.
Internals
A Bloom Filter has to store the following information.
 A bit vector. This is the main squeeze; it’s simply an ordered array of slots, each slot filled with either a
0
or a1
. Each slot has a number, or index. The array’s size is initialized at filter creation time. It does not dynamically resize.  A group of hash functions. We’ve discussed hash functions before  given an input they produce a bounded number. Hashing something is irreversible, and a hash functions codomain is uniformly distributed (no hotspots!)
When constructing a BF, I must know the upper bound of elements in my set. We’ll denote the number max of elements I plan to insert as n.
Insertion: Let’s say I have a filter with a bit vector of length 11 and 3 different hash functions. I want to insert the string cheese
.
index1 = hash1("cheese") % 11 # => 2
index2 = hash2("cheese") % 11 # => 6
index3 = hash3("cheese") % 11 # => 9
I compute 3 indices (modulo 11) and set the bits at those indices to 1
in my bit vector. cheese
has been inserted.
If I want to check for membership, I compute the same indices, but instead of setting the bits to 1
, I check if they’re all 1
. If so, I can say cheese
is in the set. If not, cheese
is not in the set.
What about collisions?
We know that whenever you deal with hashes % some number you must deal with collisions.
If cheese
sets bits 2, 6, and 9 to 1
, and pepper
sets bits 0, 8, and 5 to 1
, then member?(cheese)
and member?(pepper)
should return true
like we expect them to.
But what about onion
? onion
‘s three hash values at 6, 8, and 9. This means checking member?(onion)
will return true since the bits at indices 6, 8, and 9 all read 1
. We never inserted onion
! This is a false positive.
Unfortunately, this is simply the nature of Bloom Filters. You must accept a false positive rate. It is for this reason that BF’s member?
function can only ever return maybe it's there
or definitely not
.
It’s important to note that there is no false negative. It makes when you think about it  if you check for membership and one of the bits is a zero, you can be sure that element was never inserted.
We can calculate p, the false positive rate using
m : the number of bits in the vector
n : the number of elements (you anticipate) in the set
k : the number of hash functions to use.
Tweaking
If that math was too much, it’s not a problem. The important thing to know is you must tweak bloom filters for your need. Once you know the number of anticipated elements in the set and your accepted false positive rate (the chance member?
will return true
for something not in the set) you can calculate the parameters for your bloom filter using math (or this calculator). Luckily, most Bloom Filter libraries do this calculation for you.
To bring it home, let’s review the cities example from above. If you wanted to keep a set of 100 cities you’ve visited and you accepted a 1% false positive rate, you would only need 959 bits.
Next up: what can Bloom Filters do for you, and who uses them?