c++ bloom filter
Lets say that you have a 500000 lengh sentenceAGEGIOGRNOSEGSNOERNKLDVLLKNFKJEGRKLN...................GAGEADGE
And you want to convince people that there aren''t AAGTR, GTAGG, GGGGG int the string.
One way is chekcing every 5 strings like following. If you do not find them, you can conclude that there are not AAGTR, GTAGG, GGGGG
AGEGI
GEGIO
EGIOG
...........................EADGE
This way works but it is not very good.
For every input you have to compare 499995 times
There is a much better way than this!!!!
You can use bloom filter and the complexity is O(1) ; actually close to O(1), it is not exectly O(1)
Bloom filter is a good data structure that tell you a object is definitly not in the the set of objects.
However, it only can tell you that the object is probably in the set of object; the object may not in the set of objects.
How to
1) First thing you have to do is making a binary array based on a set of object.Make an int array. The length of array is determined by you. You need to make it big enough; small one is bad too. I will introduce a fomula leter....
If you make one, initialize to zero
2) Then you need obtain objects
If you use the above example, every 5 length of strings are considered as an object.
AGEGIOGRNOSEGSNOERNKLDVLLKNFKJEGRKLN...................GAGEADGE
AGEGI
GEGIO
EGIOG
...........................EADGE
Once you have a set of objects (5 length characters), apply several hash functions to get hash values for each objects.
The hash values are index of the array. If the array[hashval] is 0, change to 1
While you are putting 1 in the array, you will face hash collision.. like the following..
It is ok, leave them to be 1.
If the array contains all hash values of objects, you are ready to do the bloom filtering
3) let's do bloom filter
If we use above example, we want to know "AAGTR, GTAGG, GGGGG" are not in the 500000 length sentence.
Let's assume we use 4 hash functions to build the array.
Use the 4 hash functions on every object
Each object gives us 4 index of the map. Using the index, check the value of array.
If any of them has 0, we can say that the object is not definitely in the set.
If all of them are 1, w can say that the object is probably in the set
index1 index2 index3 index4
AAGTR 1 1 1 1
AAGTR 0 1 1 1
GGGGG 0 0 0 0
AAGTR is probably in the 500000 length setence
AAGTR, AAGTR is definitly not in the 500000 length setence.
Instead of looking at every 5 length string of 500000 length string, you just need to use 4 hash functions to figure out AAGTR, AAGTR is definitly not in the 500000 length setence.
It is really powerful tool. Isn't it??
Making the array is not very efficient. However, once you build one, search is super fast!!
Deciding the length of array and number of hash function are very important to create good bloom filter.
I research the fomula from the Internet.
People still do research to figure out the best way to do so..
If you have better one, can me tell me how??
p= probability of hash collioin you want to allow
n= number of objects you are planning to put in array
a = array length = ( ( n ) * ln( p ) ) / (ln(2)^2)
number of hash function = ( a/n ) * ln(2)
I made a sample code.
-------------------------------------------------------------------------------------------------------------------------
The following is practice code. If you have suggestion or correction, please feel free to leave comment.
#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <cstdlib>
#include <ctime>
#include <string>
#include <math.h>
using namespace std;
int hashfunction1 (string input);
int hashfunction2 (string input2);
int hashfunction3 (string input3);
int hashfunction4 (string input4);
void populateBloomFilter (int table[], int index);
void testBloomFilter(int table[], string input);
int main(int argc, char*argv[])
{
string input="";
string str="";
char randomletter;
srand((unsigned)time(NULL));
//create string that has random char A to Z
for (int i=0; i<500000; i++){
// get random char
randomletter = 'A' + (rand()%26);
// convert string to char
str = string(1, randomletter);
// add string
input += str;
}
// bloom filter table
int bloomfilterTable [2000000];
// initialize bloomfilter array to 0
for (int i = 0; i < 2000000; ++i)
{
bloomfilterTable[i]=0;
}
//populate the table
string tmp="";
int hashVal1=0;
int hashVal2=0;
int hashVal3=0;
int hashVal4=0;
for (int i=0; i<499995; i++){
tmp = input.substr(i,5);
hashVal1 = hashfunction1(tmp);
hashVal2 = hashfunction2(tmp);
hashVal3 = hashfunction3(tmp);
hashVal4 = hashfunction4(tmp);
populateBloomFilter(bloomfilterTable, hashVal1);
populateBloomFilter(bloomfilterTable, hashVal2);
populateBloomFilter(bloomfilterTable, hashVal3);
populateBloomFilter(bloomfilterTable, hashVal4);
}
//create random string with length 5
for(int j=0 ; j<10; j++){
string testInput="";
for (int i=0; i<5;i++){
// get random char
randomletter = 'A' + (random()%26);
// convert string to char
str = string(1, randomletter);
// add string
testInput += str;
}
//test
testBloomFilter(bloomfilterTable, testInput);
}
}
int hashfunction1 (string input)
{
double output=0;
double tmp=0;
for (int i = 0; i < input.length(); ++i)
{
output += (input[i]*pow(10,(double)i));
}
output +=9678939;
output = fmod(output, 2000000);
return (int) output;
}
int hashfunction2 (string input)
{
double output=0;
double tmp=0;
for (int i = 0; i < input.length(); ++i)
{
output += (input[i]*pow(10,(double)i));
}
output +=11111111;
output = fmod(output,2000000);
return (int) output;
}
int hashfunction3 (string input)
{
double output=0;
double tmp=0;
for (int i = 0; i < input.length(); ++i)
{
output += (input[i]*pow(10,(double)i));
}
output +=7865432;
output = fmod(output, 2000000);
return (int) output;
}
int hashfunction4 (string input)
{
double output = 0;
double tmp = 0;
for (int i = 0; i < input.length(); ++i)
{
output += (input[i]%192*pow(10,(double)i));
}
output = fmod(output, 2000000);
return (int) output;
}
void populateBloomFilter (int table[], int index)
{
if( table[index] == 0 ){
table[index] = 1;
}
}
void testBloomFilter(int table[], string input)
{
int val = hashfunction1(input);
int val2 = hashfunction2(input);
int val3 = hashfunction3(input);
int val4 = hashfunction4(input);
bool checker1 = (table[val] == 1);
bool checker2 = (table[val2] == 1);
bool checker3 = (table[val3] == 1);
bool checker4 = (table[val4] == 1);
cout << checker1 << '\t' << checker2 << '\t' << checker3 << '\t'<< checker4 << endl;
// above checkers are all true, input is probably in the 2000000 string
if (checker1 && checker2 && checker3 && checker4 == true ){
cout << input << " is probably in the 2000000 string" << endl;
}
else{
cout << input << " is definitly not in the 2000000 string" << endl;
}
}
No comments:
Post a Comment