Saturday, December 12, 2015

c++ quick sort

C++ quick sort

Quick sort is a powerful sorting alogrithm. It sort a set in O( n*log(n) ).
There are bad cases which the big O can be O(n^2). However, the case is very very rare.

In order to understand this sorting algoritm, you need to know what recussive funtion..
If you do not know that, think it as using same function over and over; you will see that soon.

Let's say that you would like to sort the following array.


In order to do quick sort, you need two functions.

One is called, "quick_sort." The other one is called "partition."

"quick_sort" calls the "partition".
"partition" picks one of array value( only one!!!!), place the value to right position and return the position of value

quick_sort

void quickSort (int array[], int low, int high)
{
    if( low < high ){

        int position = partition(array, low, high);
        quickSort (array, low, position-1);
        quickSort (array, position+1, high);
    }
    else{
        return;
    }
}

array = arrary that you want to sort
low = first index of set
high = last index of set

The first thing the "quick_sort" is checking whether the low is smaller than high.
It may look strange to you. However, if you read the following three line, you can understand the reason.
After chekcing the condition, "quick_sort" calls "partition" to place one element to right poisiton in the array.
"partition" makes sure one element of the array is in the right position and returns position.

 The position is used to create two "quick_sort"s.

        quickSort (array, low, position-1);   -----> low = low           position -1 = high
        quickSort (array, position+1, high); ------> low = position+1     high = high

"quick_sort" makes "quick_sort"s over and over until low >= high.

This kind of function is called recursive function.


Then I will move on to the "partiton"

 Just like "quick_sort", you have three parameters arary, low high

        int partition (int array[], int low, int high)

 In the you need to have three tools
They are
               i = index
               j = index
               pivot_val = array[ high ]

It would be like this

 If array[i] > pivot_val, increment i by one
        6                3
If array[i] <= pivot_val, switch array[i] and array[j]. and increament i and j by one.
          1                3
        

Do this untill i reach "high"



Then switch array[i] and array[j]



The element is in yellow box is placed right postion. "partiton", returns the position of this.

"quick_sort" calls "quick_sort"s to work on The rest elemnt of array

The power of quick sort is that everytime, if place a element in right postiion, it it will breaks array into small pieces for sorting.

I made a sample code.

-------------------------------------------------------------------------------------------------------------------------
 The following is practice code. If you have suggestion or correction, please feel free to leave comment.

#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <ctime>
#include <cstdlib>


using namespace std;

void quickSort (int array[], int low, int high);
int partition (int array[], int lcow, int high);

int main()
{
    srand(time(0));
    int array[8] = {7,6,4,1,20,4,5,7};
    int len = 8;

    /*
    // populate array with random number between 0 to 99
    for (int i = 0; i < len; i++)
    {
        array[i] = rand()%100;
    }
    */
    quickSort(array, 0, 7);

    //print
    for (int i = 0; i < len; i++)
    {
        cout << array[i] << " ";
    }
    cout << endl;

}

/*
 * @brief Do quick sort (recursive function)
 * @param int array[] - array that is needed to be sort
 * @param int low - min bound for sorting
 * @param int high - max bound for sorting
 * @return void (call by referece) sorted array
 */
void quickSort (int array[], int low, int high)
{

    if( low < high ){
        // after the partition, val in the position is sorted
        int position = partition(array, low, high);
        cout << "position " << position << endl;
        // sort values before the postion
        quickSort (array, low, position-1);
        // sort values after the position
        quickSort (array, position+1, high);

    }

    // if row and high are same, there is nothing to be sorted, so termimate the sorting
    else{
        return;
    }
}

/*
 * @brief Pick a pivot and put it in the right position
 * @param int array[] - array that is needed to be sort
 * @param int row - min bound for sorting
 * @param int high - max bound for sorting
 * @return index of pivot
 */
int partition (int array[], int low, int high)
{
    int tmp;

    int pivot_val = array[high];

    int j=low;
    int i;
    for(i=low; i<high; i++){
        if( array[i] <= pivot_val ){
            tmp = array[i];
            array[i] = array[j];
            array[j] = tmp;
            j++;
        }
    }
    tmp  = array[high];
    array[high] = array[j];
    array[j] = tmp;

    return j;

}





 

Saturday, December 5, 2015

c++ bloom filter

c++ bloom filter

Lets say that you have a 500000 lengh sentence

AGEGIOGRNOSEGSNOERNKLDVLLKNFKJEGRKLN...................GAGEADGE

And you want to convince people that there aren''t AAGTR, GTAGG, GGGGG int the string.

One way is chekcing every 5 strings like following. If you do not find them, you can conclude that there are not AAGTR, GTAGG, GGGGG
 
AGEGI
   GEGIO
      EGIOG
          ...........................EADGE
This way works but it is not very good.
For every input you have to compare 499995 times

There is a much better way than this!!!!
You can use bloom filter and the complexity is O(1) ; actually close to O(1), it is not exectly O(1)

Bloom filter is a good data structure that tell you a object is definitly not in the the set of objects.
However, it only can tell you that the object is probably in the set of object; the object may not in the set of objects.

How to

1) First thing you have to do is making a binary array based on a set of object.

Make an int array. The length of array is determined by you. You need to make it big enough; small one is bad too. I will introduce a fomula leter....
If you make one, initialize to zero

2) Then you need obtain objects
If you use the above example, every 5 length of strings are considered as an object.

AGEGIOGRNOSEGSNOERNKLDVLLKNFKJEGRKLN...................GAGEADGE

AGEGI
   GEGIO
      EGIOG
          ...........................EADGE

Once you have a set of objects (5 length characters), apply several hash functions to get hash values for each objects.


The hash values are index of the array. If the array[hashval] is 0, change to 1

While you are putting 1 in the array, you will face hash collision.. like the following..




It is ok, leave them to be 1.


If the array contains all hash values of objects, you are ready to do the bloom filtering


3) let's do bloom filter
    If we use above example, we want to know "AAGTR, GTAGG, GGGGG" are not in the 500000 length sentence.
    Let's assume we use 4 hash functions to build the array.
 
    Use the 4 hash functions on every object
    Each object gives us 4 index of the map. Using the index, check the value of array.
    If any of them has 0, we can say that the object is not definitely in the set.
    If all of them are 1, w can say that the object is probably in the set
                                       index1               index2             index3              index4
    AAGTR                           1                       1                     1                         1
    AAGTR                           0                       1                     1                         1
    GGGGG                          0                        0                      0                        0

    AAGTR is probably in the 500000 length setence
    AAGTR, AAGTR is definitly not in the 500000 length setence.

Instead of looking at every 5 length string of 500000 length string, you just need to use 4 hash functions to figure out AAGTR, AAGTR is definitly not in the 500000 length setence.

It is really powerful tool. Isn't it??
Making the array is not very efficient. However, once you build one, search is super fast!!


Deciding the length of array and number of hash function are very important to create good bloom filter.

I research the fomula from the Internet.
People still do research to figure out the best way to do so..
If you have better one, can me tell me how??

p= probability of hash collioin you want to allow
n= number of objects you are planning to put in array
 
a = array length =  ( ( n ) * ln( p ) )  / (ln(2)^2)

number of hash function = ( a/n ) * ln(2)


I made a sample code.

-------------------------------------------------------------------------------------------------------------------------
 The following is practice code. If you have suggestion or correction, please feel free to leave comment.

#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <cstdlib>
#include <ctime>
#include <string>
#include <math.h>

using namespace std;

int hashfunction1 (string input);
int hashfunction2 (string input2);
int hashfunction3 (string input3);
int hashfunction4 (string input4);
void populateBloomFilter (int table[], int index);
void testBloomFilter(int table[], string input);



int main(int argc, char*argv[])
{
    string input="";
    string str="";
    char randomletter;

    srand((unsigned)time(NULL));

    //create string that has random char A to Z
    for (int i=0; i<500000; i++){

        // get random char
        randomletter = 'A' + (rand()%26);
        // convert string to char
        str = string(1, randomletter);
        // add string
        input += str;
    }

    // bloom filter table
    int bloomfilterTable [2000000];

    // initialize bloomfilter array to 0
    for (int i = 0; i < 2000000; ++i)
    {
        bloomfilterTable[i]=0;
    }

    //populate the table
    string tmp="";
    int hashVal1=0;
    int hashVal2=0;
    int hashVal3=0;
    int hashVal4=0;
    for (int i=0; i<499995; i++){

        tmp = input.substr(i,5);
        hashVal1 = hashfunction1(tmp);
        hashVal2 = hashfunction2(tmp);
        hashVal3 = hashfunction3(tmp);
        hashVal4 = hashfunction4(tmp);

        populateBloomFilter(bloomfilterTable, hashVal1);
        populateBloomFilter(bloomfilterTable, hashVal2);
        populateBloomFilter(bloomfilterTable, hashVal3);
        populateBloomFilter(bloomfilterTable, hashVal4);
    }


    //create random string with length 5
   
    for(int j=0 ; j<10; j++){
        string testInput="";
        for (int i=0; i<5;i++){
            // get random char
            randomletter = 'A' + (random()%26);
            // convert string to char
            str = string(1, randomletter);
            // add string
            testInput += str;
        }

        //test
        testBloomFilter(bloomfilterTable, testInput);
    }
}

int hashfunction1 (string input)
{
    double output=0;
    double tmp=0;
    for (int i = 0; i < input.length(); ++i)
    {
        output += (input[i]*pow(10,(double)i));
    }

    output +=9678939;
    output = fmod(output, 2000000);

    return (int) output;

}

int hashfunction2 (string input)
{
    double output=0;
    double tmp=0;
    for (int i = 0; i < input.length(); ++i)
    {
        output += (input[i]*pow(10,(double)i));
    }

    output +=11111111;
    output = fmod(output,2000000);

    return (int) output;
}

int hashfunction3 (string input)
{
    double output=0;
    double tmp=0;
    for (int i = 0; i < input.length(); ++i)
    {
        output += (input[i]*pow(10,(double)i));
    }

    output +=7865432;
    output = fmod(output, 2000000);

    return (int) output;
}

int hashfunction4 (string input)
{
    double output = 0;
    double tmp = 0;
    for (int i = 0; i < input.length(); ++i)
    {
        output += (input[i]%192*pow(10,(double)i));
    }

    output = fmod(output, 2000000);

    return (int) output;
}

void populateBloomFilter (int table[], int index)
{
    if( table[index] == 0 ){
        table[index] = 1;
    }

}

void testBloomFilter(int table[], string input)
{
    int val  = hashfunction1(input);
    int val2 = hashfunction2(input);
    int val3 = hashfunction3(input);
    int val4 = hashfunction4(input);

    bool checker1 = (table[val] == 1);
    bool checker2 = (table[val2] == 1);
    bool checker3 = (table[val3] == 1);
    bool checker4 = (table[val4] == 1);

    cout << checker1 << '\t' << checker2 << '\t' << checker3 << '\t'<<  checker4 << endl;
    // above checkers are all true, input is probably in the 2000000 string
    if (checker1 && checker2 && checker3 && checker4 == true ){
        cout << input << " is probably in the 2000000 string" << endl;
    }
    else{
        cout << input << " is definitly not in the 2000000 string" << endl;
    }


}