Thursday, February 11, 2016

C++ kseq

C++ kseq.h

(I assume you are using ubuntu)
kseq.h is a c++ tool that helps you read fastq/fasta file line by line.


If you go to https://github.com/samtools/htslib/tree/develop/htslib , You can download kseq.h

place kseq.h  in same directory as your c++ program.

In order to run kseq, you need to intall zlib for your c++ compiler.
Tpe following command in your terminal

sudo apt-get install zlib1g-dev

 

This command will install zlib for your compiler.

I will give you a sample program that read fastq/a file.
After giving you a sample code, I will tell you how to compile your code.

#include <iostream>

// include kseq.h
#include "kseq.h"
// include zlib.h
#include "/home/tom/software/zlib-1.2.8/zlib.h" // download zlib (http://www.zlib.net/)
                                                                             // extract the downloaded file and find the zlib.h
                                                                              // get full path (the command you did for zlib if for your compiler only, so you have to do this)

//kseq is initialized by zlib
KSEQ_INIT(gzFile, gzread)

using namespace std;

int main()
{
     // your fastq/a file has to be read by zlib
     gzFile fp;
     char * input_fastq = "path to your fastq file";
     fp = gzopen(input_fastq, "r");

     //  initializing fastq/a stream
    kseq_t *seq;
    seq= kseq_init(fp);

    // read fastq/a
    while( kseq_read(seq) >=0){
        //printing id
        cout << seq->name.s << endl;

       //printing seq
       cout << seq->seq.s << endl;

       // if it is fastq
       //printing comment
       if (seq->comment.l) cout<< seq->comment.s;

       // if it is fastq
       // printing quality
       if (seq->qual.l) cout <<seq->qual.s;  

    }

     // clean up
     kseq_destroy(seq);
     gzclose(fp);

}


The below command is compiling your program. Make sure you add "-lz"

g++ your_program.cpp -lz

 

If you have a concern or question, feel free to ask.
Thank you

Thursday, February 4, 2016

C++ assert

C++ assert

Assert is a great tool for debugging your c++ code.

Let's have look the following codes.

    #include <iostream>
    #include <functional>
    #include <assert.h>

    using namespace std;

        int main()
        {
            int a = 10;
            assert(a >10);

            char b ='a';
            cout << b << endl
        }

         If I compile and run it, I get the following message

         Assertion failed: (a >10), function main, file main.cpp, line 10.
         Abort trap: 6
   
     int main()
    {
          int a = 10;
          assert(a ==10);
          char b ='a';
          cout << b << endl
     }

     If I compile and run it, it will display a in the monitor

As you have guessed, assert helps you test variable.
You need to put a condion for assert.
If the condition is true, nothing is going to happen.
If it is false, assert terminates program, and tell us what the problem is.


If you  create couple hundreds lines of code, I am sure that you would like to test some critical parts of your work. Why don't you use assert?

In my opinion, assert is a good tool for c++ development.


If you disagree or have better tool, please let me know.

Saturday, January 30, 2016

c++ preprocessor

c preprocessor

When you work on c++ programming, we always start with #include<.....>

Like following

#include <iostream>
#include <string>

int main()
{
}

or if you look at someone else's coding, you have seen the following..

#define PI 3.14

#ifndef OBJECT_H_
     #define OBJECT_H_
#endif

These lines starting with # are preprocessor command.

Before I go further about these commands, let's talk about compiler little bit.

A compiler is not smart and does not understand C++ coding. C++ coding has to be translated into machine language(which is combination of 1 and 0). Compiler does the translation.

Before a compiler translates your c++ code, your code has be processed by a c preprocessor.
You can think the c preprocessor as a text replacer.


Let's look at the following c++ line.

       #include <iostream>
       #define PI 3.14

       using namespace std;

       int main()
      {
               cout << "hello world" ;
               cout << PI ;
      }

The above is very a simple piece of code.
It will print "hello world" and 3.14 in the monitor.
"cout" and "<<" are used to print "hello world"and 3.14.

#include <iostream> is a preprocessor command that add c++ standard library in you code

When a compiler to compile your c++ code, it does not compile the code right away.

Firstly, c++ preprocessor changes your code.

     #include <iostream>
     #define PI 3.14

     using namespace std;
 
     int main()
     {
         { system.out ~~~~~~~}   (operator~~~~~~~`) " hello world" ;
         { system.out ~~~~~~~}   (operator~~~~~~~`) 3.14
     }

strings "cout" and "<<" does not do anything. preprocessor reads these pieces and convert them to meaningful instances.
Whenever, preprocessor find PI, it will change PI to 3.14

Once preprocessor finish changing all instances, compiler compiles your code.


Think a c preprocessor as a tool that helps a compiler to compile your code.

The following is the actual compiling process.

c++ code ----> processed with  c++preprocessor  ----> compiled with compiler -------> executable.


If you think I misunderstand the concept, please feel free to talk to me or leave a comment.


















Saturday, December 12, 2015

c++ quick sort

C++ quick sort

Quick sort is a powerful sorting alogrithm. It sort a set in O( n*log(n) ).
There are bad cases which the big O can be O(n^2). However, the case is very very rare.

In order to understand this sorting algoritm, you need to know what recussive funtion..
If you do not know that, think it as using same function over and over; you will see that soon.

Let's say that you would like to sort the following array.


In order to do quick sort, you need two functions.

One is called, "quick_sort." The other one is called "partition."

"quick_sort" calls the "partition".
"partition" picks one of array value( only one!!!!), place the value to right position and return the position of value

quick_sort

void quickSort (int array[], int low, int high)
{
    if( low < high ){

        int position = partition(array, low, high);
        quickSort (array, low, position-1);
        quickSort (array, position+1, high);
    }
    else{
        return;
    }
}

array = arrary that you want to sort
low = first index of set
high = last index of set

The first thing the "quick_sort" is checking whether the low is smaller than high.
It may look strange to you. However, if you read the following three line, you can understand the reason.
After chekcing the condition, "quick_sort" calls "partition" to place one element to right poisiton in the array.
"partition" makes sure one element of the array is in the right position and returns position.

 The position is used to create two "quick_sort"s.

        quickSort (array, low, position-1);   -----> low = low           position -1 = high
        quickSort (array, position+1, high); ------> low = position+1     high = high

"quick_sort" makes "quick_sort"s over and over until low >= high.

This kind of function is called recursive function.


Then I will move on to the "partiton"

 Just like "quick_sort", you have three parameters arary, low high

        int partition (int array[], int low, int high)

 In the you need to have three tools
They are
               i = index
               j = index
               pivot_val = array[ high ]

It would be like this

 If array[i] > pivot_val, increment i by one
        6                3
If array[i] <= pivot_val, switch array[i] and array[j]. and increament i and j by one.
          1                3
        

Do this untill i reach "high"



Then switch array[i] and array[j]



The element is in yellow box is placed right postion. "partiton", returns the position of this.

"quick_sort" calls "quick_sort"s to work on The rest elemnt of array

The power of quick sort is that everytime, if place a element in right postiion, it it will breaks array into small pieces for sorting.

I made a sample code.

-------------------------------------------------------------------------------------------------------------------------
 The following is practice code. If you have suggestion or correction, please feel free to leave comment.

#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <ctime>
#include <cstdlib>


using namespace std;

void quickSort (int array[], int low, int high);
int partition (int array[], int lcow, int high);

int main()
{
    srand(time(0));
    int array[8] = {7,6,4,1,20,4,5,7};
    int len = 8;

    /*
    // populate array with random number between 0 to 99
    for (int i = 0; i < len; i++)
    {
        array[i] = rand()%100;
    }
    */
    quickSort(array, 0, 7);

    //print
    for (int i = 0; i < len; i++)
    {
        cout << array[i] << " ";
    }
    cout << endl;

}

/*
 * @brief Do quick sort (recursive function)
 * @param int array[] - array that is needed to be sort
 * @param int low - min bound for sorting
 * @param int high - max bound for sorting
 * @return void (call by referece) sorted array
 */
void quickSort (int array[], int low, int high)
{

    if( low < high ){
        // after the partition, val in the position is sorted
        int position = partition(array, low, high);
        cout << "position " << position << endl;
        // sort values before the postion
        quickSort (array, low, position-1);
        // sort values after the position
        quickSort (array, position+1, high);

    }

    // if row and high are same, there is nothing to be sorted, so termimate the sorting
    else{
        return;
    }
}

/*
 * @brief Pick a pivot and put it in the right position
 * @param int array[] - array that is needed to be sort
 * @param int row - min bound for sorting
 * @param int high - max bound for sorting
 * @return index of pivot
 */
int partition (int array[], int low, int high)
{
    int tmp;

    int pivot_val = array[high];

    int j=low;
    int i;
    for(i=low; i<high; i++){
        if( array[i] <= pivot_val ){
            tmp = array[i];
            array[i] = array[j];
            array[j] = tmp;
            j++;
        }
    }
    tmp  = array[high];
    array[high] = array[j];
    array[j] = tmp;

    return j;

}





 

Saturday, December 5, 2015

c++ bloom filter

c++ bloom filter

Lets say that you have a 500000 lengh sentence

AGEGIOGRNOSEGSNOERNKLDVLLKNFKJEGRKLN...................GAGEADGE

And you want to convince people that there aren''t AAGTR, GTAGG, GGGGG int the string.

One way is chekcing every 5 strings like following. If you do not find them, you can conclude that there are not AAGTR, GTAGG, GGGGG
 
AGEGI
   GEGIO
      EGIOG
          ...........................EADGE
This way works but it is not very good.
For every input you have to compare 499995 times

There is a much better way than this!!!!
You can use bloom filter and the complexity is O(1) ; actually close to O(1), it is not exectly O(1)

Bloom filter is a good data structure that tell you a object is definitly not in the the set of objects.
However, it only can tell you that the object is probably in the set of object; the object may not in the set of objects.

How to

1) First thing you have to do is making a binary array based on a set of object.

Make an int array. The length of array is determined by you. You need to make it big enough; small one is bad too. I will introduce a fomula leter....
If you make one, initialize to zero

2) Then you need obtain objects
If you use the above example, every 5 length of strings are considered as an object.

AGEGIOGRNOSEGSNOERNKLDVLLKNFKJEGRKLN...................GAGEADGE

AGEGI
   GEGIO
      EGIOG
          ...........................EADGE

Once you have a set of objects (5 length characters), apply several hash functions to get hash values for each objects.


The hash values are index of the array. If the array[hashval] is 0, change to 1

While you are putting 1 in the array, you will face hash collision.. like the following..




It is ok, leave them to be 1.


If the array contains all hash values of objects, you are ready to do the bloom filtering


3) let's do bloom filter
    If we use above example, we want to know "AAGTR, GTAGG, GGGGG" are not in the 500000 length sentence.
    Let's assume we use 4 hash functions to build the array.
 
    Use the 4 hash functions on every object
    Each object gives us 4 index of the map. Using the index, check the value of array.
    If any of them has 0, we can say that the object is not definitely in the set.
    If all of them are 1, w can say that the object is probably in the set
                                       index1               index2             index3              index4
    AAGTR                           1                       1                     1                         1
    AAGTR                           0                       1                     1                         1
    GGGGG                          0                        0                      0                        0

    AAGTR is probably in the 500000 length setence
    AAGTR, AAGTR is definitly not in the 500000 length setence.

Instead of looking at every 5 length string of 500000 length string, you just need to use 4 hash functions to figure out AAGTR, AAGTR is definitly not in the 500000 length setence.

It is really powerful tool. Isn't it??
Making the array is not very efficient. However, once you build one, search is super fast!!


Deciding the length of array and number of hash function are very important to create good bloom filter.

I research the fomula from the Internet.
People still do research to figure out the best way to do so..
If you have better one, can me tell me how??

p= probability of hash collioin you want to allow
n= number of objects you are planning to put in array
 
a = array length =  ( ( n ) * ln( p ) )  / (ln(2)^2)

number of hash function = ( a/n ) * ln(2)


I made a sample code.

-------------------------------------------------------------------------------------------------------------------------
 The following is practice code. If you have suggestion or correction, please feel free to leave comment.

#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <cstdlib>
#include <ctime>
#include <string>
#include <math.h>

using namespace std;

int hashfunction1 (string input);
int hashfunction2 (string input2);
int hashfunction3 (string input3);
int hashfunction4 (string input4);
void populateBloomFilter (int table[], int index);
void testBloomFilter(int table[], string input);



int main(int argc, char*argv[])
{
    string input="";
    string str="";
    char randomletter;

    srand((unsigned)time(NULL));

    //create string that has random char A to Z
    for (int i=0; i<500000; i++){

        // get random char
        randomletter = 'A' + (rand()%26);
        // convert string to char
        str = string(1, randomletter);
        // add string
        input += str;
    }

    // bloom filter table
    int bloomfilterTable [2000000];

    // initialize bloomfilter array to 0
    for (int i = 0; i < 2000000; ++i)
    {
        bloomfilterTable[i]=0;
    }

    //populate the table
    string tmp="";
    int hashVal1=0;
    int hashVal2=0;
    int hashVal3=0;
    int hashVal4=0;
    for (int i=0; i<499995; i++){

        tmp = input.substr(i,5);
        hashVal1 = hashfunction1(tmp);
        hashVal2 = hashfunction2(tmp);
        hashVal3 = hashfunction3(tmp);
        hashVal4 = hashfunction4(tmp);

        populateBloomFilter(bloomfilterTable, hashVal1);
        populateBloomFilter(bloomfilterTable, hashVal2);
        populateBloomFilter(bloomfilterTable, hashVal3);
        populateBloomFilter(bloomfilterTable, hashVal4);
    }


    //create random string with length 5
   
    for(int j=0 ; j<10; j++){
        string testInput="";
        for (int i=0; i<5;i++){
            // get random char
            randomletter = 'A' + (random()%26);
            // convert string to char
            str = string(1, randomletter);
            // add string
            testInput += str;
        }

        //test
        testBloomFilter(bloomfilterTable, testInput);
    }
}

int hashfunction1 (string input)
{
    double output=0;
    double tmp=0;
    for (int i = 0; i < input.length(); ++i)
    {
        output += (input[i]*pow(10,(double)i));
    }

    output +=9678939;
    output = fmod(output, 2000000);

    return (int) output;

}

int hashfunction2 (string input)
{
    double output=0;
    double tmp=0;
    for (int i = 0; i < input.length(); ++i)
    {
        output += (input[i]*pow(10,(double)i));
    }

    output +=11111111;
    output = fmod(output,2000000);

    return (int) output;
}

int hashfunction3 (string input)
{
    double output=0;
    double tmp=0;
    for (int i = 0; i < input.length(); ++i)
    {
        output += (input[i]*pow(10,(double)i));
    }

    output +=7865432;
    output = fmod(output, 2000000);

    return (int) output;
}

int hashfunction4 (string input)
{
    double output = 0;
    double tmp = 0;
    for (int i = 0; i < input.length(); ++i)
    {
        output += (input[i]%192*pow(10,(double)i));
    }

    output = fmod(output, 2000000);

    return (int) output;
}

void populateBloomFilter (int table[], int index)
{
    if( table[index] == 0 ){
        table[index] = 1;
    }

}

void testBloomFilter(int table[], string input)
{
    int val  = hashfunction1(input);
    int val2 = hashfunction2(input);
    int val3 = hashfunction3(input);
    int val4 = hashfunction4(input);

    bool checker1 = (table[val] == 1);
    bool checker2 = (table[val2] == 1);
    bool checker3 = (table[val3] == 1);
    bool checker4 = (table[val4] == 1);

    cout << checker1 << '\t' << checker2 << '\t' << checker3 << '\t'<<  checker4 << endl;
    // above checkers are all true, input is probably in the 2000000 string
    if (checker1 && checker2 && checker3 && checker4 == true ){
        cout << input << " is probably in the 2000000 string" << endl;
    }
    else{
        cout << input << " is definitly not in the 2000000 string" << endl;
    }


}

Tuesday, November 24, 2015

c++ makefile

C++ Makefile

C++ is object orient programming.
Each object file can be created by its own .cpp and .h file.

For instance, If you have the following files

main.cpp   -----------------> main.cpp uses a object, called person
person.cpp -----------------> person object source file
person.h. -------------------> person object header file

You need to compile these three files at once.

If you would like to compile using a terminal(linux, mac os), you can type it like following
         g++ main.cpp person.cpp

It is pretty simple. right?

But how about this??
You have 100 object files that are need to be comfile.

it would be  like
       g++ main.cpp a.cpp b.cpp c.cpp d.cpp ...................
It would be really painful.. YOU need to type it every file for every compiling.

Fortunately, there is a better option, you can use Makefile

Makefile tells make, a UNIX command, to communicate with c++ compiler to link and compile a program.
If you do not know UNIX, it is ok. Think it as the father of all operacting system(window, linux, mac os and many more....). To be honest, I do not know it well either hahahaha

If you use Makefile, you need to type the compiling command just onece.
After you create Makefile, you just need to type "make" in the terminal.

Format

target : dependency
    system command

Actually, it is really simple. If you remember these two lines, you can make Makefile pretty much.
One thing you need to remember is that this file is space sensitive. make sure you tab for the second line

Example

output: main.o Person.o                         3)
    g++ main.o Person.o -o output
main.o: main.cpp                                   1)
    g++ -c main.cpp
Person.o: Person.cpp Person.h              2)
    g++ -c Person.cpp
clean:                                                      4)
    rm *.o output


1) "g++ -c main.cpp" creates "main.o"

2) "g++ -c Person.cpp" create "Person.o". Person is an object, so you need to add .h and .cpp for depedency

3) after 1) and 2) are done, Makefile will work on making output. "-o output" makes the excutable's name "output"

4) it is optional feature. It is deletion of excutable  and other .o file. After you compile, and want to delete all .o files and output. You can type "make clean". It will do the job for you.

 *excutable = file that runs progream (it is like ".exe" for window and ".jar" for java)

---------------------------------------------------------------------------------------------------------------------------
 The following is practice code. If you have suggestion or correction, please feel free to leave comment.

To compile

make

To delete .o files and output(excutable)

make clean
 

There are four files


1) MakeFile

output: main.o Person.o
    g++ main.o Person.o -o output
main.o: main.cpp
    g++ -c main.cpp
Person.o: Person.cpp Person.h
    g++ -c Person.cpp
clean:
    rm *.o output

2) main.cpp

#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <string>


#include "Person.h"

using namespace std;

int main( int argc, char *argv[])
{
    Person a("Tom", "Jang");

    a.printInfo();


}

3) Person.h

#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <string>

using namespace std;

class Person
{
    private:
        string firstName;
        string lastName;
    public:
    Person();
    Person( string fName, string lName );
    void printInfo();

};

4) Person.cpp

#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <string>

#include "Person.h"

using namespace std;

Person :: Person()
{
    firstName = "empty";
    lastName = "empty2";
}

Person :: Person ( string fName, string lName )
{
    firstName = fName;
    lastName = lName;
}

void Person :: printInfo()
{
    cout << "firstName is " << firstName << endl;
    cout << "lastName is " << lastName << endl;
}




   

 

Saturday, November 21, 2015

c++ hashtable with linked list

C++ hashTable with linked list


Hash table is a data structure that allows use to search, edit, delete data nearlyO(1).
If you do not know what the O(1), it is ok.. keep it in you mind that it is super efficient.

Lets dive in with real example!!

let's say that you have list (you can think it as array).
Every element of this list contain name of person.
If you want to search Jane from the start of the list, you have to check every element.
Above example, we just need to check 100000 names.
Technically, it is quite fast if you are searching for 100000 names.
However, if the each elements contain lots of inforamtion, and the number of element is goes up to extrememly large numer, it is a different story. It wil take mintue, hour or more than that.

Let's look at the following.




This time, before we put names in the list, we use a function called hash function.
What the function does is that it takes name as input and return integer number.
This number is going to be the index of name in the list.

If we use this way, search can be done extremely fast!!

How?
If you want to know whether "john" is in the list, you can use hash function to the name. You will get the index, and you will get the result instantly, without looking at other names in the list.

It does not matter how many elements are in the list, you will get what you want from the list instantly

However, this hashfunction has problem...


The above example show that two different name ends with same index number.
If your list can handle onely one name, you are in trouble...
This is called hash collision.

There are severy way to avoid this.

My way is using linked list

There are different implimenation in using hash function with linked list.
My way is following...
    1) The very first element of list is empty
    2) get the hash val
    3) use hash val as index of list
    4) make linked list.

This way, it is not O(1)... but it is still close to O(1).


In the real world, there is no way that we can build hash function without hash collision.
It is like you are predict lottery number correctly every week..

Depend on number of data, field, culture and many other aspects, we  design the hash function differently; there is no right answer.

Lots of smart computer people, professors write research papers to find the best way of builiding hash function.


I made some sample coding for this topic.
---------------------------------------------------------------------------------------------------------------------------


The following is practice code. If you have suggestion or correction, please feel free to leave comment.

#include <iostream>
#include <functional> // i am using mac so i need this to use namespace std
#include <string>

using namespace std;

typedef struct student
{
    int id;
    string fName;
    string lName;
    student * nextStudent;
}Student;

void initTable (Student *table, int len);
void addStudent(int studentId, string firstName, string lastName, Student *table);
int hashFunction(int id);
void searchAndPrint(int studentId, Student *table);
void deleteStudent(Student *table, int len);

int main()
{
    Student * table = new Student [10];
    int len = 10;

    initTable(table, len);


    // insert
    addStudent(12345, "tony", "stark", table);
    addStudent(55555, "bruce", "wayne", table);

    // search and print
    searchAndPrint(55555, table);
    searchAndPrint(34524, table);


    cout << endl;
    // all students info are stored in dynamic memory
    // therefore you have to visit every single student info for the deletion
    // for those student info by the linked listed are deleted by the deleteStudent function
    // the table array is delete by delete []
    // pointers for table and pointers for student info are different pointer type.
    deleteStudent(table, len);

    delete []  table;


}

void initTable (Student *table, int len)
{
    for (int i = 0; i < len; ++i)
    {
        table[i].nextStudent = NULL;
    }
}

void addStudent(int studentId, string firstName, string lastName, Student *table)
{
    // using hash function, convert the id into hashed id
    int hashedID = hashFunction( studentId );   

        Student * pointer = &table[hashedID];

        while(pointer->nextStudent !=NULL){
            pointer = pointer->nextStudent;
        }

        // once we reach to the student who has NULL in nextStudent
        // add student
        Student *tmp = new Student;
        tmp->id = studentId;
        tmp->fName = firstName;
        tmp->lName = lastName;
        tmp->nextStudent = NULL;

        // link
        pointer->nextStudent = tmp;

}

int hashFunction(int id)
{
    int tmp = id;
    int ans =0;

        // add each digit's number ex) 12345 -> 1+2+3+4+5 = 15
        while( tmp != 0 ){
            ans += (tmp % 10);
            tmp /=10;
        }
        // the loop does not handle the last one
        // one more addition is required
        ans += (tmp % 10);
       
        // make the addtion into range of 0 to 9
        ans %= 10;

        return ans;
}

void searchAndPrint(int studentId, Student *table)
{
    // get the hashed id
    int hashedId = hashFunction(studentId);

    Student * tmp = &table[hashedId];

    // search for student
    while ( tmp -> nextStudent !=NULL){
        if (studentId == tmp->id){
            cout << "* found * " << studentId <<endl;
            cout << tmp -> fName << endl;
            cout << tmp -> lName << endl;
            return;
        }
        else{
            tmp = tmp -> nextStudent;
        }
    }

    // check the last one in the hash
    if (tmp ->id == studentId){
            cout << "* found * " << studentId << endl;
            cout << tmp -> fName << endl;
            cout << tmp -> lName << endl;
            return;
    }

    // if you cannot find even at the end
    // there is no such student in the hash
    else{
        cout << "* id is not found * " << studentId <<endl;
        return;
    }

}

void deleteStudent(Student *table, int len)
{

    for (int i = 0; i < len; i++)
    {
        Student* tmp = &table[i];
       
        // delete student info except the last one
        while ( tmp -> nextStudent != NULL){
            Student* tmp2 = tmp->nextStudent;
            tmp->nextStudent = tmp2->nextStudent;
            delete tmp2;
        }
    }



    cout << "deletrion successful" << endl;
   
}